Introducing CodeScientist: A step toward automated scientific discovery
Peter Jansen / March 31, 2025
CodeScientist is a new AI system designed to push the boundaries of autonomous scientific discovery. Scientists in all fields have long dreamed of a computer system capable of identifying gaps in scientific knowledge and automatically generating, testing, and validating new experiments to help fill them. This is the spirit in which we developed the system just unveiled in our paper, “CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation.”
In large-scale experiments focused on AI agents and virtual environments, CodeScientist produced 19 potential discoveries, six of which were validated by human experts as meeting the threshold for scientific soundness and novelty. While incremental in nature, these findings are among the first scientific discoveries made by an autonomous AI system, and they signal a shift toward broader, more creative AI-driven scientific exploration.
While autonomous scientific discovery systems still face challenges—such as ensuring reliability and evaluating discoveries effectively—CodeScientist is a promising next step.
How does it work?
To start the process, a human scientist provides CodeScientist with a set of papers and some short example code blocks that the system can use as it builds its experiments.
CodeScientist builds experiments like constructing a house—but instead of wood and bricks, it uses these "codeblocks" to handle common tasks like calling language models, creating plots, and running statistical analyses. Other scientific discovery systems tend to either modify a finished house or start from nothing, which can lead to easily avoidable mistakes in basic tasks like calling an LLM. Our "brick and lumber" approach lets CodeScientist focus on experiment design without getting bogged down in the details.
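To make this concrete, here is a minimal sketch of what one of these reusable codeblocks might look like, in this case a thin wrapper for querying a language model with retries. The function name, parameters, and stand-in model are illustrative assumptions, not the actual codeblock library that ships with CodeScientist.

```python
# Hypothetical example of a reusable "codeblock": a small wrapper for calling
# a language model with retries. Names and structure are illustrative only.
import time
from typing import Callable

def call_llm(prompt: str,
             send_fn: Callable[[str], str],
             max_retries: int = 3,
             retry_delay_sec: float = 2.0) -> str:
    """Send a prompt to a language model, retrying on transient failures."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return send_fn(prompt)
        except Exception as err:  # e.g., a rate limit or network error
            last_error = err
            time.sleep(retry_delay_sec * (attempt + 1))
    raise RuntimeError(f"LLM call failed after {max_retries} attempts") from last_error

# Usage with a stand-in model (swap in a real API client in practice):
if __name__ == "__main__":
    echo_model = lambda p: f"[model response to: {p}]"
    print(call_llm("Summarize the experiment results.", echo_model))
```

Because blocks like this are already written and vetted, the experiment-building step can compose them rather than regenerate basic plumbing from scratch.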
CodeScientist follows a five-step workflow:
- Ideation: The system reviews a curated set of research papers and a library of vetted code snippets. It then produces a range of candidate ideas from new tasks and metrics to different agent designs.
- Planning: Each idea is converted into a detailed plan. This plan outlines the experimental design and identifies the code blocks needed to implement the idea.
- Experiment Construction & Execution: The system uses an iterative "generate–execute–reflect" cycle to write Python code, run the experiment in a controlled environment, and debug the code until the experiment appears to run correctly (a simplified sketch of this cycle appears after this list).
- Reporting: After an experiment runs successfully, CodeScientist automatically creates a report that summarizes the experimental results and states whether the hypothesis was supported, rejected, or remains inconclusive.
- Meta-Analysis: To address the variability in the results, the system repeats each experiment several times and compares the outcomes to assess their reliability.
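As a rough illustration of the generate–execute–reflect cycle from the construction step, the sketch below shows the overall control flow. The helper functions are placeholders standing in for language-model calls and sandboxed execution; they are assumptions for illustration, not CodeScientist's actual interfaces.

```python
# Simplified, hypothetical sketch of a generate-execute-reflect loop.
# generate_code() and reflect() are stand-ins for LLM calls; the real system
# would also sandbox execution and track costs, which is omitted here.
import subprocess
import sys
import tempfile

def generate_code(plan: str, feedback: str) -> str:
    """Stand-in for an LLM call that writes or revises experiment code."""
    return f"print('placeholder experiment for: {plan}')"

def reflect(stdout: str, stderr: str) -> str:
    """Stand-in for an LLM critique of a run; 'OK' means it looks correct."""
    return "OK" if not stderr else f"Fix this error:\n{stderr}"

def run_experiment(plan: str, max_iterations: int = 5) -> str:
    feedback = ""
    for _ in range(max_iterations):
        code = generate_code(plan, feedback)                      # generate
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],           # execute
                                capture_output=True, text=True, timeout=600)
        feedback = reflect(result.stdout, result.stderr)          # reflect
        if feedback == "OK":
            return result.stdout
    raise RuntimeError("Experiment could not be debugged within the budget.")

if __name__ == "__main__":
    print(run_experiment("compare two agent memory designs"))
```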
Discoveries in practice
To test CodeScientist, we gave it about 50 papers broadly in the area of agents and virtual environments and had it generate a long list of potential experiments. A researcher picked out the 50 most promising ideas, and CodeScientist built and ran those experiments. CodeScientist flagged 19 of them as potential discoveries. We carefully reviewed the experiment reports, results, and code that CodeScientist generated, and found that six of these appear to meet a minimum bar for scientific soundness and incremental novelty. Here are its six proposed discoveries:
- World Modeling Confidence: When using a language model as a simulation of the world, the system discovered that the model's self-assessed confidence in its modeling predictions has a low correlation with its actual accuracy.
- Simpler Representations Improve Prediction in World Modeling: Experiments indicated that when the AI uses simpler state representations (like binary values) rather than complex text, its predictions tend to be more accurate.
- Stepwise Virtual Environment Generation Benefits Fidelity: Creating virtual environments in multiple stages (that focus on specific components) resulted in higher quality benchmarks compared to generating an environment in a single pass.
- Difficulty with Combinatorial Optimization: CodeScientist observed that language models struggle with problems that require selecting the best combination of elements to approximate a target value—a task it grounded in electronics, where a single resistor must be substituted with a combination of several others (a toy version of this task is sketched after this list).
- Action Prediction Challenges: In simulated environments, the system found that the AI's ability to predict whether an action will succeed is only moderately better than random guessing when it has limited information to base that prediction on.
- Graph-Based Memory Enhances Performance: In a meta-scientific twist, an experiment showed that augmenting an AI agent with a graph-based memory improves its performance in DiscoveryWorld, a game about scientific discovery.
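To give a flavor of the combinatorial optimization task mentioned above, the toy example below brute-forces the underlying problem: finding which combination of on-hand resistors, wired in series, best approximates a target resistance. The specific values and the series-only simplification are illustrative assumptions, not the exact task CodeScientist generated.

```python
# Toy version of the resistor-substitution task: choose up to a few resistors
# whose series combination (a simple sum) best approximates a target resistance.
# The values and series-only simplification are illustrative assumptions.
from itertools import combinations

def best_series_substitute(target_ohms, available, max_parts=3):
    """Brute-force search over small combinations of available resistors."""
    best = None  # (absolute error, chosen resistor values)
    for k in range(1, max_parts + 1):
        for combo in combinations(available, k):
            error = abs(sum(combo) - target_ohms)
            if best is None or error < best[0]:
                best = (error, combo)
    return best

if __name__ == "__main__":
    stock = [100, 220, 330, 470, 680, 1000, 2200]  # ohms on hand
    error, combo = best_series_substitute(1500, stock)
    print(f"Closest series combination to 1500 ohms: {combo} (off by {error} ohms)")
```

A brute-force search solves this toy instance instantly; the observation above is that language models often fail to pick the best combination reliably.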
For more detailed information on these discoveries, please see our CodeScientist paper.
While these discoveries are quite incremental in their novelty and modest in their technical sophistication, they suggest that the automated scientists of tomorrow might discover progressively more novel and useful research results with a minimum of human effort.
Addressing the challenges
CodeScientist faces several challenges:
- Human Effort: While CodeScientist can run fully automatically, adding human effort at key points—such as selecting the most promising ideas, or spending a few moments commenting on ideas to nudge the system in slightly better directions—makes it a much more efficient scientist.
- Improving Experiment Generation: Using language models to generate code for arbitrary experiments is still very challenging. More than half of CodeScientist's experiments end in failure due to debugging challenges rather than inconclusive scientific results.
- Methodological Rigor: Language models are still like junior scientists, learning the best practices of doing research. For now, a human expert needs to manually review the generated reports, much like peer review for scientific articles, to make sure the language model didn't cut corners.
Despite these challenges, CodeScientist serves as a useful starting point. It demonstrates both the potential and the complexities of automating scientific discovery.
Looking ahead
CodeScientist is not a complete solution for automated scientific discovery; its findings are mostly incremental, and build on a number of recent innovations in this space, including The AI Scientist, Agent Laboratory, and data-to-paper. However, it represents an important step toward a more automated research process—particularly one where the research explores broader domains of science instead of focusing on solutions to highly specific problems (e.g., protein folding for AlphaFold). With further development, these systems might be able to address more complex questions across various scientific fields and lead to truly impactful results.
CodeScientist is open source, and we encourage researchers and developers to explore, contribute, and help refine this approach. By making automated scientific discovery more accessible, we hope to support continued innovation in research.
Interested in learning more or contributing? Visit the CodeScientist project on GitHub.
Conclusion
CodeScientist is an early but promising contribution in the effort to automate the process of scientific discovery. By combining literature-based ideation with code-based experimentation, it provides a foundation for future systems that could help drive research forward. While there is still much work to be done, this project represents a modest step toward integrating automation into scientific practice.
Read the full paper on arXiv.
If you have feedback or collaboration ideas in this area, reach out to press@allenai.org.