Open Coding Agents: Fast, accessible coding agents that adapt to any repo
January 27, 2026
Ai2
Over the past year, coding agents have transformed how developers write, test, and maintain software. These systems can debug, refactor, and even submit pull requests—fundamentally changing what software development looks like. Yet despite this progress, most coding agents share the same constraints: they're closed, expensive to train, and difficult to study or adapt to private codebases.
Ai2 Open Coding Agents change that. Today we’re releasing not just a collection of strong open coding models, but a training method that makes building your own coding agent for any codebase – for example, your personal codebase or an internal codebase at your organization – remarkably accessible for tasks including code generation, code review, debugging, maintenance, and code explanation.
Closed models haven't seen your internal code, so they don't know it: custom data pipelines, internal APIs, org-specific conventions, and so on. Training on that private data teaches them, but generating agent-ready synthetic training data from private codebases has been challenging and cost-prohibitive. Our method makes it easy: reproducing the performance of the previous best open-source model costs ~$400 of compute, or up to $12,000 for performance that rivals the best industry models of the same size. This puts the full recipe within reach of labs and small teams.
Resource constraints drove us to maximize efficiency at every stage, from data quality to inference costs to model selection. The result: we match SWE-smith, a synthetic data method, at 57× lower cost and SkyRL, an open-source reinforcement learning (RL) system, at 26× lower cost.
The first release in our Open Coding Agents family is SERA (Soft-verified Efficient Repository Agents). The strongest – SERA-32B – solves 54.2% of SWE-Bench Verified problems, surpassing prior open-source state-of-the-art coding models of comparable sizes and context lengths while requiring at most 40 GPU days to train on just two NVIDIA Hopper GPUs or NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. SERA models are optimized for and compatible with Claude Code out of the box. With our fine-tuning method, you can quickly and cheaply specialize them to your own codebase, including your full engineering stack and conventions.
We collaborated with NVIDIA to optimize SERA inference for their accelerated infrastructure, ensuring researchers and developers can get the most out of these models in production environments. Early benchmarks are promising: running in BF16 precision on 4xH100 GPUs, SERA achieves approximately 1,950 peak output tokens per second with a 16k context window. At FP8 precision, SERA reaches 3,700 peak output tokens per second, a higher throughput with an almost negligible drop in accuracy. On next-generation Blackwell 4xB200 systems running in NVFP4, SERA scales further to around 8,600 peak output tokens per second.
Every component of this release is open – models, Claude Code integration, and training recipes – and everything can be launched with a single line of code, making it easy to use even for those without LLM training experience. We're also releasing state-of-the-art training data so researchers can inspect what worked, push it further, and do deep science while avoiding the stumbling blocks and dead ends typical of building coding agents.
One result we're especially excited about: SERA uniquely enables adapting to private datasets like internal codebases, and we see evidence that a smaller, open model can replicate and possibly even exceed the performance of a more capable "teacher" coding agent in these setups. For example, SERA-32B can surpass its 110B-parameter teacher (GLM-4.5-Air) on codebases like Django and SymPy after training on just 8,000 samples at a cost of $1,300.
Accessible open models can now inherit strong agentic behavior through a simple, reproducible pipeline—no large-scale RL infrastructure or engineering team required. Case in point: SERA was built largely by a single Ai2 researcher.
The challenge: specializing agents to your data
If you’re a small to mid-sized business or independent developer, you probably have code that works with customer data in ways no public model has ever seen. Training on that data would help, but generating agent-ready synthetic data from private codebases has been the hard part. The holy grail would be a method that yields state-of-the-art training data for any codebase, with minimal setup and clear evidence that the tuned model is actually learning agentic behavior versus fragile heuristics.
We tackle this challenge with our new post-training approach that achieves state-of-the-art open-source results on SWE-Bench at a fraction of the typical training costs. Three innovations make it both inexpensive and effective:
- Soft-verified generation (SVG). Synthetic data generation, which is key to training a strong coding agent, usually means producing pairs of incorrect and corrected code. From these pairs, the agent learns to turn broken code into working code by generating a patch with line-by-line changes. Normally, each example must be carefully tested to confirm that its fix is actually correct. Our main finding with SVG is that patches don't need to be fully correct to be useful: just as different code can lead to the same correct solution, SVG generates synthetic training data whose patches are only partially correct. This removes the need to thoroughly test for full correctness, which in turn eliminates the complex testing infrastructure and costly generation of precise examples. We show that this soft-verified data scales just like "hard-verified" training data (a minimal sketch of the generation loop follows this list).
- Scaling with a bug-type menu. To diversify data without becoming bottlenecked on finding real bugs, we draw from a taxonomy of 51 common bug patterns identified in prior analyses. For each function in a repository, we can generate multiple distinct bug-style prompts—so a repo with thousands of functions can yield tens of thousands of varied agentic trajectories at low cost.
- High simulated workflow fidelity. A key finding is that high-quality synthetic training data should mirror a developer's workflow rather than the precise details of correct code: data that reflects how a developer works through a problem matters more than data whose final code is correct. Combined with SVG, this insight enables repository training – generating training data for any code repository – making it straightforward to scale synthetic data generation massively.
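To make the generation loop concrete, here is a minimal sketch of how bug-style prompts drawn from a taxonomy could be turned into soft-verified trajectories. The bug menu shown, the prompt wording, and the `run_agent` and `soft_verify` helpers are illustrative assumptions rather than the released pipeline, which draws on the full 51-pattern taxonomy and a teacher coding agent.

```python
import random
from dataclasses import dataclass

# Illustrative subset of a bug-pattern taxonomy; the released recipe draws
# from 51 common bug patterns, so these four entries are placeholders.
BUG_MENU = [
    "off-by-one error in a loop boundary",
    "swapped argument order in a function call",
    "missing None/empty-input check",
    "wrong comparison operator in a conditional",
]

@dataclass
class Trajectory:
    function_name: str
    bug_prompt: str
    agent_steps: list   # tool calls and edits produced by the teacher agent
    patch: str          # final diff, possibly only partially correct

def make_bug_prompts(function_name: str, k: int = 3) -> list:
    """Sample k distinct bug-style prompts for one repository function."""
    kinds = random.sample(BUG_MENU, k=min(k, len(BUG_MENU)))
    return [
        f"Introduce a {kind} into `{function_name}`, then repair it step by step."
        for kind in kinds
    ]

def soft_verify(traj: Trajectory) -> bool:
    """Keep a trajectory if it looks like plausible agentic work (it took steps
    and produced a non-empty patch), without running the full test suite."""
    return bool(traj.agent_steps) and bool(traj.patch.strip())

def build_dataset(functions, run_agent):
    """run_agent(prompt) -> Trajectory is assumed to wrap a teacher coding agent."""
    dataset = []
    for fn in functions:
        for prompt in make_bug_prompts(fn):
            traj = run_agent(prompt)
            if soft_verify(traj):   # no expensive hard-verification step
                dataset.append(traj)
    return dataset
```

The point the sketch tries to capture is that the filter only checks whether a trajectory looks like plausible agentic work; it never requires the patch to pass the repository's tests, which is what keeps generation cheap.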
Together, these innovations mean that if you or your organization has a private codebase, you can use SERA to fine-tune a small model to strong performance on your data—easily and affordably. Instead of designing a complicated RL pipeline and test harness for every new task setting, you generate targeted synthetic data and run a straightforward supervised fine-tuning (SFT) job.
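As a rough illustration of how light that final step can be, the snippet below runs a standard SFT pass over generated trajectories with Hugging Face TRL. The dataset path, base model, and hyperparameters are placeholders chosen for the example, not the released SERA training configuration.

```python
# Minimal SFT sketch with Hugging Face TRL; paths, model choice, and
# hyperparameters are illustrative placeholders, not the SERA recipe.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Each JSONL record is assumed to hold one flattened agent trajectory
# (prompt, tool calls, and final patch serialized together) in a "text" field.
dataset = load_dataset("json", data_files="trajectories.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen3-8B",          # any open base model of your choice
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="sera-style-sft-checkpoint",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```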
State-of-the-art performance, accessible hardware
Using SERA, we've developed a family of models ranging from 8B to 32B parameters, all built on Qwen3 and trained up to 32K context length with the help of various teacher models. We expect the same recipe to keep improving as we scale to larger backbones and context lengths, but the key point is that the current pipeline is already cheap and feasible for anyone to run, customize, and iterate on today—opening up wide access and endless possibilities for future research.
Our efficient technique also enabled more precise science: because experiments were cheap, we could systematically disentangle the many factors that have made comparisons between agentic systems unreliable. That rigor drove rapid iteration, taking us from soft-verified generation to the full SERA approach.
When we align inference conditions for fair comparison, SERA performs competitively with leading open coding agents. At 32K context, SERA-32B achieves 49.5% ± 1.9% on SWE-Bench Verified, comparable to Devstral Small 2 (50.0% ± 1.3%) and GLM-4.5-Air (50.5% ± 1.3%). At 64K context, SERA-32B reaches 54.2% ± 1.4%—competitive with longer-context baselines.
Strong open-weight coding agents from industry, such as Devstral Small 2, are an important point of comparison. When we control for key variables, SERA-32B comes close: within ~0.5 points at 32K and ~4.9 points at 64K of Devstral Small 2, despite SERA being trained with pure SFT and never beyond 32K tokens, both of which put it at a disadvantage in longer-context evaluation.
We also explored how teacher strength affects results. GLM-4.6 yields our best numbers, but GLM-4.5-Air gets surprisingly close at lower cost. The gap between teachers becomes most meaningful in higher-compute regimes—suggesting that depending on your budget and target performance, a weaker (and cheaper) teacher can be the better overall choice, especially for early iterations.
To validate our synthetic data generation strategy, we tested repository-specific specialization on Django, SymPy, and Sphinx—the three largest repositories in SWE-Bench. Because these have actual test instances, we can quantify how well specialization works in practice. This serves as a proxy for the downstream use case we care most about: adapting to private codebases that may lack comprehensive tests or follow nonstandard structures.
The results are promising. Our specialized models – trained on 8,000 synthetic trajectories per repository – consistently match and often exceed the performance of the 100B+ parameter models we used as teachers. At 32K context, the specialized models achieve 52.23% on Django and 51.11% on SymPy, compared to GLM-4.5-Air's 51.20% and 48.89%. The gains are most pronounced on Django and SymPy, which together account for over 60% of all SWE-Bench problems.
These results highlight two crucial advantages of our method. First, specialization pays off: a 32B model fine-tuned to a specific codebase can match or surpass a 100B+ general-purpose teacher, delivering comparable performance at one-third the size with lower memory requirements, faster inference, and reduced operational costs. Second, simplicity scales: our SFT-only pipeline on an open base model is now competitive with heavily engineered, large-team efforts. Together, these findings lower the barrier to entry for researchers, make results easier to reproduce, and turn agentic coding progress into something the whole community can validate and build on.
Built for developers and researchers
Our release package includes everything needed to reproduce, test, and build on SERA, including a lightweight deployment that takes just two lines of code to launch an inference server. We've also developed a setup script and inference optimizations that make SERA directly compatible with Claude Code.
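For a sense of how a locally deployed model could be queried once the server is up, here is a minimal client sketch against an OpenAI-compatible endpoint. The port, model id, and serving stack are assumptions for illustration; the released SERA CLI and Claude Code setup script may expose a different interface.

```python
# Minimal client sketch against a locally hosted, OpenAI-compatible endpoint.
# The port and model id are assumptions; the SERA CLI may differ.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="sera-32b",  # placeholder id for a locally served SERA checkpoint
    messages=[
        {
            "role": "user",
            "content": "Find why `parse_config` drops empty sections and propose a patch.",
        },
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)
```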
A key difference from closed-weight systems is our commitment to openness and reproducibility:
- We release models, code, all generated agent data, and a full recipe to generate your own data so anyone can reproduce our results or customize them to new domains.
- Our training pipeline is intentionally simple—standard SFT on trajectories with no custom RL infrastructure needed.
- The total cost to reproduce the performance of the previous best open-source result is roughly $400 on commodity cloud GPUs, more than 25 times cheaper than many existing approaches that require complex distributed setups and still fall short on performance.
- The total cost to reproduce top open-weight industry models, such as Devstral Small 2, is only $12,000.
We believe bringing the cost of replicating strong coding agents down to a few hundred dollars will unlock research that simply wasn't possible before. Instead of being limited to a handful of well-funded labs, agentic coding can become a widely accessible practice.
Whether you're running locally on your hardware, deploying in the cloud, or fine-tuning on your own codebase, SERA delivers practical agentic coding within reach of developers, researchers, and small teams alike.
Models | Tech Report | SERA CLI | CLI on PyPI
