Training Data for Coding Assistants: Stanford and Alibaba build bug-fixing dataset and pipeline to train AI

A bottleneck in fine-tuning large language models for software engineering is building a dataset that can show them how to edit code, search for subroutines, write test scripts, control a terminal, manage a file system, and so on. Researchers built a pipeline that produces such data automatically.

What’s new: John Yang and colleagues at Stanford, Princeton, and Alibaba introduced SWE-smith, a method that generates realistic examples of bug fixes and other code alterations. The code, dataset, and a model fine-tuned on the data are freely available for commercial and noncommercial uses.

Key insight: Automated unit tests determine whether code does what it’s supposed to do. Code that doesn’t pass a unit test has a bug, so one way to generate bug-fix examples is to start with code that passes a unit test and modify it until it doesn’t. Another is to start with working code and revert to previous versions that contain bugs or lack desired features. Once bugs have been introduced either way, an LLM can be prompted to eliminate them, producing valid before-and-after examples that don’t require manual validation.
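In code, that loop looks roughly like the sketch below. This is a minimal illustration of the idea rather than the authors’ pipeline: the pytest invocation, directory handling, and mutate callback are assumptions.

```python
# Sketch: inject a candidate bug and keep it only if the existing tests catch it.
# Not the authors' code; repo layout and pytest usage are assumed for illustration.
import shutil
import subprocess
import tempfile
from pathlib import Path

def tests_pass(repo_dir: Path) -> bool:
    """Run the repository's unit tests; return True if every test passes."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

def make_bug_example(repo_dir: Path, target_file: str, mutate):
    """Apply `mutate` (str -> str) to one file; keep the pair only if tests now fail."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "repo"
        shutil.copytree(repo_dir, work)
        path = work / target_file
        fixed = path.read_text()
        buggy = mutate(fixed)
        path.write_text(buggy)
        if tests_pass(work):  # the change didn't break anything, so it isn't a usable bug
            return None
        return {"file": target_file, "buggy": buggy, "fixed": fixed}
```

Because the existing unit tests act as the judge, no person needs to confirm that each injected change really is a bug.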

How it works: The authors started with 128 GitHub repositories of Python code. 

  • For each repository, the authors automatically built a Docker execution environment using SWE-agent, an open-source software engineering agent they built in earlier work. 
  • They synthesized bugs via four methods: (i) OpenAI o3-mini introduced bugs into functions or classes, (ii) a custom program altered code procedurally, for example by deleting loops or switching the order of lines, (iii) the authors combined these bugs to create more complex problems, and (iv) they reverted pull requests to re-introduce previously fixed bugs or remove previously added features. (Sketches of methods (ii) and (iv) appear after this list.)
  • They validated bugs by running unit tests and kept examples in which the buggy code failed one or more tests.
  • To generate examples of multi-step bug fixes, they prompted SWE-agent, using Claude 3.5 Sonnet, Claude 3.7 Sonnet, or GPT-4o, to fix the bugs over several steps.
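To give a flavor of method (ii), here is a hedged sketch of one procedural mutation built on Python’s ast module. The particular transformation (emptying a loop body) and the helper names are illustrative and aren’t taken from SWE-smith.

```python
# Sketch of a procedural mutation: gut the first for-loop so the code still runs
# but no longer computes the right answer. Requires Python 3.9+ for ast.unparse.
import ast

class DropLoopBody(ast.NodeTransformer):
    """Replace the body of the first `for` loop encountered with `pass`."""
    def __init__(self):
        self.done = False

    def visit_For(self, node):
        if not self.done:
            node.body = [ast.Pass()]
            self.done = True
        return node

def mutate_source(source: str) -> str:
    tree = DropLoopBody().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

buggy = mutate_source(
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s += x\n"
    "    return s\n"
)
print(buggy)  # the loop now does nothing, so total() always returns 0
```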
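Method (iv) amounts to rolling back merged pull requests. Here is a minimal sketch using plain git, assuming the merge commit to revert is already known; re-running the test suite afterward (as in the earlier sketch) confirms that the revert actually breaks something worth fixing.

```python
# Sketch: revert a merged pull request in a working copy, re-introducing whatever
# bugs it fixed or removing whatever features it added. The commit SHA is assumed.
import subprocess
from pathlib import Path

def revert_pr_merge(repo_dir: Path, merge_commit_sha: str) -> None:
    """Stage a revert of the given merge commit without committing it."""
    subprocess.run(
        ["git", "revert", "--no-commit", "-m", "1", merge_commit_sha],
        cwd=repo_dir,
        check=True,  # raise if the revert conflicts, so the example can be skipped
    )
```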

Results: The authors fine-tuned Qwen 2.5 Coder-32B on 5,000 examples, focusing on bugs produced by methods (i) and (iv) above, which they found most effective. To represent a diversity of bugs, they kept no more than three example fixes for any given bug. Paired with SWE-agent, their model solved 40.2 percent of the software engineering problems in SWE-bench Verified in one attempt. For comparison, paired with the OpenHands agentic framework, the same-size R2E-Gym-32B (fine-tuned on different data) and the much bigger Qwen3-235B-A22B (not fine-tuned) solved 34.4 percent in one attempt.
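The three-fixes-per-bug cap is simple to picture in code. Below is a minimal sketch, assuming each fix trajectory carries a bug identifier (the field name is hypothetical).

```python
# Sketch: keep at most three successful fix trajectories per bug to preserve diversity.
from collections import defaultdict

def cap_per_bug(trajectories, max_per_bug=3):
    """trajectories: iterable of dicts with a 'bug_id' key (field name assumed)."""
    kept, counts = [], defaultdict(int)
    for traj in trajectories:
        if counts[traj["bug_id"]] < max_per_bug:
            kept.append(traj)
            counts[traj["bug_id"]] += 1
    return kept
```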

Why it matters: Previous datasets for fine-tuning LLMs on coding tasks are small, often comprising thousands of training examples drawn from fewer than a dozen repositories. The authors’ method can produce such data at scale, potentially enabling major developers to improve their AI-assisted coding models and everyone else to build better systems.

We’re thinking: AI-assisted coding is revolutionizing software development, and the tools are still evolving. The ability to produce effective training data at scale is likely to further accelerate progress in this area, which is already moving at breakneck speed!