Agents Write Code Faster, Cheaper: Software developers used more versatile AI-powered tools to write code
Coding apps moved beyond autofill-style code completion to agentic systems that manage a wide range of software development tasks.
What happened: Coding emerged as the application of agentic workflows with the most immediate business value. Claude Code, Google Gemini CLI, OpenAI Codex, and other apps turned coding agents into one of Big AI’s fiercest competitive battlegrounds. Smaller competitors developed their own agentic models to remain in the game.
Driving the story: When Devin, the pioneering agentic code generator, arrived in 2024, it raised the state of the art on the SWE-Bench benchmark of coding challenges from 1.96 percent to 13.86 percent. In 2025, coding agents that use the latest large language models routinely completed more than 80 percent of the same tasks. Developers embraced increasingly sophisticated agentic frameworks that enable models to work with agentic planners and critics, use tools like web search or terminal emulation, and manipulate entire code bases.
- When reasoning models arrived late in 2024, they immediately boosted coding power and cut costs, as reasoning enabled agents to map out tasks to be completed by less expensive models. The addition of variable reasoning budgets made it easier for agents to use a single model, devoting more tokens to planning and fewer to making simple edits. By the end of 2025, Gemini 3 Pro, Claude Opus 4.5, and GPT-5.2 emerged as top models for coding and agentic workflows.
- Open-weights models quickly followed. Z.ai GLM-4.5 and Moonshot Kimi K2 emerged as open-weights favorites, enabling automated-coding startups to cut their costs dramatically. Qwen3-Coder, a hefty 480-billion-parameter model trained on over 5 trillion tokens of code and released in July, nearly matched the performance of Claude Sonnet 4.
- Anthropic wrapped an agentic framework around Claude to create an application: Claude Code. Introduced in February, Claude Code was an instant hit and set expectations for what agentic coding systems should do. OpenAI responded with its Codex application based on coding-specialized versions of its GPT-5 series. Where Claude Code initially ran locally, the Codex app ran in a browser, helping to popularize coding agents that run in the cloud. By the end of the year, these agents were able to manage longer-running problems using multiple sub-agents — typically an initializer to start tasks and track progress and various coding agents to complete different tasks — each with its own context window.
- A tug-of-war between model makers and developers of integrated development environments (IDEs) led popular IDE providers like Anysphere (Cursor) and Cognition AI (Windsurf) to build their own models. Conversely, Google built its own IDE, Antigravity, which debuted in November.
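The sub-agent pattern described above can be sketched in a few lines: an initializer plans a long-running job, and each coding sub-agent then works in its own isolated context window. This is a minimal illustrative sketch, not any vendor's actual API; the class names and the stand-in `call_model` function are assumptions made for the example.

```python
# Hypothetical sketch of a multi-sub-agent coding workflow: an initializer
# splits the job into tasks, and each coding sub-agent runs with its own
# private context window. `call_model` is a placeholder, not a real API.
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    role: str
    context: list = field(default_factory=list)  # this agent's own context window

    def run(self, prompt: str) -> str:
        self.context.append({"role": "user", "content": prompt})
        reply = call_model(self.context)  # stand-in for a model call
        self.context.append({"role": "assistant", "content": reply})
        return reply

def call_model(context):
    # Placeholder: echo the last instruction instead of calling a real model.
    return f"done: {context[-1]['content']}"

# The initializer plans the job and tracks progress; each task then goes
# to a fresh coding sub-agent so contexts never bleed into one another.
initializer = SubAgent("initializer")
plan = initializer.run("Split 'add OAuth login' into independent tasks")
tasks = ["write auth module", "add tests", "update docs"]
results = [SubAgent("coder").run(task) for task in tasks]
```

Keeping a separate context per sub-agent is the design point the article highlights: no single window has to hold the whole history of a long-running job.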
Behind the news: Agentic systems steadily ratcheted up the state of the art on the popular SWE-Bench coding benchmark, and researchers looked for alternate ways to evaluate their performance.
- These efforts led to SWE-Bench Verified, SWE-Bench Pro, LiveBench, Terminal-Bench, 𝜏-Bench, CodeClash, and others.
- Because different providers trust (or cherry-pick) different benchmarks, it has become more difficult to evaluate agents’ performance. Choosing the right agent for a particular task remains a challenge.
Yes, but: At the beginning of 2025, most observers agreed that agents were good for generating run-of-the-mill code, documentation, and unit tests, but experienced human engineers and product managers performed better on higher-order strategic problems. By the end of the year, companies reported automating senior-level tasks. Microsoft, Google, Amazon, and Anthropic said AI was generating a growing share of their own code.
Where things stand: In a short time, agentic coding has propelled vibe-coding from puzzling buzzword to burgeoning industry. Startups like Lovable, Replit, and Vercel enable users who have little or no coding experience to build web applications from scratch. While some observers worried that AI would replace junior developers, it turns out that developers who are skilled at using AI can prototype applications better and faster. Soon, AI-assisted coding may be regarded as simply coding, just as spellcheck and auto-complete are part of writing.