Benchmarks - The Batch | DeepLearning.AI

A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

Machine Learning Research

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that tests an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agents’ software-engineering performance in more challenging ways.

Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Machine Learning Research

Claude Fable 5’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.

The chart compares AI benchmark efforts with employment and capital in U.S. job sectors, highlighting discrepancies.

Machine Learning Research

Toward Agent Benchmarks That Reflect Human Work: AI agents may not be getting better at the full range of economically valuable labor

AI agents seem to be increasingly capable of performing economically valuable tasks, but current benchmarks measure this capability only narrowly.

GPT-5.5 leads in Terminal-Bench 2.0 with 82.7% score, highlighting performance contrast against competitors.

Machine Learning Research

GPT-5.5 Outperforms, Hallucinates: OpenAI’s latest model tops leaderboards for coding, visual puzzles, and overall intelligence

The latest update of OpenAI’s flagship model sets new states of the art in important benchmarks but has difficulty distinguishing between what it does and doesn't know.

Machine Learning Research

Qwen3.5 Outperforms Bigger Models, Leads Vision Benchmarks: Alibaba’s latest flagship models are open-weights MoE performers in sizes from less than 1B parameters

The Qwen3.5 family of open-weights vision-language models includes impressive larger models as well as a smaller one that outperforms an OpenAI open-weights model 10 times its size.

AI models’ performance shown in bars; GPT-5.2 highest at 51, reflecting updated benchmarks.

Machine Learning Research

Artificial Analysis Revamps Intelligence Index: Independent AI testing authority turns from saturated knowledge benchmarks to harder business tests

Artificial Analysis, which tests AI systems, updated the component evaluations in its Intelligence Index to better reflect large language models’ performance in real-world use cases.

A table compares GPT-5.2's benchmark scores to Claude Opus 4.5 and Gemini 3 Pro in various reasoning tasks.

Machine Learning Research

OpenAI’s Answer to Gemini 3: GPT-5.2 arrives, touting variable reasoning and coding performance

OpenAI launched GPT-5.2 only weeks after its CEO Sam Altman reportedly issued a “code red” alarm in response to Google's Gemini 3.

Flowchart showing Tiny Recursive Model process with stages: input, prediction, and latent refinement.

Machine Learning Research

Small Models Solve Hard Puzzles: Tiny Recursive Model beats larger competitors at games like Sudoku and Maze

Large language models often fail at puzzles like Sudoku, for which a solution includes multiple elements and a single mistake invalidates all of them. Researchers showed that a tiny network, by repeatedly refining its solution, can solve this sort of puzzle well.

Table highlights Opus 4.5’s superior scores in coding and reasoning compared to other AI models.

Machine Learning Research

Claude Does More With Fewer Tokens: Claude Opus 4.5 retakes the coding crown at one-third the price of its predecessor

Claude Opus 4.5, the latest version of Anthropic’s flagship model, extends the earlier version’s strengths in coding, computer use, and agentic workflows while generating fewer tokens.

Table shows Gemini 3 Pro leading in benchmarks, outperforming Gemini 2.5, Claude Sonnet 4.5, and GPT-5.1.

Machine Learning Research

Google Dominates Arena Leaderboards (For the Moment): Gemini 3 Pro and Nano Banana Pro boast best-in-class multimodal reasoning and image generation

Google introduced Gemini 3 Pro and Nano Banana Pro, its flagship vision-language and image-generation models, and deployed them to billions of users worldwide.

Series of graphs transformed via tokenization and transformer layers, resulting in predicted outputs.

Machine Learning Research

Forecasting Multiple Time Series: Amazon’s Chronos-2 sorts out tangled variables to make better predictions

Transformers are well suited to predicting future values of time series like energy prices, wages, or weather, but, as in those examples, multiple time series often influence one another. Researchers built a model that can forecast multiple time series simultaneously.

Energy-Based Transformer refines predictions step by step, lowering energy for higher context compatibility.

Machine Learning Research

Transformers Energized: Energy-Based Transformers (EBTs) use gradient descent to gradually predict the next token

A new type of transformer can check its work. Instead of guessing the next output token in one shot like a typical transformer, it starts with a rough version of the token and improves it step by step.