Benchmarks - The Batch | DeepLearning.AI (Page 2)

AI benchmark comparison chart showing Gemini 2.5 Pro, GPT-4.5, Claude, Grok, and others across science, math, code, and reasoning.

Machine Learning Research

Google Unveils Gemini 2.5: Google’s Gemini 2.5 Pro Experimental outperforms top AI models

Google’s new flagship model raised the state of the art in a variety of subjective and objective tests.

TabPFN neural network diagram showing synthetic training, prediction on real-world tabular data, and attention layers.

Machine Learning Research

Better Than Trees for Tabular Data: Transformers can outperform decision trees at predicting unlabeled spreadsheet cells

If you have a collection of variables that represent, say, a cancer patient and you want to classify the patient’s illness as likely cancer or not, algorithms based on decision trees, such as gradient-boosted trees, typically perform better than neural networks.

Table comparing Claude 3.7, 3.5, o1, o3-mini, DeepSeek R1, and Grok 3 Beta on reasoning, coding, tools, visuals, and math.

Machine Learning Research

Budget for Reasoning to the Token: Claude 3.7 Sonnet adds extended thinking mode

Anthropic’s Claude 3.7 Sonnet implements a hybrid reasoning approach that lets users decide how much thinking they want the model to do before it renders a response.

Table comparing GPT-4.5, GPT-4o, and o3-mini on GPQA, AIME 2024, MMLU, MMMU, and coding tests.

Machine Learning Research

OpenAI’s GPT-4.5 Goes Big: OpenAI releases GPT-4.5, its most powerful non-reasoning model and maybe its last

OpenAI launched GPT-4.5, which may be its last non-reasoning model.

Diagram of Localize-and-Stitch merging fine-tuned models by combining critical weights into one model.

Machine Learning Research

Better Performance From Merged Models: Localize-and-Stitch improves methods for merging and fine-tuning multiple models

Merging multiple fine-tuned models is a less expensive alternative to hosting multiple specialized models. But while model merging can deliver higher average performance across several tasks, it often results in lower performance on specific tasks. New work addresses this issue.

o1 Family Benchmarks comparing pass rates across AIME, Codeforces, and GPQA.

Machine Learning Research

Higher Reasoning: OpenAI debuts o1 and pro mode for $200/month

OpenAI launched not only its highly anticipated o1 model but also an operating mode that enables the model to deliver higher performance — at a hefty price.

MLE-Bench workflow showing competition steps for model training, testing, and leaderboard scoring.

Machine Learning Research

When Agents Train Algorithms: OpenAI’s MLE-bench tests AI coding agents

Coding agents are improving, but can they tackle machine learning tasks?

COMPL-AI workflow diagram showing compliance steps for AI models under the EU AI Act.

Tech & Society

Does Your Model Comply With the AI Act?: COMPL-AI study measures LLMs’ compliance with EU’s AI Act

A new study suggests that leading AI models may meet the requirements of the European Union’s AI Act in some areas, but probably not in others.

Cartoon of a ghost helping a professor answer Halloween trivia questions on a chalkboard, with students watching.

Machine Learning Research

Benchmark Tests Are Meaningless: The problem with training data contamination in machine learning

The universe of web pages includes correct answers to common questions that are used to test large language models. How can we evaluate new models if they’ve studied the answers before we give them the test?

Comparison table of pre-trained models like Mistral, Llama, and Gemma, showcasing performance across evaluation metrics.

Machine Learning Research

Mistral AI Sharpens the Edge: Mistral AI unveils Ministral 3B and 8B models, outperforming rivals in small-scale AI

Mistral AI launched two models that raise the bar for language models with 8 billion or fewer parameters, small enough to run on many edge devices.

Machine Learning Research

Models Ranked for Hallucinations: Measuring language model hallucinations during information retrieval

How often do large language models make up information when they generate text based on a retrieved document? A study evaluated the tendency of popular models to hallucinate while performing retrieval-augmented generation (RAG).

Tech & Society

Image Generators in the Arena: Text-to-image generators face off in arena leaderboard by Artificial Analysis

An arena-style contest pits the world’s best text-to-image generators against each other.