Machine Learning Research - The Batch | DeepLearning.AI (Page 16)

QwQ-32B vs DeepSeek-R1 AI model performance benchmark across AIME24, LiveCodeBench, LiveBench, IFEval, and BFCL datasets.

Machine Learning Research

Compact Reasoning: QwQ-32B challenges DeepSeek-R1 and other larger reasoning models

Most models that have learned to reason via reinforcement learning were huge models. A much smaller model now competes with them.

Table comparing Claude 3.7, 3.5, o1, o3-mini, DeepSeek R1, and Grok 3 Beta on reasoning, coding, tools, visuals, and math.

Machine Learning Research

Budget for Reasoning to the Token: Claude 3.7 Sonnet adds extended thinking mode

Anthropic’s Claude 3.7 Sonnet implements a hybrid reasoning approach that lets users decide how much thinking they want the model to do before it renders a response.

Table comparing GPT-4.5, GPT-4o, and o3-mini on GPQA, AIME 2024, MMLU, MMMU, and coding tests.

Machine Learning Research

OpenAI’s GPT-4.5 Goes Big: OpenAI releases GPT-4.5, its most powerful non-reasoning model and maybe its last

OpenAI launched GPT-4.5, which may be its last non-reasoning model.

Table comparing AI models on throughput, HumanEval, MBPP, EvalPlus, MultiPL-E, and code completion.

Machine Learning Research

Text Generation by Diffusion: Mercury Coder uses diffusion to generate text

Typical large language models are autoregressive, predicting the next token, one at a time, from left to right. A new model hones all text tokens at once.

Diagram of Coconut, a method training LLMs to process thought chains as vectors, comparing it to Chain-of-Thought (CoT).

Machine Learning Research

Reasoning in Vectors, Not Text: Meta introduces Chain of Continuous Thought (Coconut) to improve next-token prediction

Although large language models can improve their performance by generating a chain of thought (CoT) — intermediate text tokens that break down the process of responding to a prompt into a series of steps.

A person typing a prompt in an AI-powered mobile app with a button to improve the input.

Machine Learning Research

Mobile Apps to Order: Replit’s agent-powered mobile app expands to full app development

Replit, an AI-driven integrated development environment, updated its mobile app to generate further mobile apps to order.

AI model comparison on reasoning and test-time compute across math, science, and coding benchmarks.

Machine Learning Research

Grok 3 Scales Up: Grok 3, xAI’s new model family, improves on its predecessors, adds reasoning

xAI’s new model family suggests that devoting more computation to training remains a viable path to building more capable AI.

Diagram showing GPT-4o with and without search, highlighting task execution success and failure.

Machine Learning Research

Tree Search for Web Agents: How tree search improves AI agents’ ability to browse the web and complete tasks

Browsing the web to achieve a specific goal can be challenging for agents based on large language models and even for vision-language models that can process onscreen images of a browser.

AI model leaderboard comparing performance across tasks like math, vision, and document analysis.

Machine Learning Research

Alibaba’s Answer to DeepSeek: Alibaba debuts Qwen2.5-VL, a powerful family of open vision-language models

While Hangzhou’s DeepSeek flexed its muscles, Chinese tech giant Alibaba vied for the spotlight with new open vision-language models.

ChatGPT interface drafting a research report on retail trends, including AI, e-commerce, and inflation.

Machine Learning Research

Agents Go Deep: OpenAI’s Deep Research agent generates detailed reports by analyzing web sources

OpenAI introduced a state-of-the-art agent that produces research reports by scouring the web and reasoning over what it finds.

Diagram illustrating Moshi’s use of an LLM to process user audio input, inner monologue, and output.

Machine Learning Research

Okay, But Please Don’t Stop Talking: Moshi, an open alternative to OpenAI’s Realtime models for speech

Even cutting-edge, end-to-end, speech-to-speech systems like ChatGPT’s Advanced Voice Mode tend to get interrupted by interjections like “I see” and “uh-huh” that keep human conversations going. Researchers built an open alternative that’s designed to go with the flow of overlapping speech.

Line charts showing performance improvements in math and science with 2.0 Flash Thinking models.

Machine Learning Research

Gemini Thinks Faster: Google’s Gemini 2.0 Flash Thinking advances in reasoning, outperforms DeepSeek-R1

Google updated the December-vintage reasoning model Gemini 2.0 Flash Thinking and other Flash models, gaining ground on OpenAI o1 and DeepSeek-R1.