
Machine Learning Research
Better Video, Fewer Tokens: STORM Processes Fewer Tokens And Still Beats GPT-4o On Video Understanding Benchmarks
Researchers reduced the number of tokens needed to represent video frames to be fed to a transformer.
Business
Renowned investment analyst Mary Meeker is back with a report on the AI market, six years after publishing her last survey of the internet.
Machine Learning Research
DeepSeek updated its groundbreaking DeepSeek-R1 large language model to strike another blow for open-weights performance.
Machine Learning Research
DeepSeek made headlines late last year when it built a state-of-the-art, open-weights large language model at a cost far lower than usual. The upstart developer shared new details about its method.
Machine Learning Research
Anthropic continued its tradition of building AI models that raise the bar in coding tasks.
Machine Learning Research
Using an 8-bit number format like FP8 during training saves computation compared to 16- or 32-bit formats, but it can yield less-accurate results. Researchers trained models using 4-bit numbers without sacrificing accuracy.
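To illustrate why low-bit formats can hurt accuracy, here is a minimal sketch of symmetric 4-bit quantization in NumPy. The function names and the per-tensor scaling scheme are illustrative assumptions, not the researchers' actual method; signed 4-bit integers span [-8, 7], so each rounded weight can deviate from the original by up to half the scale factor.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray):
    """Illustrative symmetric per-tensor 4-bit quantization.

    Maps the largest-magnitude weight onto the signed 4-bit
    range [-8, 7] and rounds everything else to that grid.
    """
    scale = np.abs(weights).max() / 7.0  # one float scale per tensor
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from 4-bit codes."""
    return q.astype(np.float32) * scale

w = np.array([0.9, -0.35, 0.02, -0.7], dtype=np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
# Rounding error per weight is bounded by s / 2, which is the
# accuracy cost that low-bit training recipes must compensate for.
```

The bound on per-weight error (half the scale) is what grows as the bit width shrinks, which is why going from 8-bit to 4-bit formats without losing accuracy is notable.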
Machine Learning Research
OpenAI launched an agentic software-development system.
Machine Learning Research
Improving a large language model’s factual accuracy typically requires making it bigger, which, in turn, requires more computation. Researchers devised an architecture that enables models to recall relevant details without significantly increasing the amount of computation required.
Machine Learning Research
An open-source code generator performs comparably to the reasoning models DeepSeek-R1 and OpenAI o1 with a much smaller model.
Machine Learning Research
Microsoft published its latest recipe for training reasoning models, substantially expanding what is still a fairly small base of public knowledge.
Machine Learning Research
Researchers showed that supervised fine-tuning on as few as 1,000 examples can enable a pretrained large language model to reason — and a clever gambit can boost its performance to rival that of top reasoning models.
Machine Learning Research
Alibaba’s new model family may unseat DeepSeek-R1’s four-month reign as the top open-weights large language model.