The Batch | DeepLearning.AI (Page 2)

Home office scene with individual using AI photography app and graphic design software for creative projects.

U.S. chatbot use passes 50 percent: AA-Briefcase benchmark measures knowledge work

ARD, an open spec for discovery. North Mini Code gains traction. Security experts criticize U.S. government. Apple Intelligence, beyond Siri.

Flowchart illustrates the POPE method, transitioning from guided to unguided problem-solving in reinforcement learning.

Reinforcement Learning With Hints: Privileged On-Policy Exploration (POPE) trains models to expand on partial solutions

Reinforcement learning can’t train a model to solve a difficult problem if the model doesn’t discover all the right steps.

Performance table shows Nemotron's scores across benchmarks, highlighting its strengths and weaknesses.

Nvidia’s Nemotron Goes Big: Nvidia Nemotron 3 Ultra bets on speed and openness to win customers

Nvidia’s largest-yet model is among the best-performing from a developer based in the U.S. and among the most open developed by anyone.

A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.

Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Claude Fable 5’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.

Cartoon map illustrating global collaboration in research, open source technology, infrastructure, secure data sharing.

Open Platforms Beat Power Plays

Over the last two weeks, both the U.S. government and Anthropic took significant actions that demonstrated their power to control access to AI by restricting what others can do with frontier models.

Testing Mythos and Fable, Moving Beyond SWE-bench, Nvidia's Open Contender

The Batch AI News and Insights: Over the last two weeks, both the U.S. government and Anthropic took significant actions that demonstrated their power to control access to AI by restricting what others can do with frontier models.

Workspace features map creation, coding; showing synergy between traditional planning and digital strategy.

Zhipu’s GLM-5.2 is the new top open model: SpaceX buys coding favorite Cursor

OpenRouter’s model mix-and-match. Subject expertise trumps software skills. OpenAI loses share to Google, Anthropic. Google ruled liable for AI mistakes.

Female data analyst uses TranslateLLM tool to convert natural language to SQL in modern office setting.

DiffusionGemma breaks 1,000 tokens/second: Anthropic closes Fable and Mythos after U.S. intervention

Claude Fable 5 no longer silently degrades. Hermes Agent maker streamlines setup. Agents’ Last Exam pushes top models. Gemini-SQL2 translates database queries.

Diagram illustrates LLMs processing state-coordinated media, affecting linguistic responses and predictions.

State Media Influences LLM Responses: Significant portions of AI training material reflect national propaganda

Popular large language models have adopted the biases of governments that control the free flow of information, particularly when those models generate output in the languages of countries where such governments are in power, researchers found.

Bar chart shows a sharp rise in code output per person after Claude Code's release, reaching 8x by 2026.

RSI Is the New AGI: What is recursive self-improvement, and why is everybody talking about it?

The phrase recursive self-improvement erupted on social media following an Anthropic report that tracked AI-driven gains in the company’s internal software-engineering productivity.

Chart compares performance of Composer 2.5 against Opus 4.7, GPT-5.5, and Composer 2 in benchmarks.

Cursor Fits Its Model to Its Agent: Composer 2.5 for Cursor rivals GPT-5.5's coding abilities at lower price

Cursor’s latest software engineering model rivals the performance of leading competitors like Claude Opus 4.7 and GPT-5.5 for a fraction of the price.

Claude Mythos 5 excels, achieving top scores in agentic coding and cybersecurity compared to rivals.

Machine Learning Research

Behold Mythos!: Anthropic released Claude Mythos 5 and Claude Fable 5, a public version with safeguards

After months of headlines that teased a large language model with extraordinary capabilities, Anthropic launched Claude Mythos 5, which can crack software previously believed to be secure, and Claude Fable 5, a version for general use that limits what users can do in an unprecedented way.

Executing a sales report analysis for Q1 2026, focusing on revenue trends and product performance via desktop tools.

Letters

Agents on the Desktop

If you haven’t already, I encourage you to experiment with using AI agents not just to chat but to actually do work for you on your desktop.

The Batch Newsletter

Mythos Begets Fable, Cursor's Composer 2.5, Agents Building Agents

The Batch AI News and Insights: If you haven’t already, I encourage you to experiment with using AI agents not just to chat but to actually do work for you on your desktop.

Blending history with technology, a Greek marketplace scene with a lyre player, kids, drone, and robot assistant.

Data Points

Claude Fable 5, or Mythos for the masses: Apple rethinks mixture-of-experts to save local memory

Google’s voice translation model covers 70+ languages. OpenAI’s preliminary public-offering paperwork. NotebookLM, now powered by Gemini 3.5 agents. FrontierCode, a new code-quality benchmark from Cognition.

Scientists collaborate in a lab, using pipettes and computers for vaccine development research, wearing lab coats.

Data Points

Apple builds local/cloud models with Google: Gemma 4 12B, a laptop-sized model with multimodal power

The first working vaccine built by AI. Kimi CLI, Moonshot’s software engineering agent. The White House’s plans for an OpenAI stake. OpenJarvis, an open-source agent that learns on-device.

Flowchart shows book text split, input summary, model training, and memorization testing in LLM workflow.

Machine Learning Research

Fine-Tuning LLMs to Expand on Summaries Unearths Pretraining Texts: Fine-tuning can strip models of copyright alignment guidelines

Fine-tuning large language models on a seemingly benign task that would be useful to writers — expanding plot summaries into paragraphs of polished fiction — causes them to regurgitate substantial portions of books on which they were pretrained.

Comments highlight using Singapore nodes for AI access, tying to gray market themes discussed.

Tech & Society

Inside the Gray Market for LLM Access: Middlemen package extra tokens, hijack IDs to resell, distill models

An ecosystem of API proxy servers enables AI developers in China to access top U.S. models at deeply discounted prices.

Animated green radar sweeps in a circular motion, highlighting data detection and tracking on a black screen.

Tech & Society

How AI Is Saving Whales: WhaleSpotter pairs sensors with AI algorithms to detect marine mammals

An AI-powered network of thermal sensors is helping ships avoid collisions with whales.

Flowchart depicting LLMs memorizing and responding to state media, affecting language-specific outputs.

Machine Learning Research

Qwen3.7-Max Adds Speed and Power: Alibaba's latest proprietary model challenges U.S. rivals

Alibaba updated its flagship large language model for long-running agentic work, pushing it into the top rank among LLMs built in China.

In a modern salon, AI-powered robot stylist braids woman's hair, emphasizing technology in personal care.

Letters

AI Regulations Must Balance Innovation and Risk

There have been intense efforts over the past few years to lobby governments to pass AI laws for regulatory capture or to suppress open source.

The Batch Newsletter

Qwen3.7-Max Challenges Google for Third Place, AI Saves Whales, Fine-Tuning Breaks Copyright Alignment

The Batch AI News and Insights: There have been intense efforts over the past few years to lobby governments to pass AI laws for regulatory capture or to suppress open source.

Robots inspect network data with magnifying glasses, showcasing AI's role in data analysis and query processing.

Data Points

Microsoft fully trains its own models: Copilot app brings agent management to desktop

How agents think about search. Hermes now a multi-platform desktop app. Qwen3.7-Plus, Alibaba’s midsized cloud model. OpenAI’s latest plugins for Codex.
