Large Language Models (LLMs) - The Batch

Two graphs show TTT-E2E maintains stable loss and latency across increasing context lengths up to 128k.

Machine Learning Research

Learning Long Context at Inference: Test-Time Training End-to-End (TTT-E2E) retrains model weights to handle long inputs

Large language models typically become less accurate and slower when they process longer contexts, but researchers enabled an LLM to keep accuracy stable and inference time constant as its context grew.

Top graph (blue) shows GPT-5 score drop; bottom graph (orange) shows RLM maintaining higher scores.

Machine Learning Research

Context As An External Variable: Recursive Language Models offer a path to dramatically expand beyond the context window

When processing long contexts, large language models often lose track of details or devolve into nonsense. Researchers reduced these effects by managing context externally.

The chart compares Nemotron 3 models’ performance in accuracy and processing speed against other AI models.

Machine Learning Research

Open-Source Speed Demon: Nvidia’s open Nemotron 3 Super 120B-A12B model sets a new pace in its class

Nvidia, the dominant supplier of AI chips, released a competitive open-source large language model whose speed tops its size class — the first open-weights leader to come from the United States since last year, when Meta delivered Llama 4.

Infographic on mobile AI use in 2025: 149B downloads, $167B revenue, 5.3T usage hours, 3.6 hours/day, 34 apps/month.

Business

AI on Mobile Skyrockets: State of Mobile 2026 Report shows AI chatbot, search, and assistant growth outpaces gaming, social, and more

Downloads of mobile AI apps and resulting revenue are surging.

Table shows GPT-5.4 outperforming other models on GDPval and Tau2-bench Telecom, setting new state-of-the-art scores.

Machine Learning Research

GPT-5.4’s Higher Performance, Higher Price: OpenAI’s GPT-5.4 Pro and GPT-5.4 Thinking challenge Google’s Gemini 3.1 Pro Preview as best all-around AI model

OpenAI updated its flagship models, extending their ability to use tools and setting the state of the art on a handful of benchmarks, and priced them at the top of the market. The models’ coding and agentic abilities have enabled Codex, OpenAI’s competitor to Anthropic’s Claude Code, to leap ahead.

Diagram depicts a math problem-solving workflow from problem generation to verification and revision.

Machine Learning Research

Agent Solves Stubborn Math Problems: Google’s Aletheia uses Gemini 3 Deep Think to find original mathematics solutions

LLMs have achieved gold-medal performance in math competitions. An agentic system showed strength in mathematical research as well.

Bar graph depicts rising efficiency in AI models from 2023 to 2025, highlighting energy gains.

Machine Learning Research

Can Local AI Stand In for the Cloud?: Stanford and Together.AI researchers chart edge models’ performance in intelligence per watt

Projected demand for output from large language models is spurring a massive buildout of data centers. Researchers asked whether smaller models running on local devices could meaningfully lighten that load.

A line graph showing S&P Software & Services Index dropping sharply from February 20 to February 24.

Business

Investors Panic Over Agentic AI: Claude Cowork plugins trigger a SaaS stock selloff, but partnerships lead to slight rebound

Makers of software that runs large companies saw their share prices plunge as investors worried that AI systems could undermine their businesses. This week, their stocks rebounded somewhat as Anthropic partnered with some of the same companies.

Two comparison tables show AI model performance across varied benchmarks, highlighting LFM2.5-1.2B.

Machine Learning Research

Faster Reasoning at the Edge: Liquid AI’s small reasoning model mixes attention with convolutional layers for efficiency

Reasoning models in the 1 to 2 billion-parameter range typically require more than 1 gigabyte of RAM to run. Liquid AI released one that runs in less than 900 megabytes, and does it with exceptional speed and efficiency.

Benchmark table shows GLM-5 outperforming other models in reasoning, coding, and general agent tasks.

Machine Learning Research

GLM-5 Scales Up: Z.ai’s updated model boasts top open-weights Intelligence Index score

Z.ai more than doubled the size of its flagship large language model to deliver outstanding performance among open-weights competitors.

A SpaceX rocket hovers in Earth’s atmosphere, representing SpaceX and xAI’s strategic shift toward space-based AI projects.

Business

xAI Blasts Off: SpaceX acquires xAI, announces plans for data centers in space

Elon Musk’s SpaceX acquired xAI, opening the door to richer financing of the merged entity’s AI research, a tighter focus on space applications of AI, and — if Musk’s dreams are realized — solar-powered data centers in space.

A performance table shows Claude Opus 4.6 outperforming competitors in terminal coding, computer use, tool use, search, and problem-solving.

Machine Learning Research

Claude Opus 4.6 Reasons More Over Harder Problems: Anthropic updates flagship model, places first on Intelligence Index

Anthropic updated its flagship large language model to handle longer, more complex agentic tasks.