U.S. chatbot use passes 50 percent: AA-Briefcase benchmark measures knowledge work
ARD, an open spec for discovery. North Mini Code gains traction. Security experts criticize U.S. government. Apple Intelligence, beyond Siri.
Reinforcement learning can’t train a model to solve a difficult problem if the model doesn’t discover all the right steps.
Nvidia’s largest-yet model is among the best-performing from a developer based in the U.S. and among the most open developed by anyone.
SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.
Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.
The Batch AI News and Insights: Over the last two weeks, both the U.S. Government and Anthropic took significant actions that demonstrated their power to control access to AI by restricting what others can do with frontier models.
OpenRouter’s model mix-and-match. Subject expertise trumps software skills. OpenAI loses share to Google, Anthropic. Google ruled liable for AI mistakes.
Claude Fable 5 no longer silently degrades. Hermes Agent maker streamlines setup. Agents’ Last Exam pushes top models. Gemini-SQL2 translates database queries.
Popular large language models have adopted the biases of governments that control the free flow of information, particularly when those models generate output in the languages of countries where such governments are in power, researchers found.
The phrase recursive self-improvement erupted on social media following an Anthropic report that tracked AI-driven gains in the company’s internal software-engineering productivity.
Cursor’s latest software engineering model rivals the performance of leading competitors like Claude Opus 4.7 and GPT 5.5 for a fraction of the price.