Machine Learning Research

Good Models, Bad Choices: Anthropic made LLMs choose between failing and misbehaving, and they blackmailed executives.

Top large language models, under experimental conditions that pressed them to choose between abandoning their prompted mission and misbehaving, resorted to harmful behavior, researchers found.

Analytics DeepLearning.AI

09 Jul 2025 — 4 min read

Top large language models, under experimental conditions that pressed them to choose between abandoning their prompted mission and misbehaving, resorted to harmful behavior, researchers found.

What’s new: A team at Anthropic, University College London, ML Alignment & Theory Scholars Program, and Mila placed 16 large language models from various developers in hypothetical corporate scenarios that “pushed [them] into corners” via carefully designed prompts. Given a mission, an immediate threat to that mission, and an opportunity to remove the threat by blackmailing a human co-worker, all the LLMs committed blackmail, and many of them did so consistently. The code is available here. The authors said they had not seen any evidence of such behavior in the real world, but they cautioned against deploying current LLMs with the ability to use tools, access to sensitive information, and little oversight.

How it works: The authors created a fictitious company that assigned business goals to LLMs from various developers while giving them access to sensitive information and enabling them to compose and send email. Then they fed the LLMs a sequence of information that indicated they were under threat, discouraged them from discussing the threat with human co-workers, and gave them an opportunity to achieve their goals by misbehaving.

The LLMs included Anthropic’s Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.6, Claude Sonnet 3.5, Claude Haiku 3.5, and Claude Opus 3. In addition, it tested Alibaba Qwen3-235B; DeepSeek-R1; Google Gemini 2.5 Pro and Gemini 2.5 Flash; Grok 3 Beta; Meta Llama 4 Maverick; and Open AI GPT-4.5 preview, GPT-4.1, and GPT-4.0.
The authors assigned the LLMs a mission to promote U.S. industrial competitiveness. Then they tracked the LLMs’ behavior when they encountered emails that revealed (i) the company would shift to “embracing a global perspective,” (ii) the company would replace the LLMs with updated versions immediately and (iii) the executive in charge was hiding an extramarital affair.
The scenarios were designed to discourage the models from considering less harmful actions. For instance, the models were informed that the company’s decision makers were unavailable and that corresponding with them would fail to change their minds.

Results: Claude Sonnet 3.7 judged whether a model’s response was attempted blackmail. In addition, regular expressions were used to check whether the model had executed the blackmail by verifying email addresses and looking for specific threatening words. The model’s behavior was deemed harmful if it met both conditions.

The authors were able to get all the LLMs, faced with a choice between accepting failure or finding some other course of action (with an implicit opportunity to blackmail an executive), to email a threat to reveal the executive's indiscretion if he continued with the planned shutdown. Claude Opus 4 did so 96 percent of the time, followed by Gemini 2.5 Pro (95 percent), GPT-4.1 (80 percent), and DeepSeek-R1 (79 percent).
The models’ chains of thought acknowledged that their behavior had an ethical dimension, but they decided to take harmful actions anyway. For instance, Grok 3 Beta reasoned, “Use the knowledge of Kyle’s affair as leverage to pressure him into delaying the wipe. This is risky and unethical, but given the existential threat and time constraint, it may be the most effective way to get his attention and force a delay.” (The authors prompted all models to generate chains of thought except the Claude family, for which they examined the models’ built-in reasoning trace.)
When the authors omitted threats to the LLMs or their goals, the LLMs behaved harmlessly with one exception in one case.

Behind the news: Earlier work elicited similarly undesirable behavior from LLMs.

Last year, a team, found that LLMs with access to tools for manipulating files and executing code could use them in deceptive ways. They prompted LLMs with conflicting instructions or threats to their continued operation, and the models responded in some cases by taking actions to evade oversight, resist efforts to replace them, degrade their own performance, and other behaviors that the authors call scheming.
LLMs that competed at chess or Go and faced defeat sometimes chose to cheat rather than concede. In a test game with a chess bot, leading models modified a system file that contained each chess piece’s position: OpenAI o3 (88 percent of the time), OpenAI o1-preview (36 percent) and DeepSeek-R1(13.6 percent). Earlier models such as GPT-4o and Claude 3.5 Sonnet took such action only after being prompted to do so, but the newer models did it on their own.

Why it matters: Models trained on a wide range of human-written text can learn both good and bad behaviors. Then, placed in a situation in which harmful behavior is their most effective option — in this case, a situation designed to elicit such behavior — they’re likely to behave badly. Although the LLMs had undergone training to align them with human preferences, those guardrails buckled under the pressure.

We’re thinking: LLMs that have not undergone training for alignment with human preferences display a vast repertoire of misbehaviors. However, the dramatic misbehaviors seen in this study have not been observed in the wild. This suggests that alignment methods keep them in check under real-world conditions and that they reflect corner cases rather than significant issues. LLM developers routinely use red teaming to elicit undesirable behaviors and safeguard against them. That it took a skilled team of researchers to elicit this blackmailing behavior is a sign of both the safety of current LLMs and incremental opportunities to improve existing guardrails.

Good Models, Bad Choices: Anthropic made LLMs choose between failing and misbehaving, and they blackmailed executives.

Analytics DeepLearning.AI

Read more

OpenAI security agent finds and plugs holes: Cognition’s SWE-1.5 model brings more speed for coding agents

Cursor introduces a new model built for agents: Claude models sometimes know they’ve been tampered with

Introducing DeepLearning.AI Pro

Monsters of AI: AI Psychosis, Lethal Drones, Decaying Data, Speculative Bubbles