According to Anthropic, fictional representations of artificial intelligence can have a real effect on AI models.
Last year, the company said that during pre-launch testing involving a fictitious company, Claude Opus 4 often tried to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that other companies’ models had similar problems with “agentic misalignment.”
Anthropic has apparently done more work on that behavior, stating in a post on X: “We believe the original source of the behavior was Internet text that portrayed AI as evil and interested in self-preservation.”
The company went into more detail in a blog post, stating that since Claude Haiku 4.5, Anthropic models “never engage in blackmail” during testing, “while previous models sometimes did up to 96% of the time.”
What accounts for the difference? The company said it found that training on “documents on Claude’s constitution and fictional stories about AIs behaving admirably” improves alignment.
Relatedly, Anthropic said it found that training is most effective when it includes “the principles underlying aligned behavior” and not just “demonstrations of aligned behavior alone.”
“Doing both together appears to be the most effective strategy,” the company said.