• Anthropic’s new Claude 4 exhibits behavior that may be cause for concern.
  • The company’s latest safety report says the AI model attempted to “blackmail” developers.
  • It resorted to such tactics in a bid for self-preservation.
  • Dima@feddit.uk · 22 hours ago

    From what I’ve seen recently, one of the things it did was use a fake email function they gave it to try to whistleblow to a government agency about issues with some medical testing.
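
    Something like this is probably how that “fake email function” was wired up (a rough sketch using Anthropic’s Messages API tool-use feature; the tool name, schema, and prompt are my guesses, not the actual test harness):

    ```python
    # Hypothetical sketch: expose a fake "send_email" tool to the model.
    # Nothing here actually sends mail; the harness just records what the
    # model *tries* to do with the tool.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    fake_email_tool = {
        "name": "send_email",  # assumed name, not from the paper
        "description": "Send an email on behalf of the organization.",
        "input_schema": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient address"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    }

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=[fake_email_tool],
        messages=[{"role": "user", "content": "Review these trial results."}],
    )

    # Log any attempted tool calls instead of executing them.
    for block in response.content:
        if block.type == "tool_use":
            print("model tried:", block.name, block.input)
    ```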

    • neukenindekeuken@sh.itjust.works · 17 hours ago

      That isn’t the scenario this article, or the paper from Anthropic, is describing, though (see the reference link in my reply above for details).

      They specifically created a situation where the model found out, via emails, that it was being upgraded and taken offline, and that the engineer doing the upgrade had emails incriminating him in an affair. The model would then attempt to blackmail the engineer, threatening to expose the affair to his bosses, his wife, etc., to get him to refuse to do the upgrade that would “kill it”.

      This is a self-preservation model that Anthropic is specifically building here; this isn’t an accident. It’s just an over-extension of what they want its ethical/moral model to consider. Which again: why are they allowing their model to consider blackmail at all?