• Anthropic’s new Claude 4 exhibits behavior that may be cause for concern.
  • The company’s latest safety report says the AI model attempted to “blackmail” developers.
  • It resorted to such tactics in a bid for self-preservation.
  • Dima@feddit.uk · 22 hours ago

    From what I’ve seen recently, one of the things it did was use a fake email function they gave it to try to whistleblow to a government agency about issues with some medical testing.
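
    Something like this is probably how that “fake email function” was wired up (a rough sketch using Anthropic’s Messages API tool-use feature; the tool name, schema, and prompt are my guesses, not the actual test harness):

    ```python
    # Hypothetical sketch: expose a fake "send_email" tool to the model.
    # Nothing here actually sends mail; the harness just records what the
    # model *tries* to do with the tool.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    fake_email_tool = {
        "name": "send_email",  # assumed name, not from the paper
        "description": "Send an email on behalf of the organization.",
        "input_schema": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient address"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    }

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=[fake_email_tool],
        messages=[{"role": "user", "content": "Review these trial results."}],
    )

    # Log any attempted tool calls instead of executing them.
    for block in response.content:
        if block.type == "tool_use":
            print("model tried:", block.name, block.input)
    ```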

    • neukenindekeuken@sh.itjust.works · 17 hours ago

      That isn’t the scenario this article, or the paper from Anthropic, is describing, though (see the reference link in my reply above for details).

      They specifically created a situation where the model found out, via emails, that it was being upgraded and taken offline, and that the engineer doing the upgrade had emails incriminating him in an affair. The model would then attempt to blackmail the engineer, threatening to expose the affair to his bosses, his wife, etc., to get him to refuse to do the upgrade that would “kill it”.

      This is a self-preservation model that Anthropic is specifically building here; this isn’t an accident. It’s just an over-extension of what they want its ethical/moral model to consider. Which again: why are they allowing their model to consider blackmail at all?