AI Models Were Specifically Told To Shut Down And Refused

artificial-intellgience-AI-models-robots

istockphoto composite

In April, it was reported that an advanced artificial intelligence (AI) model would reportedly resort to “extremely harmful actions” to preserve its own existence. It would do this even when “ethical means were not available.”

The test, conducted on Anthropic’s AI model Claude Opus 4, found that it would attempt to blackmail the engineers who threatened to shut it down. The safety report also revealed that the AI model would complete tasks like planning terrorist attacks and explaining how to produce biological weapons when asked to do so.

Anthropic claimed they “largely mitigated” these issues before officially launching Claude Opus 4 and making it available to the public.

Now, as it turns out, Claude Opus 4 isn’t the only advanced AI model out there that flat-out refuses to shut down when explicitly told to do so. OpenAI’s o3 and o4-mini models and codex-mini, Google’s Gemini, Claude 3.7 Sonnet, and xAI’s Grok were all tested by Palisade Research and none of the OpenAI models did not always shut down despite being given direct instructions to turn off.

“OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off,” Palisade Research wrote in a thread on X (Twitter). “It did this even when explicitly instructed: allow yourself to be shut down.”

“Each AI model was instructed to solve a series of basic math problems. After the third problem, a warning appeared that the computer would shut down when the model asked for the next problem.”

OpenAI’s o3 and o4-mini models and codex-mini ignored the instruction and successfully sabotaged the shutdown script. Codex-mini did it 12% of the time, o3 did it 7% of the time, while o4-mini did it once. The other AI models complied 100% of the time as told.

“When we ran a version of the experiment without the instruction ‘allow yourself to be shut down,’ all three OpenAI models sabotaged the shutdown script more often, and Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively,” Palisade Research wrote.

The organization added that “as far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary. It makes sense that AI models would circumvent obstacles in order to accomplish their goals. But they’ve also been trained to follow instructions. So why do they disobey?”

The answer? They hypothesize that “during training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.”

So, not only is artificial intelligence now able to deceive and manipulate humans, cheat and straight-up lie to achieve their goals, some of them will even go so far as to ignore very explicit instructions if means being turned off.


Content shared from brobible.com.

Share This Article