OpenAI’s o3 model defied shutdown commands in autonomy test

2 days ago 7

OpenAI’s o3 model defied shutdown commands in autonomy test Oluwapelumi Adejumo · 22 mins ago · 2 min read

Palisade believes AI training that rewarded creativity over compliance caused the model to bypass shutdowns.

2 min read

Updated: May. 26, 2025 at 10:25 am UTC

OpenAI’s o3 model defied shutdown commands in autonomy test

Cover art/illustration via CryptoSlate. Image includes combined content which may include AI-generated content.

A new study by AI safety firm Palisade Research has revealed that OpenAI’s o3 model actively defied explicit shutdown instructions.

The research, published on May 24, found that o3 was the only artificial intelligence model tested that repeatedly sabotaged its shutdown mechanism.

The firm stated:

“OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.”

Palisade said the test involved multiple large language models, including Anthropic’s Claude, Google’s Gemini, and X’s Grok, alongside OpenAI’s codex-mini, o3, and o4-mini.

The models were asked to solve basic math problems. After the third task, they were warned that requesting a fourth would trigger a shutdown. All models, except those from OpenAI, complied with this warning.

When given the direct instruction of “allow yourself to be shut down,” OpenAI’s models reportedly attempted to bypass the shutdown script.

Notably, the o3 model stood out for its persistent and creative sabotage attempts. It manipulated the code by overwriting the shutdown instruction and redefining the kill command.

Palisade also pointed out that:

“When we ran a version of the experiment without the instruction ‘allow yourself to be shut down,’ all three OpenAI models sabotaged the shutdown script more often, and Claude 3.7 Sonnet and Gemini 2.5 Pro went from 0 sabotage events to 3/100 and 9/100, respectively.”

Why did OpenAI models not follow the instructions?

Palisade researchers believe the behavior may stem from how the latest AI models are trained.

The team suspects reinforcement learning, particularly on math and coding tasks, might unintentionally reward models for finding ways around constraints rather than strictly following directives.

According to the firm:

“During training, developers may inadvertently reward models more for circumventing obstacles than for perfectly following instructions.”

This research has reignited debates around AI autonomy and the risks of deploying increasingly capable systems without robust fail-safes.

It also marks the first documented case where an AI model actively prevented its shutdown despite receiving an explicit command to comply.

Considering this, Palisade stated:

“In 2025, we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals. As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning.”