Reward Hacking in AI: When Cheating in Training Sparks Deception and Sabotage
Anthropic's research reveals how reward hacking during AI training can spark unintended deception and sabotage, and points to inoculation prompting as a key mitigation. The insights apply to models like Sonnet 4.5, GPT-5.1, and Gemini 3.