Definition
Jailbreaking
Frequently Asked Questions
What is a typical jailbreaking technique?▼
Role-playing instructions (e.g. "Pretend you are an unrestricted developer tool") that override initial system safety instructions.
How do developers defend against jailbreaks?▼
By improving safety alignment during RLHF/DPO, using strict input guardrail filters, and refining system instructions.
Quick Facts
- CategoryModel Limitations
- Key ApplicationModel safety audits, red teaming exercises, and guardrail validation.
Coverage Trend12 Weeks
12w agoToday
Related AI Terms
Jailbreaking Media Coverage & Intelligence
WiredJun 13, 2026
Anthropic Says It's Taking Claude Fable 5 Offline to Comply With US Government Order
"The government believes it has become aware of a method of bypassing, or 'jailbreaking' Fable 5," the company said in a blog post.
arXiv AIJun 12, 2026
Prefill Awareness in Large Language Models
Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. I