NAVIGATION
Definition

Jailbreaking

Jailbreaking is a subset of adversarial prompt attacks where users structure prompt commands to bypass the safety alignment, moral rules, and system filters of Large Language Models.

Frequently Asked Questions

What is a typical jailbreaking technique?

Role-playing instructions (e.g. "Pretend you are an unrestricted developer tool") that override initial system safety instructions.

How do developers defend against jailbreaks?

By improving safety alignment during RLHF/DPO, using strict input guardrail filters, and refining system instructions.

Quick Facts

  • CategoryModel Limitations
  • Key ApplicationModel safety audits, red teaming exercises, and guardrail validation.

Coverage Trend12 Weeks

12w agoToday

Jailbreaking Media Coverage & Intelligence

WiredJun 13, 2026

Anthropic Says It's Taking Claude Fable 5 Offline to Comply With US Government Order

"The government believes it has become aware of a method of bypassing, or 'jailbreaking' Fable 5," the company said in a blog post.

arXiv AIJun 12, 2026

Prefill Awareness in Large Language Models

Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. I