Definition

Jailbreaking

Jailbreaking is a subset of adversarial prompt attacks where users structure prompt commands to bypass the safety alignment, moral rules, and system filters of Large Language Models.

Frequently Asked Questions

What is a typical jailbreaking technique?▼

Role-playing instructions (e.g. "Pretend you are an unrestricted developer tool") that override initial system safety instructions.

How do developers defend against jailbreaks?▼

By improving safety alignment during RLHF/DPO, using strict input guardrail filters, and refining system instructions.

Quick Facts

CategoryModel Limitations
Key ApplicationModel safety audits, red teaming exercises, and guardrail validation.

Coverage Trend12 Weeks

12w agoToday

Related AI Terms

Adversarial Attack AI Safety Guardrails

Jailbreaking Media Coverage & Intelligence

WiredJun 13, 2026

Anthropic Says It's Taking Claude Fable 5 Offline to Comply With US Government Order

"The government believes it has become aware of a method of bypassing, or 'jailbreaking' Fable 5," the company said in a blog post.

Read Original Coverage

arXiv AIJun 12, 2026

Prefill Awareness in Large Language Models

Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. I

Read Original Coverage