Alignment
Alignment is the process of guiding an AI model's behaviors, responses, and values to match human intent, safety principles, and ethical standards. Unaligned models might generate toxic text, assist in harmful activities, or refuse legitimate user requests.
Frequently Asked Questions
How is alignment achieved in LLMs?
Typically through RLHF (Reinforcement Learning from Human Feedback), DPO (Direct Preference Optimization), or supervised instruction tuning.
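As a rough illustration of one of these methods, the per-pair DPO objective can be sketched in a few lines. This is a minimal sketch, not a training implementation: the function name dpo_loss, the beta value of 0.1, and the log-probability numbers are all assumptions chosen for illustration. It assumes you already have the summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under both the model being trained and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).

    All arguments are hypothetical summed log-probabilities of a full
    response; beta (assumed 0.1 here) scales how strongly the policy is
    pushed away from the reference model.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Made-up log-probabilities for a single preference pair.
    neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
    improved = dpo_loss(-10.0, -15.0, -12.0, -12.0)
    print(f"policy equals reference: {neutral:.4f}")
    print(f"policy prefers chosen:   {improved:.4f}")
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, which is how preference data steers the model without a separately trained reward model.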
Can alignment be bypassed?
Yes. Adversarial prompts or jailbreak patterns can exploit vulnerabilities to bypass an aligned model's safety limits.
Quick Facts
- Category: Alignment & Safety
- Key Application: Safety filtering, toxic text reduction, and brand protection
Coverage Trend: 12 Weeks
Alignment Media Coverage & Intelligence
Measuring Curriculum Alignment across Topical Coverage, Competency, and Cognitive Depth: A Longitudinal Framework Applied to CS2013 and CS2023
Undergraduate computer science is governed by international curricular guidelines revised about once a decade, yet programs lack a reliable, reproducible way to
Emergent Alignment
Can Large Language Models (LLMs) discern when their own outputs are misaligned with human ethics? And can they self-correct? We endow an LLM with a conscience s
Import AI 461: "Alignment is not on track"; FrontierCode; and synthetic research interns
Where are your agents right now?
Prefill Awareness in Large Language Models
Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefilling model outputs. I