Definition
Tokenization
Tokenization is the process of breaking down a text string into smaller pieces called tokens (which can be characters, subwords, or full words). Tokenization converts text into numbers that a neural network can process.
Frequently Asked Questions
Why do LLMs use subword tokenization instead of word tokenization?▼
To handle spelling mistakes and compound words without generating a massive, unmanageable vocabulary size.
What are Byte-Pair Encoding (BPE) and WordPiece?▼
Popular subword tokenization algorithms used by modern models like GPT (BPE) and BERT (WordPiece).
Quick Facts
- CategoryNatural Language Processing
- Key ApplicationLLM vocabulary building, text pre-processing, and prompt encoding
Coverage Trend12 Weeks
12w agoToday
Related AI Terms
Tokenization Media Coverage & Intelligence
arXiv AIJun 19, 2026
Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese
Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting phys