NAVIGATION
Definition

Tokenization

Tokenization is the process of breaking down a text string into smaller pieces called tokens (which can be characters, subwords, or full words). Tokenization converts text into numbers that a neural network can process.

Frequently Asked Questions

Why do LLMs use subword tokenization instead of word tokenization?

To handle spelling mistakes and compound words without generating a massive, unmanageable vocabulary size.

What are Byte-Pair Encoding (BPE) and WordPiece?

Popular subword tokenization algorithms used by modern models like GPT (BPE) and BERT (WordPiece).

Quick Facts

  • CategoryNatural Language Processing
  • Key ApplicationLLM vocabulary building, text pre-processing, and prompt encoding

Coverage Trend12 Weeks

12w agoToday

Related AI Terms

Tokenization Media Coverage & Intelligence

arXiv AIJun 19, 2026

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting phys