Tokenization
Token Explorer
Language models don't read words. They read tokens — subword units computed by Byte Pair Encoding. The same sentence becomes a different sequence depending on which tokenizer processes it. This tool runs real tokenizer vocabularies client-side so you can see exactly where the boundaries fall, and why they matter.
Real tokenization via js-tiktoken and llama-tokenizer-js. Every token ID, boundary, and byte count is computed client-side from production tokenizer vocabularies.
o200k_base (GPT-4o) — BPE with 200k vocabulary optimized for multilingual text and code.
cl100k_base (GPT-4) — BPE with 100k tokens. Fewer merges = more splits on non-English text.
Llama 2 — SentencePiece (BPE-based) with 32k tokens. Smaller vocab = more tokens per text.
Why This Matters
Tokenization is the first transformation any input undergoes, and every downstream decision depends on it. A model's context window is measured in tokens, not characters. Its cost is billed per token. Its attention patterns connect tokens, not words.
The unintuitive result: the same English sentence can cost 30% more tokens in one tokenizer than another. Code is typically 2–3x more expensive per character than prose. Non-Latin scripts can be 4–6x more expensive than English. These aren't bugs — they're consequences of how BPE vocabularies are trained on corpus statistics.
How BPE Works
Byte Pair Encoding starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token. After 100,000+ merges, you get a vocabulary where common words like "the" are single tokens and rare words are split into subword pieces.
The merge order determines everything. A tokenizer trained on English prose learns merges like t+h → th, th+e → the. A tokenizer trained on code learns merges like f+u → fu, fu+nc → func. GPT-4o's o200k_base has 200,019 tokens — double cl100k's vocabulary — specifically to compress multilingual text and code more efficiently.
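The merge loop can be sketched in a few lines of JavaScript. This is a character-level toy, not the byte-level production algorithm, and every name in it is illustrative:

```javascript
// Toy BPE trainer: character-level, whitespace-split words (production
// tokenizers are byte-level and handle pre-tokenization differently).

// Find the most frequent adjacent symbol pair across the weighted word list.
function mostFrequentPair(words) {
  const counts = new Map();
  let best = null;
  let bestCount = 0;
  for (const [symbols, freq] of words) {
    for (let i = 0; i < symbols.length - 1; i++) {
      const key = symbols[i] + "\u0000" + symbols[i + 1];
      const c = (counts.get(key) || 0) + freq;
      counts.set(key, c);
      if (c > bestCount) { bestCount = c; best = [symbols[i], symbols[i + 1]]; }
    }
  }
  return best;
}

// Replace every occurrence of the pair [a, b] with the merged symbol a+b.
function applyMerge(words, [a, b]) {
  return words.map(([symbols, freq]) => {
    const out = [];
    for (let i = 0; i < symbols.length; i++) {
      if (symbols[i] === a && symbols[i + 1] === b) { out.push(a + b); i++; }
      else { out.push(symbols[i]); }
    }
    return [out, freq];
  });
}

function trainBPE(corpus, numMerges) {
  // Build a word-frequency table; each word starts as single characters.
  const freq = new Map();
  for (const w of corpus.split(/\s+/).filter(Boolean)) {
    freq.set(w, (freq.get(w) || 0) + 1);
  }
  let words = [...freq].map(([w, f]) => [w.split(""), f]);
  const merges = [];
  for (let k = 0; k < numMerges; k++) {
    const pair = mostFrequentPair(words);
    if (!pair) break;
    merges.push(pair);
    words = applyMerge(words, pair);
  }
  return merges;
}

console.log(trainBPE("the theme then the this the", 2));
// → [["t","h"], ["th","e"]]: t+h is the most frequent pair, then th+e
```

Run on an English-heavy corpus this learns exactly the kind of merge sequence described above; swap in a code corpus and the early merges shift toward identifier and keyword fragments.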
Source: Sennrich, Haddow & Birch, "Neural Machine Translation of Rare Words with Subword Units" (ACL 2016). Tiktoken vocabularies: OpenAI (2023). SentencePiece: Kudo & Richardson (EMNLP 2018).
What to Notice
Toggle between tokenizers and watch boundaries shift. "tokenization" might be one token in o200k but three in Llama 2. Each split is a position the model must attend across — more splits mean a longer sequence, and attention cost grows with sequence length.
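The vocabulary effect is easy to see in a toy greedy encoder. This sketch assumes an ordered merge list in the style of the original BPE paper; it is not how js-tiktoken is actually implemented:

```javascript
// Toy BPE encoder: split a word into characters, then apply an ordered
// merge list (as learned during training) until no merge fires.
function encode(word, merges) {
  let symbols = word.split("");
  for (const [a, b] of merges) {
    const out = [];
    for (let i = 0; i < symbols.length; i++) {
      if (symbols[i] === a && symbols[i + 1] === b) { out.push(a + b); i++; }
      else { out.push(symbols[i]); }
    }
    symbols = out;
  }
  return symbols;
}

// A larger vocabulary (more merges) leaves fewer splits in the same word:
console.log(encode("token", [["t", "o"]]));                          // ["to","k","e","n"]
console.log(encode("token", [["t", "o"], ["to", "k"], ["e", "n"]])); // ["tok","en"]
```

The same word costs four tokens under the small merge list and two under the larger one — the vocabulary-size effect in miniature.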
English prose averages ~4 characters per token with BPE. Code compresses less (~2.5 chars/token). Try the Unicode example: Japanese and Russian text can drop below 2 chars/token, meaning the same semantic content uses 2x the context window.
Enable comparison to see boundary overlap percentage. High overlap means the tokenizers agree on word boundaries. Low overlap means the same text has fundamentally different structure inside each model.
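One plausible way to compute that overlap number — an assumption about the tool's internals, with hypothetical names — is Jaccard similarity over token-end character offsets:

```javascript
// Boundary overlap between two tokenizations of the same text, given each
// tokenizer's token lengths in characters. Returns the Jaccard similarity
// of the two sets of token-end offsets.
function boundaryOverlap(lengthsA, lengthsB) {
  const cuts = (lengths) => {
    const ends = new Set();
    let pos = 0;
    for (const n of lengths) { pos += n; ends.add(pos); }
    return ends;
  };
  const a = cuts(lengthsA);
  const b = cuts(lengthsB);
  const shared = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return shared / union;
}

// Boundaries at {5, 8, 12} vs {5, 12}: two of three positions agree.
console.log(boundaryOverlap([5, 3, 4], [5, 7])); // ≈ 0.667
```

Both tokenizations must cover the same text, so the final offsets always agree; disagreement lives entirely in the interior cuts.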
The cost display uses real API pricing. A prompt that costs $0.001 with GPT-4o might cost $0.004 with Llama 2 purely because the smaller vocabulary produces more tokens. Vocabulary size is a hidden cost multiplier.
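The arithmetic behind that multiplier is simple. The prices and token counts below are illustrative assumptions, not the tool's live rates:

```javascript
// Per-request prompt cost from a token count and a $/million-token price.
function promptCost(tokens, usdPerMillionTokens) {
  return (tokens * usdPerMillionTokens) / 1e6;
}

// Same text, two tokenizers: a 32k vocab can emit roughly twice the tokens
// of a 200k vocab, doubling the bill even at an identical per-token price.
console.log(promptCost(400, 2.5)); // 0.001
console.log(promptCost(800, 2.5)); // 0.002
```

Because billing is linear in tokens, any tokenizer that inflates the token count inflates the bill by the same factor, before any difference in per-token price.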
In the Series
Token Explorer is the entry point to neural's ML legibility tools. Tokenization is the first thing that happens to text, and everything downstream — attention patterns, embedding geometry, sampling distributions — operates on whatever tokens this step produces.
For the cost side of tokenization, see Inference Cost, which extends token counts to FLOPs, energy, and carbon.