Tokenization
Token Explorer
Language models don't read words. They read tokens — subword units computed by Byte Pair Encoding. The same sentence becomes a different sequence depending on which tokenizer processes it. This tool runs real tokenizer vocabularies client-side so you can see exactly where the boundaries fall, and why they matter.
Real tokenization via js-tiktoken and llama-tokenizer-js. Every token ID, boundary, and byte count is computed client-side from production tokenizer vocabularies.
o200k_base (GPT-4o) — BPE with 200k vocabulary optimized for multilingual text and code.
cl100k_base (GPT-4) — BPE with 100k tokens. Fewer merges = more splits on non-English text.
Llama 2 — SentencePiece (BPE-based) with 32k tokens. Smaller vocab = more tokens per text.
Why This Matters
Tokenization is the first transformation any input undergoes, and every downstream decision depends on it. A model's context window is measured in tokens, not characters. Its cost is billed per token. Its attention patterns connect tokens, not words.
The unintuitive result: the same English sentence can cost 30% more tokens in one tokenizer than another. Code is typically 2–3x more expensive per character than prose. Non-Latin scripts can be 4–6x more expensive than English. These aren't bugs — they're consequences of how BPE vocabularies are trained on corpus statistics.
How BPE Works
Byte Pair Encoding starts with individual bytes and iteratively merges the most frequent adjacent pair into a new token. After 100,000+ merges, you get a vocabulary where common words like "the" are single tokens and rare words are split into subword pieces.
The merge order determines everything. A tokenizer trained on English prose learns merges like t+h → th, th+e → the. A tokenizer trained on code learns merges like f+u → fu, fu+nc → func. GPT-4o's o200k_base has 200,019 tokens — double cl100k's vocabulary — specifically to compress multilingual text and code more efficiently.
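The merge loop can be sketched in a few lines of JavaScript. This is a character-level toy, not the byte-level production algorithm, and every name in it is illustrative:

```javascript
// Toy BPE trainer: character-level, whitespace-split words (production
// tokenizers are byte-level and handle pre-tokenization differently).

// Find the most frequent adjacent symbol pair across the weighted word list.
function mostFrequentPair(words) {
  const counts = new Map();
  let best = null;
  let bestCount = 0;
  for (const [symbols, freq] of words) {
    for (let i = 0; i < symbols.length - 1; i++) {
      const key = symbols[i] + "\u0000" + symbols[i + 1];
      const c = (counts.get(key) || 0) + freq;
      counts.set(key, c);
      if (c > bestCount) { bestCount = c; best = [symbols[i], symbols[i + 1]]; }
    }
  }
  return best;
}

// Replace every occurrence of the pair [a, b] with the merged symbol a+b.
function applyMerge(words, [a, b]) {
  return words.map(([symbols, freq]) => {
    const out = [];
    for (let i = 0; i < symbols.length; i++) {
      if (symbols[i] === a && symbols[i + 1] === b) { out.push(a + b); i++; }
      else { out.push(symbols[i]); }
    }
    return [out, freq];
  });
}

function trainBPE(corpus, numMerges) {
  // Build a word-frequency table; each word starts as single characters.
  const freq = new Map();
  for (const w of corpus.split(/\s+/).filter(Boolean)) {
    freq.set(w, (freq.get(w) || 0) + 1);
  }
  let words = [...freq].map(([w, f]) => [w.split(""), f]);
  const merges = [];
  for (let k = 0; k < numMerges; k++) {
    const pair = mostFrequentPair(words);
    if (!pair) break;
    merges.push(pair);
    words = applyMerge(words, pair);
  }
  return merges;
}

console.log(trainBPE("the theme then the this the", 2));
// → [["t","h"], ["th","e"]]: t+h is the most frequent pair, then th+e
```

Run on an English-heavy corpus this learns exactly the kind of merge sequence described above; swap in a code corpus and the early merges shift toward identifier and keyword fragments.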
Source: Sennrich, Haddow & Birch, "Neural Machine Translation of Rare Words with Subword Units" (ACL 2016). Tiktoken vocabularies: OpenAI (2023). SentencePiece: Kudo & Richardson (EMNLP 2018).
What to Notice
Toggle between tokenizers and watch boundaries shift. "tokenization" might be one token in o200k but three in Llama 2. Each split is a position the model must attend across — more splits mean a longer sequence, and attention cost grows with sequence length.
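The vocabulary effect is easy to see in a toy greedy encoder. This sketch assumes an ordered merge list in the style of the original BPE paper; it is not how js-tiktoken is actually implemented:

```javascript
// Toy BPE encoder: split a word into characters, then apply an ordered
// merge list (as learned during training) until no merge fires.
function encode(word, merges) {
  let symbols = word.split("");
  for (const [a, b] of merges) {
    const out = [];
    for (let i = 0; i < symbols.length; i++) {
      if (symbols[i] === a && symbols[i + 1] === b) { out.push(a + b); i++; }
      else { out.push(symbols[i]); }
    }
    symbols = out;
  }
  return symbols;
}

// A larger vocabulary (more merges) leaves fewer splits in the same word:
console.log(encode("token", [["t", "o"]]));                          // ["to","k","e","n"]
console.log(encode("token", [["t", "o"], ["to", "k"], ["e", "n"]])); // ["tok","en"]
```

The same word costs four tokens under the small merge list and two under the larger one — the vocabulary-size effect in miniature.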
English prose averages ~4 characters per token with BPE. Code compresses less (~2.5 chars/token). Try the Unicode example: Japanese and Russian text can drop below 2 chars/token, meaning the same semantic content uses 2x the context window.
Enable comparison to see boundary overlap percentage. High overlap means the tokenizers agree on word boundaries. Low overlap means the same text has fundamentally different structure inside each model.
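One plausible way to compute that overlap number — an assumption about the tool's internals, with hypothetical names — is Jaccard similarity over token-end character offsets:

```javascript
// Boundary overlap between two tokenizations of the same text, given each
// tokenizer's token lengths in characters. Returns the Jaccard similarity
// of the two sets of token-end offsets.
function boundaryOverlap(lengthsA, lengthsB) {
  const cuts = (lengths) => {
    const ends = new Set();
    let pos = 0;
    for (const n of lengths) { pos += n; ends.add(pos); }
    return ends;
  };
  const a = cuts(lengthsA);
  const b = cuts(lengthsB);
  const shared = [...a].filter((x) => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return shared / union;
}

// Boundaries at {5, 8, 12} vs {5, 12}: two of three positions agree.
console.log(boundaryOverlap([5, 3, 4], [5, 7])); // ≈ 0.667
```

Both tokenizations must cover the same text, so the final offsets always agree; disagreement lives entirely in the interior cuts.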
The cost display uses real API pricing. A prompt that costs $0.001 with GPT-4o might cost $0.004 with Llama 2 purely because the smaller vocabulary produces more tokens. Vocabulary size is a hidden cost multiplier.
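The arithmetic behind that multiplier is simple. The prices and token counts below are illustrative assumptions, not the tool's live rates:

```javascript
// Per-request prompt cost from a token count and a $/million-token price.
function promptCost(tokens, usdPerMillionTokens) {
  return (tokens * usdPerMillionTokens) / 1e6;
}

// Same text, two tokenizers: a 32k vocab can emit roughly twice the tokens
// of a 200k vocab, doubling the bill even at an identical per-token price.
console.log(promptCost(400, 2.5)); // 0.001
console.log(promptCost(800, 2.5)); // 0.002
```

Because billing is linear in tokens, any tokenizer that inflates the token count inflates the bill by the same factor, before any difference in per-token price.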
In the Series
Token Explorer is the entry point to neural's ML legibility tools. Tokenization is the first thing that happens to text, and everything downstream — attention patterns, embedding geometry, sampling distributions — operates on whatever tokens this step produces.
For the cost side of tokenization, see Inference Cost, which extends token counts to FLOPs, energy, and carbon.