Tokenization: The First Decision That Shapes Everything Your LLM Does

Every kirana store has a billing system. If the system was set up for a shop that mostly sells atta, dal, and rice, those items get single, efficient product codes — beep, done. But walk in asking for high-mountain oolong tea or high-altitude quinoa, and the shopkeeper has to punch in each word letter-by-letter from a handwritten label. Slower billing. Higher effort. And the shopkeeper has never stocked half of it, so good luck getting a recommendation.

That billing system? That’s a tokenizer. And unless you’ve looked closely at how yours works, you’re probably paying a tax you didn’t know existed.

The unit problem

Here’s a question worth sitting with before we go further: when you feed text into a language model, what unit of text should it process?

Characters seem like the obvious starting point, since computers already store text as sequences of numbers. But character-by-character processing is painfully granular. The letter “u” carries almost no meaning on its own. Your brain doesn’t read that way either.

Words feel more natural. But words break in ugly ways. Consider “unhappiness.” If a word-level system only ever saw “happy,” “unhappy,” and “happiness” during training, then “unhappiness” is a complete stranger — unrepresentable. Now multiply that problem across Hindi, Mandarin, Python code, medical jargon, and internet slang. Your vocabulary explodes, and you still can’t handle a typo.

The insight that makes modern tokenizers work is this: the right unit is between a character and a word — a sub-word chunk, discovered automatically from data rather than hand-designed. “Un” + “happiness.” “Ing” as a single piece. “The” compressed into one slot. Chunks that earn their place by showing up often enough to justify it.

How a vocabulary gets built

The algorithm behind this is called Byte Pair Encoding, and it’s more intuitive than it sounds.

You start with the smallest possible vocabulary: every individual byte (256 of them). Then you scan a massive training corpus and ask one question over and over — which two adjacent tokens appear together most frequently?

You merge that pair into a new single token. Add it to the vocabulary. Repeat.

Early on, “t” and “h” merge into “th.” Then “th” and “e” merge into “the.” Common suffixes like “ing” and “tion” earn their own slots. You keep merging until you hit a target vocabulary size — say, 50,000 or 100,000 tokens.

BPE Algorithm:
1. Start with base vocabulary: all 256 individual bytes
2. Find the most frequent adjacent pair in the corpus
3. Merge that pair → one new token added to vocabulary
4. Repeat until target vocabulary size is reached
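
The four steps above can be sketched in a few lines of Python. This is a toy version under stated assumptions: the function name train_bpe is my own, the corpus is a short byte string rather than a massive dataset, and it skips the pre-splitting and performance tricks that production tokenizers typically add.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to `num_merges` merge rules from raw bytes (toy, unoptimised)."""
    tokens = [bytes([b]) for b in corpus]      # step 1: one token per byte
    merges = []                                # learned rules, in discovery order
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))   # step 2: count adjacent pairs
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break                              # nothing repeats any more
        merges.append((left, right))
        # step 3: rewrite the corpus, replacing the pair with one merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged                        # step 4: loop until the budget is spent
    return merges, tokens

merges, tokens = train_bpe(b"the the the then", num_merges=2)
print(merges)   # the learned merge rules
print(tokens)   # the corpus re-expressed with the merged tokens
```

On this tiny corpus the first two rules learned are “t” + “h” and then “th” + “e”, which is the same progression described earlier, just at miniature scale.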

What falls out of this process is elegant. Common words like “the” become single tokens. A rare word like “philanthropy” might split into “phil” + “anthropy” — each sub-word frequent enough on its own to have earned a slot. Nothing is ever dropped. If a word never earned its own token, the system falls back to smaller and smaller chunks, all the way down to individual bytes if necessary.

You can see this in action. Take the nonsense string “zqo” and run it through OpenAI’s cl100k_base tokenizer. It produces three token IDs: 89, 80, 78 — one per character. “zqo” never appeared frequently enough in the training corpus to earn a merge, so the tokenizer falls back to individual byte-level tokens. And those IDs — 89, 80, 78 — aren’t derived from any clean formula. They’re arbitrary lookup keys assigned during vocabulary construction, pointing to entries in an embedding table. The number itself carries no meaning; only the embedding it maps to does. Token IDs are artefacts of implementation order, not semantics.

One distinction tripped me up at first, and it’s worth being explicit about. There are two separate phases here. Phase one is vocabulary building — this happens once, during tokenizer training, and the vocabulary gets frozen. Phase two is encoding — this happens every time you send text to the model. Your text gets broken into tokens from the existing frozen vocabulary. The vocabulary doesn’t grow or shrink based on what you type later. It was decided before you arrived.
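
To make phase two concrete, here is a minimal sketch of encoding against a frozen rule list. The two merge rules are hand-written and hypothetical, and replaying rules in training order is one simple way to encode, not necessarily how any particular library does it.

```python
def encode(text, merges):
    """Apply frozen merge rules, in learned order, to new text."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]   # start from raw bytes
    for left, right in merges:                 # replay each rule; never learn new ones
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical frozen vocabulary: only "th" and "the" ever earned a merge.
FROZEN_MERGES = [(b"t", b"h"), (b"th", b"e")]

print(encode("the zqo", FROZEN_MERGES))
```

The word “the” collapses into one token, while “zqo” stays as three single-byte tokens because no rule covers it. Nothing you type at this stage changes FROZEN_MERGES.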

The fingerprint in the vocabulary

Here’s where it gets interesting. If the vocabulary is shaped by frequency in training data, then two companies training on different data will end up with different vocabularies — even at the same size.

A company that trains mostly on English text will merge English patterns efficiently. “The,” “ing,” “tion” — all single tokens. But Hindi script, Chinese characters, or Python keywords like def and import rarely appeared in their corpus, so those patterns never earned dedicated slots.

A company that trains on English, Hindi, code, and multilingual data? Their vocabulary reflects that diversity. def is one token. Common Hindi words get their own slots.

The vocabulary is a fingerprint of the training data. It tells you what the model was optimised for — and, just as importantly, what it wasn’t.

This means the same input text can produce wildly different token counts across models. A line of Python tokenised by one model might take 8 tokens. The same line through another model’s tokenizer might take 14. Same text, different cost, because tokenizer efficiency varies by domain and language.

And the vocabulary size itself? That’s a hyperparameter — a design choice engineers make by balancing two competing forces. A larger vocabulary gives better compression (fewer tokens for the same text) but requires a larger embedding table, which means a bigger model and more memory. A smaller vocabulary keeps the model lean but needs more tokens to represent the same content. Teams find their sweet spot empirically by measuring compression ratio on a representative evaluation corpus. And “representative” is doing enormous work in that sentence — a tokenizer benchmarked only on English can look optimal until it meets Telugu.
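
Both sides of that trade-off can be put into numbers. The sketch below is illustrative only: the hidden width of 4096 is an assumed value, not any specific model’s, and the two stand-in tokenizers (str.split and list) exist purely to show the measurement, not to resemble a real vocabulary.

```python
def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in an input embedding table: one d_model-wide row per token."""
    return vocab_size * d_model

def bytes_per_token(texts, encode) -> float:
    """Compression ratio: UTF-8 bytes of input per token produced (higher is better)."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_bytes / total_tokens

D_MODEL = 4096  # assumed hidden width, for illustration only
for vocab in (50_000, 100_000):
    print(vocab, embedding_params(vocab, D_MODEL))

sample = ["hello world"]
print(bytes_per_token(sample, str.split))  # coarse stand-in tokenizer
print(bytes_per_token(sample, list))       # one token per character
```

Doubling the vocabulary doubles the table, so the question a team has to answer is whether the gain in bytes per token on their evaluation corpus is worth those extra parameters.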

The double penalty

This is the part that should bother you.

A model like GPT-4, trained heavily on English, encountering Telugu or Tamil text has to fall back to byte-level fragments. Where an English sentence might take 25 tokens, the equivalent Telugu sentence might take 80 or more — each one still requiring a full pass through billions of parameters, consuming real GPU cycles and real electricity.

So you pay more. But you also get worse results, because the model saw little of that language during training. Higher cost and lower quality. A double penalty, hiding in the tokenizer.
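
You can see why byte-level fallback hurts some scripts more than others without any tokenizer at all. The sketch below counts UTF-8 bytes, which is the token count in the worst case where nothing merges. The function name is my own, and real tokenizers will usually do better than this upper bound.

```python
def worst_case_tokens(text: str) -> int:
    """Token count if no merge applies: one token per UTF-8 byte."""
    return len(text.encode("utf-8"))

english = "hello"
# "namaskaram" (a greeting) in Telugu script, written as escapes: 8 code points
telugu = "\u0c28\u0c2e\u0c38\u0c4d\u0c15\u0c3e\u0c30\u0c02"

print(len(english), worst_case_tokens(english))  # characters vs bytes
print(len(telugu), worst_case_tokens(telugu))    # characters vs bytes
```

Each Telugu code point takes three bytes in UTF-8, so a single greeting already costs 24 byte-level tokens where the English one costs 5.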

This isn’t hypothetical. It’s a real driver behind the emergence of language-specific models — like Sarvam AI building tokenizers trained on Indian-language corpora, where Telugu words earn efficient single tokens and the model has actually learned the language’s patterns.

Why the price tag isn’t what you think

Most developers know that LLM APIs charge per token. But there’s a subtlety in how cost actually works that’s easy to miss.

Two independent levers drive your bill. The first is token count, which is determined by tokenizer efficiency — how many tokens your text becomes. The second is per-token compute cost, which is determined by model size. A larger, more capable model does more computation per token because it has more parameters, which means more powerful hardware and higher price per token.

These two levers move independently. They don’t cancel each other out.

Here’s a concrete example. Model A tokenises your text into 100 tokens and charges $0.002 per 1K tokens. Model B has a better vocabulary — same text becomes only 60 tokens — but charges $0.01 per 1K tokens.

# Model A
cost_a = (100 / 1000) * 0.002  # = $0.0002

# Model B (better compression, higher per-token price)
cost_b = (60 / 1000) * 0.01    # = $0.0006

Model B compresses better but costs 3x more. Better vocabulary does not mean cheaper. The only reliable comparison is to measure actual token counts on your own text with each model’s tokenizer, then multiply by the published per-token price. Tools like tiktoken or the Tiktokenizer playground let you paste your text and see the exact token breakdown before spending a cent.
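
That measure-then-multiply routine generalises to any tokenizer you can call. In this sketch the helper name estimate_cost and both prices are hypothetical, and str.split and list are crude stand-ins for a real encode function.

```python
def estimate_cost(text, encode, usd_per_1k_tokens):
    """Return (token_count, cost_in_usd) for one tokenizer and one price."""
    n_tokens = len(encode(text))
    return n_tokens, n_tokens / 1000 * usd_per_1k_tokens

text = "def add(a, b): return a + b"

# Stand-ins only. With tiktoken installed, a real encoder such as
# tiktoken.get_encoding("cl100k_base").encode could be passed instead.
candidates = {
    "coarse tokenizer, pricey model": (str.split, 0.01),
    "fine tokenizer, cheap model": (list, 0.002),
}

for name, (encode, price) in candidates.items():
    n, cost = estimate_cost(text, encode, price)
    print(f"{name}: {n} tokens, ${cost:.6f}")
```

Passing the tokenizer in as a function keeps the comparison honest: the same text goes through every candidate, and only the count and the price differ.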

Images play by the same rules

If you’re working with multimodal models, the pricing logic extends cleanly — but the numbers get big fast.

Images can’t be tokenised with BPE. Adjacent pixel values don’t repeat with the kind of frequency that text patterns do. Instead, the image gets divided into a fixed grid of patches, say 16×16 pixels each. A 512×512 image gives you exactly 1,024 such patches (a 32×32 grid). Each patch is then flattened into a vector and projected through a vision encoder into an embedding: one visual token per patch, conceptually parallel to a text token’s embedding but produced through a completely different mechanism.

A sentence describing the same image might be 25 text tokens. Same machine, roughly 40x more pieces. Some providers like Anthropic and Google charge image tokens at the same per-token rate as text. Others like OpenAI charge a premium for image input tokens. But regardless of the per-token rate, the dominant cost driver is the same: volume. An image simply produces far more tokens than equivalent text.
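
The patch arithmetic is simple enough to check yourself. This sketch assumes a plain fixed grid with partial patches padded up to a full one. Real providers resize and tile images in their own ways, so treat the numbers as an intuition pump rather than a billing formula.

```python
def patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Visual tokens for a plain fixed-grid scheme (partial patches count as full)."""
    cols = -(-width // patch)    # ceiling division
    rows = -(-height // patch)
    return cols * rows

image_tokens = patch_tokens(512, 512)
caption_tokens = 25              # the hypothetical text description from above

print(image_tokens)                      # 32 columns x 32 rows
print(image_tokens / caption_tokens)     # how many times more pieces than the caption
print(patch_tokens(3840, 2160))          # a full 4K frame under the same scheme
```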

This means the same cost-reduction intuitions apply. Crop to the relevant region instead of sending a full 4K image. Downsample when fine detail isn’t required. For text, compress chat history with a sliding window or summarisation instead of resending the full conversation every call.
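
For the chat-history case, a sliding window can be as small as the sketch below. The helper name trim_history is my own, messages are plain strings, and whitespace word counts stand in for a real tokenizer’s counts.

```python
def trim_history(messages, count_tokens, budget):
    """Keep the most recent messages whose combined token count fits `budget`."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                        # the next-oldest message would overflow
        kept.append(msg)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept

def count_words(message):
    """Stand-in token counter: whitespace-separated words."""
    return len(message.split())

history = ["a b c", "d e", "f g h i", "j"]
print(trim_history(history, count_words, budget=5))
```

Older messages fall off the front as the conversation grows. Summarisation would replace the dropped messages with a short digest instead of discarding them outright.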

The door that’s open

Here’s what sits with me after working through all of this. The tokenizer is one of the earliest design decisions in a model’s lifecycle, made before training even begins, and its effects compound through everything downstream — model quality, inference cost, language equity, multimodal pricing. It’s not plumbing. It’s architecture.

And yet most of us treat it as invisible infrastructure. We compare models by benchmark scores and per-token pricing without ever asking: how many tokens does my actual workload produce through this specific tokenizer?

So here’s a question I’m still turning over: if tokenizers are frozen artefacts tied to specific model versions, and providers can silently swap the model behind an API endpoint, what happens to the cost estimates you benchmarked last quarter? And how would you design an application that detects when the ground has shifted beneath it?
