Hey there, welcome to the ninth issue of The Main Thread.
Let me start with a riddle that broke the internet: "How many r's are in the word strawberry?"
The answer is obviously three. A five-year-old can count them. Yet GPT-4, Claude, and virtually every frontier LLM get this wrong or struggle with it.
Why? The answer isn't about model size, training data, or architecture. It's about something that happens before the model ever sees your text: tokenization.
The Invisible Step That Changes Everything
When we type "strawberry" into ChatGPT, the model doesn't see the letters s-t-r-a-w-b-e-r-r-y. Instead, the tokenizer might split it into fragments like ["straw", "berry"] or even treat it as a single atomic token ["strawberry"].
If "strawberry" becomes one token, the model never sees the individual letters. It must somehow memorize that this specific token contains three r's with no compositional reasoning, no letter-by-letter analysis. Just pure memorization of an arbitrary fact about token ID #47382.
This is why counting letters is hard for LLMs but semantic reasoning is easy. The representation determines what's learnable.
The insight: Tokenization isn't preprocessing. It's representation design. And representation determines what models can and cannot learn.
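You can see this for yourself in a few lines. Here's a minimal sketch using the open-source tiktoken library (assuming it's installed); the exact split varies by encoding, but the point stands: the model sees fragments, plain Python sees characters.

```python
# Minimal sketch: what the model "sees" vs. what we see.
# Requires the tiktoken package; the split depends on the encoding you load.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

print(pieces)           # a handful of subword fragments, not letters
print(word.count("r"))  # 3 -- trivial once the characters are actually visible
```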
Why Should We Care? Three Reasons
1. It Controls Our Costs
API pricing isn't per word - it's per token. When we pay OpenAI or Anthropic, we are buying tokens, not words. The sentence "Hello, how are you?" might be 6 tokens in English but 11 tokens in its Swahili translation. Same semantic content, 1.83x the cost.
If we are building a product that processes millions of requests, tokenizer efficiency directly impacts our cloud bill. A tokenizer that produces 40 tokens per sentence instead of 20 doesn't just double the per-token charge: self-attention compute roughly quadruples, because transformer attention is O(n²).
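To make this concrete, here's a rough back-of-the-envelope sketch. The per-token price is a made-up placeholder (check your provider's actual rates), the Swahili sentence is an approximate translation used purely for illustration, and the counts will vary by tokenizer.

```python
# Rough cost sketch. PRICE_PER_1K_TOKENS is a hypothetical placeholder rate.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
PRICE_PER_1K_TOKENS = 0.01  # placeholder USD rate, not a real price list

def estimate_cost(text: str) -> tuple[int, float]:
    n_tokens = len(enc.encode(text))
    return n_tokens, n_tokens / 1000 * PRICE_PER_1K_TOKENS

for label, text in [("English", "Hello, how are you?"),
                    ("Swahili", "Hujambo, habari yako?")]:
    n, cost = estimate_cost(text)
    print(f"{label}: {n} tokens, ~${cost:.6f} per request")
```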
2. It Determines Context Window Limits
Claude has a 200K token context window. GPT-4 Turbo has 128K. But how many words is that? It depends entirely on our tokenizer.
For English Wikipedia text with a well-tuned tokenizer, we get roughly 1.3 tokens per word (about 4-5 characters per token, punctuation and spaces included). So 128K tokens ≈ 90K-100K words ≈ 200-250 pages of text.
But for morphologically rich languages (Turkish, Finnish) or scripts underrepresented in training data (Yoruba, Igbo), the same 128K tokens might only give us 40K-50K words, roughly half the effective context window.
Our tokenizer determines how much information fits in context, which determines what tasks are possible.
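A quick back-of-the-envelope sketch of that arithmetic; the tokens-per-word ratios below are illustrative assumptions, not measurements:

```python
# How many words fit in a 128K-token budget at different tokens-per-word ratios.
# The ratios are rough, illustrative assumptions.
CONTEXT_TOKENS = 128_000

tokens_per_word = {
    "English (well-covered)": 1.3,
    "Morphologically rich language (rough guess)": 2.5,
    "Underrepresented script (rough guess)": 3.0,
}

for label, ratio in tokens_per_word.items():
    print(f"{label}: ~{CONTEXT_TOKENS / ratio:,.0f} words in {CONTEXT_TOKENS:,} tokens")
```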
3. It Encodes Bias Before Training Begins
Here's the uncomfortable truth: tokenization is where representational bias enters our model.
If our tokenizer is trained on 60% English, 20% European languages, and 20% everything else, it will allocate vocabulary proportionally. English gets efficient single-token representations for common words. Underrepresented languages fragment excessively.
A user writing in African American Vernacular English sees their text fragment 2x more than Standard American English. A user writing in Swahili pays 1.8x more for API usage. A user writing in a low-resource language hits context limits 3x faster.
This isn't a bug. It's a direct consequence of training tokenizers on biased corpora.
The Compression Lens: Why Does Tokenization Exist at All?

[Image: Tokenization Pipeline]
Let's zoom out. Why do we tokenize in the first place?
Neural networks need fixed-size inputs. Text is infinite and variable-length. We need a bridge between the messy infinity of human language and the structured finiteness of tensors.
We could treat every character as a token. "Hello" becomes ['h', 'e', 'l', 'l', 'o'] → 5 tokens. But "machine learning" becomes 16 tokens (including the space). Context windows fill up fast, attention becomes quadratically expensive, and training slows to a crawl.
We could treat every word as a token. "Hello" becomes ['Hello'] → 1 token. Efficient! But what happens when the model encounters a word it's never seen before? "Supercalifragilisticexpialidocious" wasn't in the training vocabulary. The tokenizer outputs [UNK] (unknown), and the model loses all information.
The solution: subword tokenization.
Break words into fragments that appear frequently across the corpus. "machine learning" might become ["machine", "learning"] (2 tokens), while "supercalifragilisticexpialidocious" becomes ["super", "cal", "ifrag", "ilistic", "expi", "ali", "docious"] (7 tokens).
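Here's a small side-by-side sketch of the three granularities, using tiktoken's cl100k_base as one example of a subword tokenizer (your splits may differ):

```python
# Character-level vs. word-level vs. subword, side by side.
import tiktoken

text = "machine learning"

char_tokens = list(text)    # character-level: 16 tokens, including the space
word_tokens = text.split()  # word-level: 2 tokens, but unseen words become [UNK]

enc = tiktoken.get_encoding("cl100k_base")
subword_tokens = [enc.decode([t]) for t in enc.encode(text)]  # subword: few tokens, no [UNK]

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
```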

[Image: Sequence Length Comparison]
Common words stay atomic. Rare words decompose into familiar fragments. Nothing ever becomes [UNK].
This is lossy compression. We are encoding the infinite space of human language into a finite vocabulary (typically 30K-50K tokens) by exploiting statistical regularities.
Languages follow Zipf's law: a small number of words (like "the", "is", "of") appear extremely frequently, while the long tail of rare words (like "xylem", "quokka", "eigenvalue") appears only rarely. Tokenization exploits this non-uniform distribution - short codes for frequent patterns, longer codes for rare patterns.
Sound familiar? That's exactly how Huffman coding works. And it's information-theoretically optimal (Shannon, 1948) if our training distribution matches our deployment distribution.
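You can see the skew in any text you have lying around. A quick sketch, where corpus.txt is a placeholder for whatever plain-text file you point it at:

```python
# Zipf in practice: a handful of word types usually covers a surprising
# share of the corpus. "corpus.txt" is a placeholder for any plain-text file.
from collections import Counter

with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

counts = Counter(words)
total = sum(counts.values())
top10 = counts.most_common(10)

coverage = sum(c for _, c in top10) / total
print(f"Top 10 word types cover {coverage:.1%} of all word occurrences")
print(top10)
```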
When Compression Becomes Lossy
Here's the catch: compression is adaptive to training data.
If we train our tokenizer on English Wikipedia, it learns excellent representations for words like "algorithm", "neural", "gradient" - these become single tokens because they appear thousands of times.
But domain-specific jargon fragments badly:
Medical: "methylprednisolone" → ["meth", "yl", "pred", "nis", "olone"] (5 tokens)
Legal: "indemnification" → ["in", "dem", "ni", "fication"] (4 tokens)
Code: camelCaseVariable → ["camel", "Case", "Variable"] (3 tokens, if we're lucky)
If our deployment distribution (medical records, legal contracts, source code) diverges from our training distribution (Wikipedia, Common Crawl), compression becomes very lossy. Semantic units fragment into meaningless pieces.
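You can check this in a couple of lines. The splits below come from whichever tokenizer you load (cl100k_base here), so they may not match the examples above exactly:

```python
# Sketch: how a general-purpose tokenizer fragments domain-specific terms.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for term in ["methylprednisolone", "indemnification", "camelCaseVariable"]:
    pieces = [enc.decode([t]) for t in enc.encode(term)]
    print(f"{term}: {len(pieces)} tokens -> {pieces}")
```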

[Image: Tokenization Impact]
This is why companies like Bloomberg and Salesforce train custom tokenizers. A tokenizer trained on financial news learns that "amortization", "collateralized", "derivatives" are high-value tokens worth atomic representation. The performance gains are substantial: a 15-30% reduction in sequence length, with corresponding improvements in latency and cost.
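If you want to experiment, the Hugging Face `tokenizers` library makes training your own BPE tokenizer a few lines. A minimal sketch, where `finance_corpus` is a tiny placeholder for your own (much larger) stream of in-domain text:

```python
# Minimal sketch: training a domain-specific BPE tokenizer with the
# Hugging Face `tokenizers` library. finance_corpus is a tiny placeholder;
# in practice you would stream millions of in-domain lines.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

finance_corpus = [
    "The collateralized debt obligation's amortization schedule was revised.",
    "Derivatives exposure is hedged with interest-rate swaps.",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]", "[PAD]"])

tokenizer.train_from_iterator(finance_corpus, trainer)
print(tokenizer.encode("amortization of collateralized derivatives").tokens)
```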
The Three Questions We Should Ask About Any Tokenizer
Before we choose a pretrained tokenizer or train our own, ask:
1. Does this tokenizer fragment my domain-specific terminology?
Let’s take 1,000 sentences from our target domain → tokenize them → count the average tokens per sentence. If we are seeing 40+ tokens per sentence when GPT-2's tokenizer produces 20-25 on Wikipedia, our domain is being fragmented. Consider a custom tokenizer.
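A minimal audit sketch, using GPT-2's tokenizer via tiktoken; `domain_sentences.txt` is a placeholder for your own sample:

```python
# Question 1 audit: average tokens per sentence on an in-domain sample.
# "domain_sentences.txt" is a placeholder for ~1,000 sentences from your domain.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

with open("domain_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

avg_tokens = sum(len(enc.encode(s)) for s in sentences) / len(sentences)
print(f"Average tokens per sentence: {avg_tokens:.1f}")
# Rule of thumb from above: Wikipedia-style English lands around 20-25;
# numbers far above that suggest your domain is fragmenting.
```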
2. If I'm serving multiple languages, what's the cost multiplier?
Tokenize the same semantic content in different languages. Measure tokens per sentence. If we see 3x disparity (e.g., 15 tokens for English, 45 for Swahili), we have a fairness problem. Either oversample underrepresented languages during tokenizer training, or accept that we are charging different users different prices for the same value.
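A sketch of that measurement; the parallel sentences below are rough, illustrative translations, and the ratios you get will depend on the tokenizer:

```python
# Question 2 audit: token-count disparity across languages for (roughly)
# the same content. Translations here are illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

parallel = {
    "English": "The weather is beautiful today and the children are playing outside.",
    "Swahili": "Hali ya hewa ni nzuri leo na watoto wanacheza nje.",
}

counts = {lang: len(enc.encode(text)) for lang, text in parallel.items()}
baseline = counts["English"]
for lang, n in counts.items():
    print(f"{lang}: {n} tokens ({n / baseline:.2f}x English)")
```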
3. Am I optimizing for the right task?
If we need character-level reasoning (spelling, anagrams, counting letters), subword tokenization is the wrong choice. Use byte-level or character-level tokenization.
If we need semantic reasoning (translation, summarization, question answering), subword tokenization is optimal. But we can't have both with a single tokenizer.
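For the character-level end of that tradeoff, a tiny sketch: at the byte level every character is visible, so the strawberry question becomes trivial, at the cost of much longer sequences.

```python
# Byte-level view: every character is its own token, so counting letters
# is trivial -- but sequences get much longer.
text = "strawberry"

byte_tokens = list(text.encode("utf-8"))  # 10 tokens for a 10-character word
print(byte_tokens)
print(sum(1 for b in byte_tokens if b == ord("r")))  # 3
```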
Representation determines learnability. Choose the representation based on our task.
What's Next?
This is just the beginning. In the next issues, we will cover:
Issue #2: The algorithms that power every LLM → BPE (GPT), WordPiece (BERT), Unigram (T5) and how they differ in surprising ways.
Issue #3: The multilingual fairness crisis → why Swahili speakers pay 1.8x more than English speakers, and how to audit our tokenizer for bias.
Issue #4: Should we build a custom tokenizer? A production decision framework with metrics, tradeoffs, and ROI calculations.
If you want to go deeper right now, I have written a comprehensive 3-part blog series:
Why Tokenization Matters - First principles and the compression lens
Algorithms from BPE to Unigram - Complete implementations with code
Building & Auditing Tokenizers - Production metrics, fairness, and lessons learned
I have also open-sourced complete implementations of BPE, WordPiece, and Unigram LM from scratch: ai-engineering/tokenization.
Conclusion
Every time you see an LLM fail at a task, ask yourself:
Is this a modeling failure, or a tokenization failure?
Is the model too small, or did the tokenizer fragment critical information into pieces the model can't reassemble?
Is the model biased, or did the tokenizer allocate vocabulary unfairly?
Is the model slow, or did the tokenizer produce unnecessarily long sequences?
Tokenization isn't everything. But it's the first step, and first steps shape everything that follows.
Before you scale your model from 7B to 70B parameters, before you tune learning rates and batch sizes, before you add retrieval or chain-of-thought—audit your tokenizer. It might be the bottleneck.
Until next time, Namaste!
Anirudh
P.S. Hit reply and tell me: have you ever debugged a model failure that turned out to be a tokenization issue? I'd love to hear your war stories.
P.P.S. If you found this valuable, forward it to a colleague building with LLMs. Tokenization is the invisible infrastructure—everyone uses it, few understand it.
