Welcome back, fellow engineers, to the latest issue of The Main Thread.

We have come a long way together.

Issue #1: We saw why tokenization matters: how it controls costs, shapes what models can learn, and determines context window limits.

Issue #2: We went inside the algorithms: BPE, WordPiece, Unigram LM—and understood how machines learn language fragments.

Issue #3: We confronted the fairness crisis: why Swahili speakers pay 1.8x more and what to do about it.

Now, in this final issue, let's close the loop with the question every practitioner eventually faces:

"Should I build a custom tokenizer for my use case?" The answer isn't "yes" or "no." It's "here's how to decide."

The First Bridge from Language to Thought

The Decision Tree: Five Questions

Question 1: Is Your Domain Radically Different From Internet Text?

The tokenizers used by GPT-2, BERT, and T5 were trained on:

  • Wikipedia (clean, encyclopedic)

  • Common Crawl (web pages, diverse but noisy)

  • Books (formal, literary)

  • News articles (journalistic, structured)

If your deployment domain looks like this, use a pretrained tokenizer. You will get 90% of optimal performance with zero training cost.

But if your domain is:

  • Medical records: "methylprednisolone", "cholecystectomy", "electroencephalography".

  • Legal contracts: "indemnification", "subrogation", "promissory estoppel".

  • Source code: camelCaseVariables, snake_case_functions, @decorators.

  • Financial news: "amortization", "collateralized debt obligations", "LIBOR".

  • Low-resource languages: Yoruba, Igbo, Swahili, with fragmentation as high as 3x.

Then you need to measure fragmentation before deciding.

Question 2: How Bad Is the Fragmentation?

Run this experiment:

import numpy as np
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Sample 1,000 sentences from your target domain
# (load_your_domain_data is a placeholder for your own data loader)
domain_sentences = load_your_domain_data(n=1000)

# Measure tokens per sentence
token_counts = [len(tokenizer.encode(s)) for s in domain_sentences]
avg_tokens = np.mean(token_counts)

# Baseline: Wikipedia English averages 20-25 tokens/sentence
baseline = 22.5

fragmentation_ratio = avg_tokens / baseline
print(f"Fragmentation: {fragmentation_ratio:.2f}x")

Decision thresholds:

  • <1.3x: Your domain is well-represented. Use pretrained tokenizer.

  • 1.3-1.8x: Moderate fragmentation. Consider custom tokenizer if latency/cost is critical.

  • >1.8x: Severe fragmentation. Custom tokenizer will pay for itself quickly.

Real-world example:

Bloomberg trained BloombergGPT with a custom tokenizer on financial news. They measured:

  • GPT-2 tokenizer: 31.2 tokens/sentence (1.39x baseline)

  • Custom tokenizer: 22.8 tokens/sentence (1.01x baseline)

Result:

27% reduction in sequence length → 27% less memory, 27% faster inference, and roughly 47% less attention compute (attention cost scales quadratically with sequence length, and 0.73² ≈ 0.53).

At Bloomberg's scale (billions of queries), this saves millions in infrastructure costs annually.

Question 3: Do You Have Enough Training Data?

Tokenizers need data to learn good vocabularies. The rule of thumb:

Vocabulary Size | Minimum Training Data    | Recommended
10K tokens      | 10M tokens (~40MB text)  | 50M tokens
30K tokens      | 50M tokens (~200MB)      | 200M tokens
50K tokens      | 100M tokens (~400MB)     | 500M tokens
100K tokens     | 1B tokens (~4GB)         | 5B tokens

Why? Each token needs to appear at least 100-1000 times during training to learn a meaningful embedding later.

If you have 50K vocabulary and 10M training tokens, each token appears on average 200 times - barely enough. Many rare tokens will appear 10-50 times, leading to sparse, poorly-learned embeddings.

Decision:

  • You have the data: Train custom tokenizer.

  • You don't have the data: Use pretrained tokenizer, fine-tune the model.

Question 4: What's the Memory Cost?

Every token in your vocabulary needs an embedding vector, so the memory cost is vocabulary size × embedding dimension × bytes per parameter.

Examples:

30K vocab × 768 dims × 4 bytes = 92 MB
50K vocab × 2048 dims × 4 bytes = 410 MB
100K vocab × 4096 dims × 4 bytes = 1.64 GB

This is just the embedding table. You still need weights for attention, FFN layers, etc.
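To sanity-check these numbers for your own configuration, a few lines of arithmetic suffice. The helper below is a throwaway sketch; the vocab/dimension pairs mirror the examples above, and switching to fp16 halves the 4-byte figure:

def embedding_table_bytes(vocab_size, embed_dim, bytes_per_param=4):
    # Memory for the input embedding table alone (ignores attention, FFN, tied output head)
    return vocab_size * embed_dim * bytes_per_param

for vocab, dim in [(30_000, 768), (50_000, 2048), (100_000, 4096)]:
    gb = embedding_table_bytes(vocab, dim) / 1e9
    print(f"{vocab:,} vocab x {dim} dims -> {gb:.2f} GB (fp32)")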

Decision framework:

If you're deploying on:

  • Cloud (A100 80GB): 100K vocab is fine.

  • Consumer GPU (RTX 3090 24GB): 50K vocab max.

  • Mobile/Edge (4GB RAM): 10K-30K vocab, optimize aggressively.

Larger vocabularies also mean:

  • Slower vocabulary lookups during tokenization.

  • Larger model files (harder to distribute).

  • More training data needed (as discussed above).

Sweet spot for most use cases: 30K-50K tokens.

Question 5: What's the ROI?

Building a custom tokenizer costs:

  • Engineering time: 1-2 weeks (data prep, training, testing, integration)

  • Compute: $100-$500 (training tokenizer + retraining model from scratch or continued pretraining)

  • Validation: 1-2 weeks (measuring downstream task performance)

Total cost: ~$5K-$15K in engineering + compute.

Now calculate savings:

# Example: Medical AI startup
queries_per_month = 10_000_000
avg_tokens_pretrained = 35  # fragments medical terms badly
avg_tokens_custom = 22      # optimized for medical vocabulary

# API costs (example: $0.01 per 1K tokens)
cost_pretrained = (queries_per_month * avg_tokens_pretrained / 1000) * 0.01
cost_custom = (queries_per_month * avg_tokens_custom / 1000) * 0.01

monthly_savings = cost_pretrained - cost_custom
annual_savings = monthly_savings * 12

print(f"Monthly savings: ${monthly_savings:,.2f}")
print(f"Annual savings: ${annual_savings:,.2f}")
print(f"ROI timeline: {15000 / monthly_savings:.1f} months")

# Output:
# Monthly savings: $1,300
# Annual savings: $15,600
# ROI timeline: 11.5 months

If your ROI is <6 months, build the custom tokenizer.

If it's >18 months, use pretrained.

Between 6 and 18 months? It depends on your strategic priorities (cost vs. time-to-market).

When Custom Tokenizers Win: Three Case Studies

Case Study 1: Code Generation (StarCoder)

Problem: GPT-2's tokenizer fragments code badly:

  • camelCaseVariable → ["cam", "el", "Case", "Var", "iable"] → (5 tokens).

  • def __init__(self): → ["def", " __", "init", "__(", "self", "):"] → (6 tokens).

  • Indentation spaces become individual tokens.

Solution: StarCoder trained a custom tokenizer on The Stack (6TB of code in 358+ languages).

Results:

  • 30% reduction in sequence length for Python

  • 40% reduction for less common languages (Haskell, Julia)

  • Significantly better performance on code completion benchmarks

Key insight: Code has different statistical patterns than natural language. Operators (::, ->, ==), naming conventions (camelCase, snake_case), and syntactic markers ({, }, []) should be atomic tokens.
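You can reproduce this kind of fragmentation yourself. The snippet below is a minimal check with the GPT-2 tokenizer from transformers; it simply prints whatever splits your installed version produces for a few code-flavored strings:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# A few code-flavored strings; see how GPT-2's byte-level BPE splits each one
for snippet in ["camelCaseVariable", "def __init__(self):", "    return x"]:
    tokens = tokenizer.tokenize(snippet)
    print(f"{snippet!r}: {len(tokens)} tokens -> {tokens}")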

Case Study 2: Multilingual Models (mT5)

Problem: T5's English-centric tokenizer charged Swahili users 1.8x more.

Solution: mT5 used Unigram LM with SentencePiece, oversampling underrepresented languages during tokenizer training.

Results:

  • Swahili fragmentation reduced from 1.8x to 1.2x

  • 101 languages supported with more equitable vocabulary allocation

  • Better cross-lingual transfer learning

Key insight: When you serve multiple languages, train your tokenizer on a deliberately curated multilingual corpus with oversampling, not proportional sampling.
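In practice, "oversampling with a deliberately curated corpus" usually means temperature-based sampling: each language is sampled with probability proportional to its corpus size raised to a power α < 1, which boosts small languages relative to their raw share (the mT5 paper reports using α = 0.3). A minimal sketch with made-up corpus sizes:

# p(lang) ∝ size^alpha; alpha < 1 flattens the distribution toward small languages.
# The token counts below are invented, purely for illustration.
corpus_sizes = {"en": 3_000_000_000, "sw": 50_000_000, "yo": 10_000_000}
alpha = 0.3

weights = {lang: size ** alpha for lang, size in corpus_sizes.items()}
total = sum(weights.values())

for lang, w in weights.items():
    raw_share = corpus_sizes[lang] / sum(corpus_sizes.values())
    print(f"{lang}: raw share {raw_share:.1%} -> sampling prob {w / total:.1%}")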

Case Study 3: Biomedical NLP (BioGPT)

Problem: Medical terminology fragments catastrophically:

  • "electroencephalography" → 8 tokens

  • "methylprednisolone" → 7 tokens

  • "cholecystectomy" → 6 tokens

These are common terms in medical text but rare in general corpora.

Solution: BioGPT trained a custom tokenizer on PubMed abstracts (15M documents, ~4.5B tokens).

Results:

  • "electroencephalography" → 3 tokens

  • "methylprednisolone" → 2 tokens

  • Medical term OOV rate dropped from 68% to 12%

  • 15-20% improvement on biomedical NLU benchmarks

Key insight: Domain-specific terminology should be atomic or near-atomic. If critical terms fragment into 5+ tokens, you're losing semantic coherence.
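If you do decide a domain tokenizer is worth it, the Hugging Face tokenizers library makes training one a few lines of code. The sketch below trains a byte-level BPE vocabulary from a plain-text corpus; the file name and vocabulary size are placeholders, and this is a generic recipe rather than BioGPT's exact pipeline:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, the same family of algorithm GPT-2 uses
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # sweet spot from Question 4
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
)

# domain_corpus.txt is a placeholder: your domain text, one document per line
tokenizer.train(files=["domain_corpus.txt"], trainer=trainer)
tokenizer.save("domain-tokenizer.json")

print(tokenizer.encode("methylprednisolone").tokens)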

The Tokenization-Embedding Connection

Here's something most people miss: tokenization and embeddings are inseparable.

When you train a model, the embedding layer learns one vector per token ID: Token ID 4821 → [0.21, -0.45, 0.82, ..., 0.13]

If you change your tokenizer, you invalidate all your embeddings. You must:

  1. Retrain from scratch (expensive, time-consuming)

  2. Initialize new embeddings from old (complex, lossy)

  3. Continue pretraining (middle ground, most common; see the sketch after this list)
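A lighter-weight variant of option 3, common in practice, is to extend a pretrained tokenizer with a handful of domain tokens rather than replace it outright, then continue pretraining so the freshly initialized embedding rows become meaningful. A minimal sketch with the transformers API (the base model and the added terms are just examples):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain terms you want to be (near-)atomic
num_added = tokenizer.add_tokens(["methylprednisolone", "cholecystectomy"])

# Appends randomly initialized rows to the embedding matrix; they only become
# useful after continued pretraining / fine-tuning on domain text.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")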

This is why tokenizer changes are expensive—it's not just retraining the tokenizer, it's retraining the entire model's representation layer.

The deeper lesson: Tokenization determines what the model can learn. Each token ID is a discrete symbol. If "electroencephalography" is one token (ID 47291), the model learns a single holistic representation. If it's 8 tokens, the model must learn compositional reasoning across those 8 pieces.

Representation determines learnability. Tokenization is representation design.

This is why I titled this issue "The First Bridge from Language to Thought" - tokenization is where raw text becomes structured symbols, which then become continuous vectors (embeddings), which then become neural activations, which then become generated text.

Every bridge you cross shapes what's possible on the other side.

When NOT to Build a Custom Tokenizer

Let's be clear about when it's not worth it to create a custom tokenizer:

  • Your domain is similar to Wikipedia/web text → use GPT-2, BERT, or T5 tokenizers.

  • You have <100M tokens of training data → not enough to learn a good vocabulary.

  • You're prototyping and need to move fast → optimize the tokenizer later, ship the product first.

  • Your model is <1B parameters → small models are bottlenecked by capacity, not tokenization.

  • You're serving a single well-supported language (English, Chinese, Spanish) → pretrained tokenizers are already excellent.

  • Your ROI timeline is >18 months → engineering resources are better spent elsewhere.

The rule: Build custom tokenizers for specialized domains with clear ROI, not as a default practice.

The Final Checklist

Before you commit to building a custom tokenizer, answer these:

  • Have I measured fragmentation? (>1.8x = proceed)

  • Do I have 100M+ tokens of training data? (Yes = proceed)

  • Is my ROI timeline under 12 months? (Yes = proceed)

  • Am I willing to retrain my model? (Yes = proceed)

  • Can I monitor tokenization metrics in production? (Yes = proceed)

If all five answers are yes, build it. Otherwise, use pretrained tokenizers.

Closing Thoughts: Why This All Matters

We have spent four issues exploring tokenization—why it matters, how algorithms work, fairness implications, and production decisions.

Here's the meta-lesson that ties it all together:

AI systems are pipelines of transformations: Raw Text → Tokens → Embeddings → Activations → Predictions → Generated Text.

Each transformation shapes what's possible in the next stage. If your tokenizer fragments "electroencephalography" into 8 meaningless pieces, no amount of model capacity will let it reason about that term coherently. If your tokenizer charges Swahili speakers 1.8x more, no amount of post-training RLHF will fix the economic inequity.

Tokenization is where linguistic structure meets computational constraints. It's where human language becomes machine-readable. It's where bias enters before training begins.

And yet, it's invisible to most people building with AI. They tune learning rates, adjust batch sizes, scale model parameters—while never questioning whether their tokenizer is fragmenting critical information.

This series was written to change that.

Where to Go From Here?

If you want to implement these ideas:

If you want the full technical deep dive:

If you want to stay learning with me: Subscribe to this newsletter for more deep dives on AI systems, from first principles.

The Last Thing I'll Say

Every time you build with an LLM, you inherit its tokenizer. And with that tokenizer, you inherit its biases, its fragmentation patterns, its vocabulary allocation decisions.

You can't change the tokenizer of GPT-4 or Claude. But you can measure its behavior on your data. You can decide whether its tradeoffs align with your values. And if they don't, you can build your own.

Before you scale your model, before you optimize inference, before you tune prompts—understand your tokenizer. It's the first bridge your data crosses, and bridges determine what can reach the other side.

Thank you for learning with me through this series. Tokenization is invisible infrastructure, but now you see it.

Until next time, Namaste!

Anirudh

P.S. This concludes our 4-part tokenization series, but the conversation doesn't end here. Hit reply and tell me: what are you building? Have these insights changed how you think about your tokenization strategy?

P.P.S. If this series helped you, forward it to someone who's just getting started with LLMs. Tokenization is the foundation everyone uses but few understand—until now.

P.P.P.S. Next series coming soon: "Embeddings: From Tokens to Meaning"—where we explore what happens after tokenization. Stay tuned.
