Hey fellow engineers, welcome to the eleventh issue of The Main Thread.
Let me show you something that should make you uncomfortable.
Take this sentence in English: "Hello, how are you?" → 6 tokens
Now translate it to Swahili: "Hujambo, u hali gani?" → 11 tokens
Same semantic content. Same information. But the Swahili speaker just paid roughly 1.8x as much for API usage. They consumed 1.8x as much of their context window. And because attention is O(n²), their query burned roughly 3.2x the attention compute.
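If you want to reproduce this kind of comparison yourself, here's a minimal sketch using tiktoken with the cl100k_base encoding; exact counts vary by tokenizer and encoding, so treat the specific numbers as illustrative:

```python
import tiktoken

# Assumption: tiktoken is installed; counts differ across encodings/models.
enc = tiktoken.get_encoding("cl100k_base")

english = "Hello, how are you?"
swahili = "Hujambo, u hali gani?"

en_tokens = enc.encode(english)
sw_tokens = enc.encode(swahili)

print(f"English: {len(en_tokens)} tokens")
print(f"Swahili: {len(sw_tokens)} tokens")
print(f"Cost multiplier: {len(sw_tokens) / len(en_tokens):.2f}x")
```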
This isn't a bug. This isn't an edge case. This is a direct, measurable consequence of how we train tokenizers. And if you're building with LLMs, you need to understand it.
The Vocabulary Allocation Problem
Remember from the last issue: every tokenizer has a fixed vocabulary size—typically 30K-50K tokens. When we train a tokenizer, we are making tens of thousands of hard choices about which linguistic units deserve atomic representation.
This is a zero-sum game. Every token allocated to one pattern is unavailable for another.
Here's what happens when we train on typical internet corpora:
Training data composition (typical):
English: 60-70%
European languages (French, German, Spanish): 15-20%
Chinese, Japanese, Korean: 5-10%
Everything else: 5-10%
BPE and WordPiece allocate vocabulary proportionally to frequency. If English morphemes like "ing", "tion", "pre-" appear in 60% of your training data, they dominate your vocabulary. English words get efficient single-token representations.
Languages in that "everything else" bucket? They fragment excessively.

[Image: Multilingual Token Fairness]
The Real-World Impact: It's Not Just Academic
Let's make this concrete with actual research.
Researchers measured GPT-3's tokenizer across languages.
Here's what they found:
| Language | Tokens per Sentence | Cost Multiplier vs English |
|---|---|---|
| English | 15 | 1.0x (baseline) |
| French | 19 | 1.27x |
| Swahili | 27 | 1.8x |
| Yoruba | 35 | 2.3x |
| Igbo | 45 | 3.0x |
A user writing in Igbo pays 3x more than an English user for the same semantic content.
But it gets worse.
The Training Data Penalty
Suppose a model trains with a 128K-token budget per example (the long-context regime of recent models). On English text, that's roughly 90K-100K words of context: rich, dense information for the model to learn from.
On Igbo text with 3x fragmentation, that same 128K-token budget covers only 30K-40K words. The model sees about a third as much semantic content per training example.
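The arithmetic behind those numbers, as a quick sketch (the fertility figures are illustrative):

```python
# Effective words per training example under a fixed token budget.
# Fertility = tokens per word; Igbo here is ~3x the English value.
token_budget = 128_000
fertility = {"English": 1.3, "Igbo": 3.9}

for lang, tokens_per_word in fertility.items():
    words = token_budget / tokens_per_word
    print(f"{lang}: ~{words:,.0f} words per {token_budget:,}-token example")
```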
This compounds:
Underrepresented languages get less vocabulary allocation (more fragmentation)
More fragmentation means fewer words per token budget
Fewer words means the model sees less of that language during training
Less exposure means worse performance
Worse performance reinforces the perception that "the model isn't good at this language"
It's a vicious cycle, and it starts at tokenization.
Case Study: African American Vernacular English
The bias isn't just cross-lingual—it's dialectal.
A study on BERT's tokenizer (Stance Prediction for Contemporary Issues: Data and Experiments) found that African American Vernacular English (AAVE) fragments 2.1x more than Standardized American English (SAE).
Example sentences:
SAE: "I'm going to the store" → 6 tokens
AAVE: "I'm finna hit the store" → 8 tokens
Why? Because BERT was trained primarily on formal written English—news articles, Wikipedia, books. Dialectal variants like "finna" (fixing to/going to) don't appear frequently enough to earn single-token representation. They fragment into unfamiliar pieces.
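You can reproduce this kind of check with any WordPiece tokenizer. A sketch, assuming the transformers library and the bert-base-uncased checkpoint (exact splits and counts vary by checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for sentence in ["I'm going to the store", "I'm finna hit the store"]:
    pieces = tok.tokenize(sentence)
    print(f"{sentence!r}: {len(pieces)} pieces -> {pieces}")
```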
The consequence: representational harm.
A user writing in AAVE sees their language treated as "abnormal" by the system. Their text fragments more, signaling that their dialect is somehow "broken" or "non-standard." This isn't just inefficiency—it's a value judgment encoded in the vocabulary.
Measuring Fairness: Four Metrics That Matter
If you're deploying tokenizers in production, here's how to audit for bias.
1. Average Tokens Per Sentence (by Language/Dialect)
Take 1,000 sentences from each demographic group you serve. Tokenize them. Measure mean and standard deviation.
```python
import numpy as np

def audit_tokens_per_sentence(sentences, tokenizer):
    """Mean, spread, and tail of token counts over a sample of sentences."""
    token_counts = [len(tokenizer.encode(s)) for s in sentences]
    return {
        'mean': np.mean(token_counts),
        'std': np.std(token_counts),
        'p95': np.percentile(token_counts, 95),
    }

# Example output:
# English: mean=22.3, std=8.1, p95=35
# Swahili: mean=41.7, std=15.2, p95=68
# Multiplier: 1.87x
```
Red flag threshold: If any group shows >2x the baseline, we have a serious fairness problem.
2. Bytes Per Token (Compression Ratio)
Higher bytes/token = better compression = more information encoded per token.
```python
def audit_compression_ratio(sentences, tokenizer):
    total_bytes = sum(len(s.encode('utf-8')) for s in sentences)
    total_tokens = sum(len(tokenizer.encode(s)) for s in sentences)
    return total_bytes / total_tokens

# Example output:
# English: 5.8 bytes/token
# Arabic: 2.3 bytes/token
# Efficiency gap: 2.5x worse compression
```
If Arabic achieves 2.3 bytes/token while English achieves 5.8, Arabic users are paying 2.5x more tokens for the same information density.
3. Effective Out-of-Vocabulary Rate
What percentage of common words become single tokens vs multi-token fragments?
```python
def audit_oov_rate(word_list, tokenizer):
    """
    word_list: Most common 10K words in this language
    """
    single_token_words = [
        w for w in word_list
        if len(tokenizer.encode(w)) == 1
    ]
    return len(single_token_words) / len(word_list)

# Example output:
# English: 76% single-token words
# Hindi: 31% single-token words
# Disparity: 2.5x fragmentation
```
Target: 70-80% of common words should be single tokens. If we are below 50%, our vocabulary is severely under-representing this language.
4. Fertility Rate (Tokens per Word)
Average number of tokens per word—directly measures fragmentation.
```python
def audit_fertility(sentences, tokenizer):
    # Note: whitespace word-splitting; only meaningful for space-delimited scripts.
    words = [w for s in sentences for w in s.split()]
    tokens = [tokenizer.encode(w) for w in words]
    return sum(len(t) for t in tokens) / len(words)

# Example output:
# English: 1.2 tokens/word
# Finnish: 2.8 tokens/word
# Turkish: 3.1 tokens/word
```
Morphologically rich languages will always score higher (that's linguistic reality), but if we see 3x+ disparities, we need to investigate.
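To pull the four metrics together, here's a sketch of a combined audit that compares each language against an English baseline. It assumes the audit functions above plus per-language sentence samples and 10K-word lists that you supply:

```python
def audit_language(name, sentences, common_words, tokenizer, baseline=None):
    """Run all four audits for one language; report a cost multiplier
    against a baseline report (e.g., English) if one is provided."""
    report = {
        'tokens_per_sentence': audit_tokens_per_sentence(sentences, tokenizer)['mean'],
        'bytes_per_token': audit_compression_ratio(sentences, tokenizer),
        'single_token_rate': audit_oov_rate(common_words, tokenizer),
        'fertility': audit_fertility(sentences, tokenizer),
    }
    if baseline is not None:
        report['cost_multiplier'] = (
            report['tokens_per_sentence'] / baseline['tokens_per_sentence']
        )
    print(name, report)
    return report

# Usage sketch (sample data and word lists are yours to provide):
# en = audit_language('English', en_sentences, en_top10k, tokenizer)
# sw = audit_language('Swahili', sw_sentences, sw_top10k, tokenizer, baseline=en)
```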
Why This Happens: The Root Cause
The fundamental issue is that tokenizers learn from the statistical distribution of their training data.
If our training corpus is:
70% English
15% European languages
10% Chinese/Japanese/Korean
5% everything else (200+ languages)
Then BPE allocates vocabulary proportionally:
~35,000 tokens optimized for English patterns
~7,500 tokens for European languages
~5,000 tokens for CJK
~2,500 tokens for 200+ other languages
That's roughly 12 tokens per language for the bottom 200 languages. Nowhere near enough to capture morphology, common words, or grammatical patterns.
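Spelled out as a quick sanity check (using the rounded shares above):

```python
vocab_size = 50_000
composition = {
    "English": 0.70,
    "European languages": 0.15,
    "CJK": 0.10,
    "Everything else (200+ languages)": 0.05,
}

for group, share in composition.items():
    print(f"{group}: ~{int(vocab_size * share):,} tokens")

long_tail_languages = 200
print(f"Per long-tail language: ~{vocab_size * 0.05 / long_tail_languages:.1f} tokens")
```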
The tokenizer isn't "biased" in the sense of having malicious intent. It's optimally compressing its training distribution. The bias is in the training data composition.
Solutions: What Actually Works
1. Curate Training Data Deliberately
Don't just dump Common Crawl into your tokenizer. Sample deliberately:
Instead of proportional sampling:
```python
# Conceptual sketch: train_tokenizer() stands in for your tokenizer-training pipeline.
train_tokenizer(corpus)  # 70% English dominates
```
Use balanced sampling:
```python
# oversample_underrepresented_languages() is also conceptual; see the sketch below.
sampled_corpus = oversample_underrepresented_languages(
    corpus,
    min_samples_per_language=10_000_000,  # 10M tokens minimum
)
train_tokenizer(sampled_corpus)
```
This is what SentencePiece-based pipelines (T5, mT5) do: sample the training corpus with per-language weights so underrepresented languages get a fairer share of the vocabulary.
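Here's a minimal sketch of what that oversampling might look like, using the temperature-based sampling common in multilingual pipelines. The function names, file names, and parameters are illustrative; SentencePiece's trainer is the only real API referenced:

```python
import random

def temperature_sample(lines_by_language, temperature=0.3, total_lines=5_000_000):
    """Build a tokenizer-training corpus where each language's share is
    proportional to (its natural share) ** temperature; temperature < 1
    flattens the skew toward high-resource languages."""
    sizes = {lang: len(lines) for lang, lines in lines_by_language.items()}
    total = sum(sizes.values())
    weights = {lang: (n / total) ** temperature for lang, n in sizes.items()}
    norm = sum(weights.values())
    sampled = []
    for lang, lines in lines_by_language.items():
        k = int(total_lines * weights[lang] / norm)
        sampled.extend(random.choices(lines, k=k))  # sample with replacement
    random.shuffle(sampled)
    return sampled

# Then train SentencePiece on the balanced corpus (file names illustrative):
# import sentencepiece as spm
# with open("balanced_corpus.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(temperature_sample(lines_by_language)))
# spm.SentencePieceTrainer.train(
#     input="balanced_corpus.txt", model_prefix="balanced_bpe",
#     vocab_size=50_000, model_type="bpe", character_coverage=0.9995,
# )
```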
2. Inspect the Learned Vocabulary
After training, audit what actually got allocated:
```python
from collections import defaultdict

def audit_vocabulary_allocation(vocab, language_detector):
    allocations = defaultdict(int)
    for token in vocab:
        lang = language_detector.detect(token)
        allocations[lang] += 1
    return allocations

# Example output:
# English: 28,451 tokens (56.9%)
# French: 4,231 tokens (8.5%)
# Arabic: 892 tokens (1.8%)
# Swahili: 127 tokens (0.3%)
```
If critical languages are getting <1% of vocabulary, we need to retrain with oversampling.
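One practical caveat: per-token language detection is unreliable for short subword fragments. A more robust proxy is to group vocabulary entries by Unicode script. Here's a sketch; the Hugging Face usage in the trailing comment is an assumption about your stack:

```python
import unicodedata
from collections import Counter

def audit_vocab_by_script(vocab):
    """Group vocabulary entries by the Unicode script of their first letter,
    a rough proxy for which writing systems got vocabulary space."""
    script_counts = Counter()
    for token in vocab:
        letters = [c for c in token if c.isalpha()]
        if not letters:
            script_counts["symbols/other"] += 1
            continue
        # e.g. 'LATIN SMALL LETTER A' -> 'LATIN', 'DEVANAGARI LETTER KA' -> 'DEVANAGARI'
        script_counts[unicodedata.name(letters[0], "UNKNOWN").split(" ")[0]] += 1
    return script_counts

# Usage with a Hugging Face tokenizer (assumed):
# from transformers import AutoTokenizer
# vocab = AutoTokenizer.from_pretrained("bert-base-multilingual-cased").get_vocab()
# print(audit_vocab_by_script(vocab).most_common(10))
```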
3. Domain-Specific Tokenizers
If we are serving a specific region or language pair, train a specialized tokenizer:
```python
# Instead of using GPT-2's English-centric tokenizer for Hindi, train your own.
# (train_custom_tokenizer is a stand-in for your training pipeline; see the sketch below.)
tokenizer = train_custom_tokenizer(
    corpus=hindi_corpus,
    vocab_size=50_000,
    model_type='bpe',
)
```
Real-world example: AI4Bharat/indic-bert trained custom tokenizers for 12 Indian languages and saw 15-30% improvements on downstream tasks compared to multilingual BERT.
4. Monitor in Production
Don't just audit once—monitor continuously:
```python
from datetime import datetime

# Log per-request metrics (tokenizer and log_to_monitoring_system come from your stack)
def log_tokenization_metrics(text, user_language):
    tokens = tokenizer.encode(text)
    metrics = {
        'language': user_language,
        'text_length_bytes': len(text.encode('utf-8')),
        'token_count': len(tokens),
        'bytes_per_token': len(text.encode('utf-8')) / len(tokens),
        'timestamp': datetime.now(),
    }
    log_to_monitoring_system(metrics)

# Alert on disparities (aggregates computed downstream from the logged metrics)
if swahili_avg_tokens / english_avg_tokens > 2.0:
    alert_fairness_team()
```
Track tokens per sentence, bytes per token, and API costs by language. If disparities exceed our threshold (e.g., 2x), investigate.
5. Communicate Costs Transparently
If you're charging per token and you know there are multilingual disparities, be transparent:
Pricing (example):
Base rate: $0.01 per 1K tokens
Languages with <50% efficiency: 50% discount (Swahili, Yoruba, Igbo, Finnish, Turkish, ...)
This doesn't solve the technical problem, but it mitigates the economic harm.
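One way to turn the audit numbers into such a discount, as a rough sketch (the rates, thresholds, and efficiency figures are illustrative):

```python
BASE_RATE_PER_1K_TOKENS = 0.01  # USD, illustrative

def adjusted_rate(lang_bytes_per_token, english_bytes_per_token, base=BASE_RATE_PER_1K_TOKENS):
    """Scale the per-token price by measured tokenizer efficiency, so users
    pay roughly per byte of content rather than per token."""
    efficiency = lang_bytes_per_token / english_bytes_per_token  # 1.0 = parity with English
    return base * efficiency

# Example: Arabic at 2.3 bytes/token vs English at 5.8 bytes/token
print(f"${adjusted_rate(2.3, 5.8):.4f} per 1K tokens")  # ~40% of the base rate
```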
The Uncomfortable Truth
Here's what we need to accept: perfect fairness in tokenization is impossible.
Languages have different morphological complexity. Finnish and Turkish are agglutinative—single words encode what English expresses in sentences. They will always require more tokens per semantic unit.
The question isn't "can we eliminate disparity?" It's "can we reduce unfair disparity?"
Unfair disparity comes from training data imbalance, not linguistic reality.
What we can control:
✅ Oversample underrepresented languages during tokenizer training
✅ Allocate vocabulary fairly across languages, not proportionally to training data
✅ Monitor production metrics and alert on disparities
✅ Communicate costs and limitations transparently
What we can't control:
❌ Morphological complexity (Turkish will always need more tokens than English)
❌ Script differences (logographic vs alphabetic)
❌ Domain-specific terminology fragmentation
But we can measure the difference. And if we measure a 3x disparity for Swahili when linguistic complexity only justifies about 1.3x, we know the remaining ~2.3x comes from training bias, not linguistic reality.
What This Means for Us
If we are building with LLMs:
We must audit our tokenizer before launch. Measure tokens per sentence, bytes per token, and effective OOV across all languages/dialects we serve.
If disparities exceed 2x, we should consider:
Training a custom tokenizer with oversampling.
Using a more multilingual-friendly tokenizer (e.g., XLM-RoBERTa, mT5).
Applying pricing adjustments to mitigate economic harm.
Monitoring in production is necessary. Track tokenization metrics by demographic. Alert on anomalies.
We should be transparent. If our system works better for some groups than others, document it. Users deserve to know.
The Broader Lesson
Tokenization fairness isn't just about tokens—it's about who gets access to AI capabilities.
If your tokenizer makes Swahili 1.8x more expensive, you're pricing out users in Kenya, Tanzania, and Uganda. If your tokenizer fragments AAVE 2x more, you're signaling that this dialect is "non-standard."
Model fairness starts at the token level. If you don't measure it, you can't fix it.
What's Next?
In the next issue (the final issue in this series), we will answer the practical question:
"Should I build a custom tokenizer for my domain?"
We will cover:
The decision framework: when custom tokenizers justify the cost.
Metrics and benchmarks for measuring ROI.
Memory/compute tradeoffs (vocabulary size vs embedding quality).
Production deployment: caching, versioning, backward compatibility.
Real-world case studies: medical, legal, code generation.
Want the deep dive? Read the blog series:
Why Tokenization Matters - First principles and the compression lens
Algorithms from BPE to Unigram - Complete implementations with code
Building & Auditing Tokenizers - Production metrics, fairness, and lessons learned
Conclusion
Every fairness conversation in AI eventually touches tokenization.
Why does the model perform worse on this demographic? Check tokens per sentence.
Why are API costs higher for these users? Check bytes per token.
Why does the model seem to "understand" some dialects better than others? Check effective OOV rate.
Fairness isn't just a post-training problem you fix with RLHF. It starts at the representation layer—before the model ever sees a single training example.
Audit your tokenizer. Measure the disparities. Fix what you can. Communicate what you can't.
That's how we build AI systems that work for everyone, not just English speakers.
Until next time, Namaste!
Anirudh
P.S. Have you measured tokenization fairness in your system? Hit reply and share your findings—I'd love to hear what disparities you've discovered and how you're addressing them.
P.P.S. If this issue made you rethink your tokenization strategy, forward it to your ML team. Fairness audits should be part of every deployment checklist.
