A hearty welcome to the twentieth edition of The Main Thread.
Here’s something that should break your brain a little.
The English language has roughly 170,000 words in current use. To represent text in a way computers can process, the naive approach is to give each word its own dimension.
That’s 170,000-dimensional space. Every word is a single isolated point, equally far from every other word.
Yet modern AI compresses this into just 384 dimensions and somehow understands this:
"king" - "man" + "woman" = "queen"
How is this possible?
The Curse of the Obvious Approach
Let’s start with how we might naively represent words:
Give each word a vector with a single bit set and every other bit unset. This is called One-Hot Encoding.
"cat" → [1, 0, 0, 0, 0, ... 0] (position 1)
"dog" → [0, 1, 0, 0, 0, ... 0] (position 2)
"love" → [0, 0, 1, 0, 0, ... 0] (position 3)
and so on...

With a 50,000-word vocabulary, each word is a 50,000-dimensional vector.
Clearly, this approach has a fatal flaw: every word is at an equal distance from every other word.
Look at it this way: the distance between "cat" and "dog" is the same as the distance between "cat" and "democracy". The representation captures no meaning, no relationship, no similarity.
This approach is also absurdly wasteful. Each vector has 49,999 zeros. It doesn’t make any sense to use 50,000 bits to say "I am word #6941".
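To see the problem concretely, here is a tiny sketch with a made-up five-word vocabulary (the words and sizes are illustrative only):

import numpy as np

vocab = ["cat", "dog", "love", "democracy", "quokka"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Every pair of distinct words is exactly the same distance apart (sqrt(2)),
# so the encoding carries no notion of similarity whatsoever.
print(np.linalg.norm(one_hot["cat"] - one_hot["dog"]))        # 1.414...
print(np.linalg.norm(one_hot["cat"] - one_hot["democracy"]))  # 1.414...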
The Miracle of Compression
To tackle this problem, embedding models map the 50,000-dimensional space into a 384- (or 768-, or 1536-) dimensional space. Don’t think of it as lossy compression like a JPEG. It is far more interesting.
The embedding model learns to place words in vector space such that the geometric relationship among them reflects their semantic relationship.
embedding("cat") ≈ embedding("dog") # Close together embedding("cat") ≠ embedding("economics") # Far apartIn this 384-dimensional space:
Synonyms cluster together
Antonyms are separated by consistent angles
Analogies become vector arithmetic.
The famous example we saw above:
king - man + woman = queen

This works because the model learned that the "gender direction" in the vector space is consistent across words.
Subtract the male component, add the female component, and we move from king to queen (just think about it).
384 dimensions are enough to encode these relationships for hundreds of thousands of concepts.
Instead of memorizing, the model actually learns the underlying structure.
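Here is a rough sketch of that arithmetic using the sentence-transformers library and the 384-dimensional all-MiniLM-L6-v2 model (my choice of model for illustration; classic word-level models like word2vec or GloVe show the analogy effect most cleanly, so treat the exact scores as indicative rather than guaranteed):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
king, man, woman, queen = model.encode(["king", "man", "woman", "queen"])

target = king - man + woman  # move along the learned "gender direction"

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The constructed vector should land much closer to "queen" than to an unrelated word.
print("vs queen:    ", cosine(target, queen))
print("vs economics:", cosine(target, model.encode("economics")))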
Why Does This Work? The Intuition
Let’s assume every person on Earth could be described with just 10 numbers. This sounds impossible, since there are 8 billion+ people on this planet. But consider what 10 dimensions might capture:
Height
Age
Introversion ↔ Extroversion
Risk tolerance
Musical preference (classical ↔ electronic)
Morning person ↔ Night owl
Urban ↔ Rural preference
Analytical ↔ Creative thinking
Physical activity level
Social circle size
With just 10 dimensions, we can meaningfully position billions of people. Similar people are close together, while different people are far apart.
The arithmetic is simple: "Person A minus their introversion plus extroversion" gives us a more outgoing version of A.
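A tiny sketch of that arithmetic, with numbers made up purely for illustration:

import numpy as np

traits = ["height", "age", "extroversion", "risk", "music", "morningness",
          "urban", "analytical", "activity", "social"]

person_a = np.array([0.6, 0.3, 0.2, 0.5, 0.7, 0.4, 0.8, 0.9, 0.3, 0.2])

# "Minus introversion, plus extroversion" is just nudging one coordinate.
extroversion_axis = np.eye(len(traits))[traits.index("extroversion")]
outgoing_a = person_a + 0.5 * extroversion_axis

print(dict(zip(traits, np.round(outgoing_a, 2))))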
Real-world concepts aren’t actually random. They have structure. They cluster. They relate to each other in consistent ways.
Language has the same property. The 170,000 English words aren’t 170,000 independent concepts. They are combinations and variations of a much smaller set of underlying ideas.
That’s what embeddings do. They learn to discover those underlying ideas and encode them as dimensions.
What Do Those 384 Dimensions Actually Mean?
Let me tell you upfront: we don’t fully know what those 384 dimensions actually mean.
Unlike our "describe a person" example, embedding dimensions are not human interpretable. We can’t point to dimension #47 and say "this is the formality axis" or "this captures temporal concepts".
The dimensions are entangled. The meaning is distributed across many dimensions simultaneously. Each dimension participates in encoding different concepts.
Researchers have tried to reverse-engineer these spaces:
Some dimensions correlate loosely with sentiment (positive/negative)
Some seem to capture concreteness (abstract vs. physical)
Some activate for syntactic properties (noun vs. verb)
But mostly, the dimensions are alien. The model found a compression scheme that works, but it didn’t bother making it human-readable.
This is both the power and mystery of learned representations.
The Numbers are Mind-Bending
Let’s appreciate the scale of this compression:

We are going from 50,000 dimensions to 384 (130x compression) while gaining semantic understanding.
This works because the original 50,000 dimensions were an illusion. The “true dimensionality“ of language is much smaller.
Along with compressing that data, embeddings reveal the hidden structure that was always there.
Why This Matters for AI Systems
Understanding embeddings changes how we think about AI architecture:
1. Similarity is just a distance
Once text is embedded, finding “similar documents” is just finding nearby points. It’s just simple geometry.
similarity = cosine(embedding_a, embedding_b)

2. Meaning can be manipulated mathematically
If we want to find the more formal version of a sentence, we just have to find the formality direction in the embedding space and move along it.
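As a sketch of what that could look like (the example sentences, and the idea of averaging a few formal/informal pairs to estimate the direction, are my own simplification rather than an established recipe):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

formal = model.encode(["I would be delighted to attend.", "Please find the report attached."])
casual = model.encode(["Sure, I'll swing by.", "Here's the doc."])

# Crude estimate of a "formality direction" from a handful of example pairs.
formality = formal.mean(axis=0) - casual.mean(axis=0)

query = model.encode("Thanks, that works for me.")
more_formal_point = query + 0.5 * formality  # nudge the sentence toward formal

# In practice you would now search a corpus for the sentence nearest this point.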
3. Compression enables search at scale
Searching 10M documents in 50,000-dimensional space is computationally impractical. But in 384 dimensions, we can do a nearest-neighbour search within milliseconds.
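For a feel of what that looks like, here is a sketch with FAISS (the library choice, corpus size, and random stand-in vectors are all assumptions on my part):

import numpy as np
import faiss  # pip install faiss-cpu

d, n = 384, 100_000                  # 100k stand-in "document" embeddings
docs = np.random.rand(n, d).astype("float32")
faiss.normalize_L2(docs)             # normalised, so inner product == cosine

index = faiss.IndexFlatIP(d)         # exact search; IVF/HNSW indexes scale to many millions
index.add(docs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 5) # top-5 nearest neighbours
print(ids[0], scores[0])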
4. Transfer learning works because structure transfers
A model trained on Wikipedia develops an embedding space where "doctor" is near "nurse" and "hospital". That structure applies even to medical documents it’s never seen.
The Limits of Compression
Embeddings aren’t magic; they have failure modes:
1. Polysemy
"Bank" (financial) and "bank" (river) get a single embedding that sits awkwardly between both meanings. Context-aware models like BERT help, but it is still a challenge.
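You can see the context-aware fix at work with a small sketch using Hugging Face Transformers and bert-base-uncased (my choice of model for illustration):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Return the contextual hidden state of the "bank" token in this sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

v_money = bank_vector("I deposited the cheque at the bank.")
v_river = bank_vector("We sat on the grassy bank of the river.")

# The two "bank" vectors differ noticeably because the surrounding context differs.
sim = torch.nn.functional.cosine_similarity(v_money, v_river, dim=0)
print(round(sim.item(), 2))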
2. Rare Concepts
If the training data had 3 mentions of "quokka", the embedding is not reliable. Rare words get poorly positioned.
3. Cultural Bias
The training data reflects human biases. "Doctor" might be geometrically closer to "man" than to "woman", not because of meaning but because of dataset statistics.
4. Out-of-domain Collapse
An embedding model trained on Wikipedia may position legal or medical jargon poorly, even if the words are common in their domains.
The Profound Implication
Think about it:
If 384 dimensions can capture the semantic relationships between hundreds of thousands of words - all the nuance, all the analogy, all the meaning humans have encoded in language over millennia…
What does that say about the structure of human thought?
Maybe our concepts, as infinite as they feel, are actually projections of a much lower-dimensional space. Maybe the embedding model is discovering something about cognition itself.
Or maybe it’s just a very good compression.
Either way, the next time you use semantic search or ask your favourite LLM a question, remember: somewhere in those 384 dimensions, “meaning“ has a geometry.
And somehow, that geometry works.
What's the most surprising embedding relationship you have discovered? I am always looking for new examples of this geometry in action.
— Anirudh

