Welcome to the thirteenth issue of The Main Thread.

I remember the first time I read that embeddings were "vectors that capture meaning." I didn’t believe it. How could a list of 768 abstract, cold, mathematical numbers possibly capture the warm, messy, nuanced meaning of a word like "home"?

I’d see the famous equation: king - man + woman ≈ queen. It was neat, but it felt like a parlour trick. I often thought that we were dressing up statistics in geometric clothing and calling it "understanding."

I was missing the why.

The abstraction was the problem. We start with vectors, and work backward. But to truly get it, we must start at the beginning, with the fundamental problem that made embeddings necessary in the first place.

The Confusion: From Symbols to Meaning

My background is in systems engineering. Computers made sense: they process clear instructions on clear data. Language broke that model.

When we type "dog", our computer doesn't store a concept. It stores bytes: [100, 111, 103]. These are just arbitrary labels in a lookup table (UTF-8). This is the Symbol Grounding Problem: how do we connect these meaningless symbols to the actual, furry, barking concept of a dog?

Tokenization, which I covered in my last series, converts text to integer tokens (e.g., "dog" → 4521). This solves the computational problem of representation but not the semantic one. The token 4521 is as meaningless as the bytes [100, 111, 103].
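
To see both layers of arbitrariness concretely, here is a tiny Python sketch; the vocabulary and the ids in it are made up for illustration, not taken from any real tokenizer:

```python
# The bytes behind the string "dog" -- just UTF-8 code points, nothing more.
print(list("dog".encode("utf-8")))   # [100, 111, 103]

# A made-up vocabulary standing in for a tokenizer's lookup table.
# The id 4521 is arbitrary; nothing about it relates "dog" to "puppy".
toy_vocab = {"dog": 4521, "cat": 812, "puppy": 9087}
print(toy_vocab["dog"])              # 4521
```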

So we try the classic ML solution: one-hot encoding. We represent "dog" as a 50,000-dimensional vector that is all zeros except for a single 1 at position 4521.
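
Here is a minimal sketch of that encoding with NumPy, assuming the 50,000-word vocabulary and the token id 4521 from the example:

```python
import numpy as np

VOCAB_SIZE = 50_000      # assumed vocabulary size from the example
DOG_ID = 4521            # assumed token id for "dog"

# One-hot: a 50,000-dimensional vector that is zero everywhere except one slot.
one_hot_dog = np.zeros(VOCAB_SIZE, dtype=np.float32)
one_hot_dog[DOG_ID] = 1.0

print(one_hot_dog.sum())     # 1.0 -- exactly one "on" dimension
print(one_hot_dog.nbytes)    # 200000 bytes (float32) for a single word
```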

This is where my intuition finally broke. One-hot encoding creates three catastrophic problems:

  1. All words are islands: The dot product between "dog" and "cat" is zero, the same as between "dog" and "asteroid". It encodes difference, but not degrees of similarity (see the sketch after this list).

  2. It's computationally insane: A 20-word sentence, stored as 50,000-dimensional float32 vectors, takes about 4MB of mostly-zero data before any learning even begins.

  3. It cannot generalize: The model can't infer that "puppy" is related to "dog" because their vectors share no dimensions.
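
Here is the sketch mentioned above, making problems 1 and 2 concrete with NumPy (the token ids are the same illustrative assumptions as before):

```python
import numpy as np

VOCAB_SIZE = 50_000

def one_hot(token_id: int) -> np.ndarray:
    """Build a one-hot vector for an (assumed) token id."""
    v = np.zeros(VOCAB_SIZE, dtype=np.float32)
    v[token_id] = 1.0
    return v

dog, cat, asteroid = one_hot(4521), one_hot(812), one_hot(30417)

# Problem 1: every pair of distinct words looks equally unrelated.
print(dog @ cat)        # 0.0
print(dog @ asteroid)   # 0.0

# Problem 2: a 20-word sentence as dense float32 one-hot vectors.
sentence = np.stack([one_hot(i) for i in range(20)])
print(sentence.nbytes)  # 4000000 bytes -- about 4 MB, almost all of it zeros
```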

I realized my confusion stemmed from accepting the form of the solution (vectors) without understanding the constraints that shaped it. We don't use vectors because they are elegant; we use them because they are the only mathematical structure that can solve all three problems at once.

The Reframe: Meaning is Not in the Word, But in Its Company

The breakthrough came from 1950s linguistics, specifically the Distributional Hypothesis: "You shall know a word by the company it keeps." (J.R. Firth).

Meaning is not an intrinsic property of the symbol "dog". It emerges from the statistical patterns of its usage. "Dog" appears in contexts with "bark", "vet", "pet". "Cat" appears with "litter", "meow", "vet", "pet". They share contexts, so they must share some meaning.

This isn't just philosophy; it's a measurable, statistical reality. We can capture it with Pointwise Mutual Information (PMI), which asks: How much more often do these words appear together than by pure chance?
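
Formally, PMI(w1, w2) = log2( P(w1, w2) / (P(w1) · P(w2)) ): positive when two words co-occur more often than chance predicts, negative when they avoid each other. Here is a back-of-the-envelope sketch over a deliberately tiny, made-up corpus, treating "appears in the same sentence" as co-occurrence:

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus: each "sentence" is a list of tokens (an assumption for this sketch).
corpus = [
    ["dog", "bark", "vet"],
    ["dog", "pet", "vet"],
    ["cat", "meow", "pet"],
    ["cat", "litter", "vet"],
    ["asteroid", "orbit", "impact"],
]

n_sents = len(corpus)
word_counts = Counter(w for sent in corpus for w in set(sent))
pair_counts = Counter(
    tuple(sorted(pair)) for sent in corpus for pair in combinations(set(sent), 2)
)

def pmi(w1: str, w2: str) -> float:
    """PMI = log2(P(w1, w2) / (P(w1) * P(w2))), estimated from sentence counts."""
    p_xy = pair_counts[tuple(sorted((w1, w2)))] / n_sents
    p_x, p_y = word_counts[w1] / n_sents, word_counts[w2] / n_sents
    return math.log2(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(pmi("dog", "vet"))       # positive: they co-occur more than chance
print(pmi("dog", "asteroid"))  # -inf here: they never co-occur in this toy corpus
```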

This was the missing link. Embeddings are not arbitrary vectors; they are dense, compressed representations of these co-occurrence statistics. The geometry isn't a convenient visualization; it is the meaning.

When Word2Vec creates a vector where king - man + woman ≈ queen, it is because the distributional patterns of these words in a massive corpus exhibit that parallel structure. The model is not doing algebra on concepts; it's reflecting statistical reality through geometry.
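
If you want to see this for yourself, one way is with gensim's pretrained vectors; the snippet below assumes gensim is installed and that downloading the small "glove-wiki-gigaword-50" model on first run is acceptable (any pretrained word-vector set would do):

```python
import gensim.downloader as api

# Load pretrained GloVe vectors (downloaded the first time this runs).
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman": gensim combines the positive vectors, subtracts the
# negative ones, and ranks the vocabulary by cosine similarity to the result.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# Expect "queen" near the top -- a reflection of corpus statistics, not reasoning.
```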

The Key Insight: Similarity is Alignment, Not Proximity

The final piece clicked with a physics analogy: the dot product.

In physics, work done is the dot product of force and displacement. If we push a box at an angle, only the component of our force aligned with the direction of motion does work.

Similarly, the dot product between two word vectors measures their alignment. Do they point in the same semantic direction?

This is why we use cosine similarity (the normalized dot product). We care about the angle between vectors, not their raw magnitude. A long document and a short tweet about dogs should point in the same direction, even if one vector is much longer.
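
Here is a minimal sketch of that normalized dot product with NumPy; the two vectors are placeholders standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized dot product: alignment of direction, ignoring magnitude."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder vectors: a "long document" and a "short tweet" about the same topic.
doc = np.array([4.0, 2.0, 0.5])
tweet = np.array([0.8, 0.4, 0.1])      # same direction, much smaller magnitude

print(doc @ tweet)                     # 4.05 -- raw dot product rewards length
print(cosine_similarity(doc, tweet))   # 1.0 -- identical direction, same "meaning"
```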

This reframes everything:

  • Before: Embeddings are magical meaning vectors.

  • After: Embeddings are geometric shadows cast by the distributional usage of words. Similarity is the degree to which these shadows overlap.

Why This Matters

Understanding the why changes how we work with embeddings:

  1. We respect their limits. They capture correlation, not causation. They model word usage, not world knowledge. They are powerful, but they are not conscious.

  2. We debug with first principles. If our semantic search is failing, we think about co-occurrence patterns, PMI, and vector alignment, not just tweaking hyperparameters.

  3. We see the continuity. Tokenization (compressing character patterns) flows into embeddings (compressing semantic patterns), which flows into transformer attention (modeling context). It's one cohesive story of compression and representation.

In the next issue, we'll dive into how geometry changed the way I think about language.

Until then, consider this: What does the vector for "gravity" not know about gravity?

— Anirudh

P.S. If this reframed your understanding, I'd love to hear from you. What's one concept in ML/AI that you've always found confusing? Reply to this email and let me know.

P.P.S. Want to go deeper? Read the complete and detailed blog post here. It sets the stage for everything we discussed today.
