Hey curious engineers, welcome to the fourteenth issue of The Main Thread.
In the last issue, we uncovered the why: embeddings exist because meaning comes from usage. The distributional hypothesis - “you shall know a word by the company it keeps” - solves the symbol grounding problem for computers.
But understanding the philosophical foundation is only half the story. The real marvel is how this 1950s linguistic insight became the mathematical foundation of modern AI.
Today, we trace the 35-plus-year engineering journey - from linear algebra to neural networks to transformers - that transformed a theoretical idea into practical, scalable algorithms.
1988: From Linguistics to Linear Algebra (The Statistical Evolution)
The distributional hypothesis remained a theoretical gem for decades. Then in 1988, a team at Bellcore decided to treat it statistically.
They created Latent Semantic Analysis (LSA). Their breakthrough was simple but profound:
Build a massive matrix where rows are terms and columns are documents.
Each cell contains a term’s frequency in that document (or its TF-IDF weight).
Apply Singular Value Decomposition (SVD), a linear algebra technique, to compress this giant matrix into a much smaller set of dimensions.
Think of SVD as finding the “semantic themes” that explain word-document relationships. If “Java” appears in both coffee blogs and programming tutorials, SVD discovers two latent dimensions: one connecting to “coffee”, “brew”, “arabica” and another to “code”, “compiler”, “syntax”.
Why this mattered: For the first time, we could quantify Firth’s “company”. Words that co-occurred in similar documents clustered together in this reduced space. “Car” and “automobile” ended up nearby even if they never appeared in the same sentence.
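If you want to see the idea in code, here is a minimal sketch of the LSA pipeline using scikit-learn. The four-document corpus and the two latent dimensions are illustrative stand-ins, not the original Bellcore setup.

```python
# A rough LSA sketch: TF-IDF term-document matrix + truncated SVD.
# The corpus and the number of components are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "arabica coffee brew with a rich java aroma",
    "java code compiles once the syntax error is fixed",
    "the compiler rejects code with a syntax error",
    "brew the coffee slowly for a better aroma",
]

# Rows of X are documents, columns are terms (the transpose of the classic
# term-document layout, but the latent structure is the same).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Compress the matrix into 2 latent "semantic themes".
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)      # documents in the latent space
term_topics = svd.components_.T        # terms in the latent space

# Terms that keep similar company end up with similar coordinates.
for term, vec in zip(vectorizer.get_feature_names_out(), term_topics):
    print(f"{term:10s} {vec.round(2)}")
```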
The limitation: This was pure statistics, not learning. The relationships were linear and based on global document co-occurrence, missing local syntactic patterns.
2013: The Neural Revolution (Learning Instead of Counting)
For 25 years, LSA and its variants powered early search engines. Then in 2013, Tomas Mikolov and his team at Google introduced Word2Vec.
Their insight was radical: instead of counting co-occurrences, let’s learn them.
Word2Vec’s skip-gram architecture asked a beautifully simple question: “Given this word, can you predict the words around it?“ A shallow neural network learned to do exactly that.
The key innovation: the training objective (predicting context) literally operationalizes “the company a word keeps”. When the model sees “Java” near “coffee” in some sentences and near “programming” in others, it doesn’t average global counts the way LSA does. It nudges a single vector, one gradient update at a time, until that vector sits where both contexts can use it.
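Here is a toy version of that objective using gensim’s skip-gram implementation; the four tiny “sentences” are stand-ins for the billions of words the real models were trained on.

```python
# Toy skip-gram training with gensim (sg=1 selects skip-gram).
# The corpus is an illustrative stand-in, far too small for useful vectors.
from gensim.models import Word2Vec

sentences = [
    ["java", "coffee", "brew", "arabica"],
    ["java", "code", "compiler", "syntax"],
    ["coffee", "brew", "morning", "cup"],
    ["code", "compiler", "error", "syntax"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the learned vectors
    window=5,         # the "company": up to 5 words on either side
    sg=1,             # skip-gram: predict the context from the word
    min_count=1,
    epochs=50,
)

# One static vector per word, shaped by every context it appeared in.
print(model.wv["java"][:5])
print(model.wv.most_similar("coffee", topn=3))
```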
The result was vectors that captured not just semantic similarity but relational analogies with stunning clarity.
programmer - code + poetry = poet
Tokyo - Japan + France = Paris
These analogies were not programmed; they emerged naturally from the training objective. The neural network had found a more efficient way to compress distributional patterns than linear algebra could.
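You can try the arithmetic yourself with pretrained vectors. This sketch uses GloVe vectors from gensim’s downloader as a readily available stand-in for the original Word2Vec vectors; the exact neighbours you get depend on the vectors you load.

```python
# Analogy arithmetic on pretrained vectors (a sizable one-time download).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

# Tokyo - Japan + France ≈ ?
print(wv.most_similar(positive=["tokyo", "france"], negative=["japan"], topn=3))

# programmer - code + poetry ≈ ?
print(wv.most_similar(positive=["programmer", "poetry"], negative=["code"], topn=3))
```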
2017-Present: From Static to Contextual (The Dynamic Revolution)
Word2Vec had a critical flaw: each word got one vector forever. “Bank“ had the same representation in “river bank“ and “bank deposit“.
The next evolution asked the question: what if the “company” changes with every sentence?
To answer it, Vaswani et al. (2017) introduced the Transformer architecture. Its breakthrough was the attention mechanism, which dynamically computes a word’s representation based on its entire sentence context.
Now, “bank” gets different vectors:
In “I sat by the river bank”, the vector leans towards {water, flow, erosion}.
In “I deposited money at the bank”, the vector leans towards {money, financial, teller}.
This is the distributional hypothesis taken to its logical extreme.
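You can watch it happen with a few lines of Hugging Face transformers code. This sketch uses the standard bert-base-uncased checkpoint and simply compares the two “bank” vectors; the exact similarity score will vary by model.

```python
# Contextual embeddings: the same word "bank" gets a different vector
# in each sentence. Uses the standard bert-base-uncased checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the hidden state of the 'bank' token in this sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

river = bank_vector("I sat by the river bank.")
money = bank_vector("I deposited money at the bank.")

# Same word, different company, noticeably different vectors.
print(torch.cosine_similarity(river, money, dim=0).item())
```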
The Unifying Thread (and its Consequences)
Despite their mathematical differences, LSA, Word2Vec, and BERT share the same DNA: they are all different computational strategies for asking the same question - “What words keep company with this one?”
This shared foundation explains their shared limitations:
Correlation ≠ Causation: All three methods capture statistical patterns, not causal relationships. They know “rain“ and “wet“ co-occur but not that rain causes wetness.
The Popularity Bias: All three over-represent common contexts. Rare but meaningful associations (like “quokka“ and “endangered“) get drowned out by frequent and generic associations (like “quokka“ and “animal”).
The Context Boundary Problem: Where does “company“ end? LSA used documents, Word2Vec used windows (typically 5 words), BERT uses the full sentence. Each choice creates blind spots.
Why This History Matters for Practitioners
Understanding this evolution changes how we work with embeddings:
When LSA-style methods still work: For document retrieval with limited compute, SVD or TF-IDF matrices remain remarkably effective. Sometimes the 1988 solution is the right one.
The Word2Vec sweet spot: When we need lightweight, static word representations (especially for domain-specific analogies or small vocabularies), Word2Vec often outperforms more complex methods.
When to use contextual embeddings: For disambiguation tasks, sentence similarity, or any application where word meaning changes dramatically with context, we need BERT-style models.
The key insight: the best embedding method depends on which definition of “company” matters for your problem.
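As a concrete example of “starting simple”, here is the kind of baseline worth measuring before reaching for a contextual model: plain TF-IDF retrieval with cosine similarity. The corpus and query are illustrative.

```python
# A "start simple" retrieval baseline: TF-IDF vectors + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "How do I reset my password?",
    "Steps to change the billing address on an account",
    "The app crashes when uploading large files",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)

query = "password reset instructions"
query_vector = vectorizer.transform([query])

# Rank documents by similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = scores.argmax()
print(f"Best match ({scores[best]:.2f}): {corpus[best]}")
```

If a baseline like this already clears your quality bar, you may not need anything heavier.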
Looking Forward: What’s Beyond Distribution?
The distributional hypothesis has taken us far, but we are hitting its limits. Models that only know words from other words will never truly understand them.
The next frontier is grounded, multi-modal learning. It asks: What if “company” includes images, sounds, physical interaction, and temporal experience? What if we learn embeddings not just from text but from the world that text describes?
That’s where the field is heading and we will hear more about it in the future.
Until then, consider this: the word “gravity“ appears with “Newton“, “force“, and “physics“ in text. But only experiencing a falling object grounds its true meaning. Our current embeddings capture the former. The next generation must capture the latter.
— Anirudh
P.S. This historical perspective explains why certain embedding techniques work better for specific tasks. Next time you choose an embedding method, ask: “What definition of company does this algorithm use, and is that right for my problem?“
P.P.S. If you are implementing embeddings in production, the practical takeaway is this: Start simple (TF-IDF, Word2Vec), measure carefully, and only upgrade to contextual embeddings if you need that dynamic context sensitivity. Most problems don’t require BERT’s complexity.

