Hey everyone, welcome to the fifteenth issue of The Main Thread. In the previous issue, we traced the history and development of techniques for using language and its structure to draw insights and help machines understand its nuances.

I have always been fascinated by what happens in the space between words: the silent agreements that form when words consistently appear together. When we say “quick brown fox”, we are looking at a statistical pattern that reveals something deeper about how meaning works.

This is where Word2Vec begins. It relies not on complex algorithms but on a simple observation so profound it changed natural language processing forever.

The Unseen Architecture of Language

Let’s think about how we truly learn a new word. When we first encounter “quokka”, we don’t reach for a dictionary; we absorb its meaning from context:

  • "The quokka hopped across the grass."

  • "These small marsupials are native to Australia."

  • "Quokkas are related to kangaroos."

Our brain did not memorize a definition; it built a web of associations: animal, hops, small, marsupial, Australia, kangaroo-family. This is how meaning actually works - not as isolated definitions, but as relational patterns.

Word2Vec formalizes this intuition with a beautiful constraint: words that appear in similar contexts must have similar meanings. It doesn’t tell the computer what words mean; it creates conditions where meaning must emerge in order to solve a practical prediction problem.
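To make the constraint concrete, here is a minimal sketch in plain Python of the raw material Word2Vec consumes - every word paired with its neighbours inside a context window. The sentence and the window size of 2 are illustrative choices, not anything prescribed by the algorithm.

```python
# A sliding context window over a sentence: the raw material Word2Vec learns from.
# The window size of 2 is an illustrative choice, not a fixed rule.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

for i, word in enumerate(sentence):
    # Neighbours within `window` positions on either side, excluding the word itself.
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"{word!r} keeps company with {context}")
```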

Two Ways of Asking the Same Question

The brilliance of Word2Vec lies in its simplicity. It takes the fundamental insight - that words appearing in similar contexts have similar meanings - and turns it into two variations of the same question:

The Skip-Gram Approach

If I show you a word, can you tell me what words usually surround it?

When you see “fox”, what words come to mind? “Quick”, “brown”, “jumps”, “forest”. This approach treats each word as a clue to its environment.
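As a rough sketch, here is what Skip-Gram’s view of the data looks like: (center, context) training pairs generated with a window of 2. This is pair generation only; the model and training loop are deliberately left out.

```python
# Skip-Gram training pairs: (center word, one context word) for every neighbour.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:5])
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ('quick', 'fox')]
```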

The CBOW Approach

CBOW stands for Continuous Bag of Words. If I tell you a set of surrounding words, can you tell me what word is likely to appear in the middle?

When you see “quick”, “brown”, “jumps”, “over”, what word fits in the center? “Fox” is the obvious answer.
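And the mirror image, sketched under the same assumptions: a CBOW training example bundles the surrounding words as the input and treats the center word as the target.

```python
# CBOW training examples: (bag of context words, center word) for every position.
sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

examples = []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    examples.append((context, center))

print(examples[3])
# (['quick', 'brown', 'jumps', 'over'], 'fox')
```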

Both methods are asking the same fundamental question: What words keep company with each other? They just approach it from opposite directions.

Why This Creates Real Understanding

Let’s understand the magic behind Word2Vec. Consider two words that might never appear together - “fox” and “wolf”. In a system that only counted direct co-occurrence, they’d remain unrelated. But in Word2Vec’s world, they discover each other through second-hand connections.

"Fox" appears with: {quick, brown, forest, jumps}

"Wolf" appears with: {gray, timber, forest, runs}

They share the context “forest”, and their other companions are semantically similar (quick → runs, brown → gray). The model doesn’t see them as identical, but it recognizes they operate in similar linguistic environments. To predict those environments efficiently, it gives the two words similar mathematical representations.
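We can make this tangible with a back-of-the-envelope calculation. The context counts below are invented for illustration - real Word2Vec vectors are dense and learned through prediction, not raw co-occurrence counts - but the overlap mechanism is the same:

```python
import numpy as np

# Invented context-count vectors over a toy companion vocabulary:
# ["quick", "brown", "gray", "timber", "forest", "jumps", "runs"]
fox  = np.array([1, 1, 0, 0, 1, 1, 0], dtype=float)   # {quick, brown, forest, jumps}
wolf = np.array([0, 0, 1, 1, 1, 0, 1], dtype=float)   # {gray, timber, forest, runs}

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for no overlap."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(fox, wolf))  # 0.25 - nonzero purely because of the shared "forest"
```

Even this crude counting gives “fox” and “wolf” a nonzero similarity. Learned embeddings sharpen the effect, because companions like “quick”/“runs” and “brown”/“gray” themselves end up with similar vectors.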

This is the distributional hypothesis in action: meaning isn’t stored in words themselves but in the relationships between them. By forcing the model to excel at predicting each word’s neighbourhood, we incidentally force it to discover semantic relationships.

The Quiet Revolution

It is this philosophical shift, not the mathematics, that makes the approach revolutionary. For decades, AI researchers tried to encode meaning through rules, ontologies, and hand-crafted relationships. Word2Vec demonstrated that meaning could emerge from simple pattern recognition at scale.

The computer doesn’t understand words in the human sense. It is discovering that certain mathematical arrangements of vectors minimize prediction error. But in doing so, it stumbles upon the same semantic structures we intuitively recognize.
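If you want to watch this happen on your own machine, here is a hedged sketch using the gensim library (this assumes gensim 4.x is installed; the toy corpus and hyperparameters are mine, not canonical). On a corpus this tiny the numbers will be noisy, but the workflow mirrors the real thing:

```python
from gensim.models import Word2Vec

# A toy corpus: any iterable of tokenized sentences works.
corpus = [
    "the quick brown fox jumps over the log".split(),
    "a gray timber wolf runs through the forest".split(),
    "the brown fox sleeps in the forest".split(),
]

# sg=1 selects Skip-Gram; sg=0 would select CBOW. Other values are illustrative.
model = Word2Vec(corpus, vector_size=32, window=2, min_count=1, sg=1, epochs=200)

# Similarity queries work because the learned vectors encode companionship.
print(model.wv.similarity("fox", "wolf"))
print(model.wv.most_similar("fox", topn=3))
```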

This explains why embeddings trained on medical text differ from those trained on fiction, and why legal-document embeddings capture nuances invisible to embeddings trained on social media. Each domain has its own pattern of companionship, and embeddings faithfully record these relationships.

The Limits of This Approach

Yet for all its elegance, this method has a fundamental constraint: it can learn only from what words appear with, never from what words actually are. When the model learns that “summer” appears with “hot”, “vacation”, and “beach”, it captures only statistical patterns, not the experience of heat on the skin or the smell of sunscreen.

This is the modern incarnation of the Symbol Grounding Problem: our most sophisticated language models are still, at their core, learning to navigate between symbols without ever touching what those symbols actually mean in the physical world.

Next issue, we will explore how this elegant idea scales to practical implementations through negative sampling, and why the choice between Skip-Gram and CBOW reflects two different philosophical approaches to learning meaning.

Until then, consider this: What words would you expect to keep company with "democracy"? Now ask yourself, would those be the same in a political science textbook, a news article, and a historical novel? The answer reveals why there is no single "correct" embedding, only representations faithful to specific contexts.

— Anirudh

P.S. If you can’t wait until next week and want to dive deeper into the technicalities and mathematics of Word2Vec, Skip-Gram, and CBOW, you can check out my detailed blog post on the subject - Understanding Embeddings: Part 2 - Learning Meaning from Context.
