Hello fellow engineers, welcome to the seventeenth issue of The Main Thread.
We type “Explain quantum entanglement like I am five“ in our favourite LLM (mine is Claude, btw) and hit enter.
200 milliseconds later, words start appearing on our screen.
In that tiny window (200ms feels instantaneous to humans), our text travels through one of the most sophisticated software systems ever built.
It gets shredded into pieces, transformed into geometry, processed by billions of parameters, and reassembled into coherent English.
Today, let’s trace that journey. Not at “magic happens here“ level but at a level where we actually understand what’s going on.
Step 1: Our Text Gets Destroyed (0-5ms)
The first thing that happens to the text is brutal: it gets ripped apart.
Our sentence “Explain quantum entanglement like I am five“ doesn’t enter the model as words.
Words are too messy. They have different lengths, infinite variety, and ambiguous boundaries.
Therefore, a tokenizer breaks the sentence into small, known tokens. Our seven-word sentence becomes:
"Explain" -> ["Explain"] -> 176289
"quantum" -> ["quantum"] -> 48889
"entanglement" -> ["ent", "ang", "lement"] -> 1121, 516, 1254
"like" -> ["like"] -> 1299
"I" -> ["I"] -> 357
"am" -> ["am"] -> 939
"five" -> ["five"] -> 6468
Each token gets mapped to an integer: its position in a vocabulary of ~200,000 possible tokens.
Why Subwords Instead of Words?
As we can see, the complex word “entanglement“ is broken into smaller tokens: ”ent”, ”ang”, “lement“.
By learning a fragment like ”ent” once, the model can reuse it across thousands of other words. This compression is what makes it possible to handle any text we throw at it, including words it has never seen before.
The tokenizer doesn’t understand meaning. It is just following learned rules about where to split text.
But these splits matter enormously: they determine how much the API call costs, how much context fits in the window, and even whether the model can reason about the text correctly.
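To make the splitting concrete, here is a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary. Real tokenizers (BPE) learn their merges from data, and the vocabulary below is invented for illustration; only the mechanics of "unknown word falls apart into known subwords" carry over:

```python
# Toy greedy longest-match tokenizer over a hypothetical vocabulary.
# Real BPE tokenizers learn merges from data; this sketch only shows
# how an unseen word decomposes into known subword pieces.
VOCAB = {"explain": 0, "quantum": 1, "like": 2, "i": 3, "am": 4,
         "five": 5, "ent": 6, "ang": 7, "lement": 8}

def tokenize(word: str) -> list[str]:
    word = word.lower()
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token matches at {word[i:]!r}")
    return tokens

print(tokenize("entanglement"))  # -> ['ent', 'ang', 'lement']
print(tokenize("five"))          # -> ['five']
```

The whole word "entanglement" is not in the vocabulary, so the tokenizer falls back to the longest known fragments, exactly as in the table above.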
Step 2: Tokens Become Geometry
Now, we have a sequence of integers - [176289, 48889, 1121, 516, 1254, ...] .
These numbers are meaningless on their own. The integer 1121 doesn’t tell the model anything about what “ent“ means. This is where embeddings come in.
Each token ID is used to look up a vector, a list of 1536 floating point numbers, from a giant table. These vectors were learned during training to encode semantic meaning.
After this lookup, our prompt isn’t text anymore. It is a matrix of shape: [num_tokens x embeddings_dim] → roughly [10 x 1536] for our example.
Key Insight: In this vector space, meaning becomes geometry.
- Words that mean similar things are clustered together
- Relationships become directions (king - man + woman ≈ queen)
- The model “reasons“ by moving through this space
Our prompt is now a cloud of points in 1536-dimensional space, and each point represents one token.
The arrangement of these tokens encodes everything the model knows about our text.
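The lookup itself is just array indexing. This numpy sketch uses a toy vocabulary size and random values (a real table is learned during training and has ~200,000 rows); the token IDs are made up to fit the toy vocabulary:

```python
import numpy as np

# Toy sizes so the example runs instantly; real models use a vocabulary
# of ~200,000 tokens and an embedding dimension of 1536 or more.
VOCAB_SIZE, EMBED_DIM = 1000, 1536
rng = np.random.default_rng(0)

# In a real model this table is learned during training; here it is
# random, just to show the mechanics of the lookup itself.
embedding_table = rng.standard_normal((VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

token_ids = [12, 488, 112, 516, 125]        # made-up IDs within the toy vocab
prompt_matrix = embedding_table[token_ids]  # fancy indexing = table lookup

print(prompt_matrix.shape)  # (5, 1536)
```

One row per token, 1536 columns per row: the [num_tokens x embedding_dim] matrix from above.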
Step 3: Tokens Talk to Each Other (10-15ms)
This is where the actual processing happens, and this step takes roughly 90% of the compute.
Our tokens (now vectors) pass through the transformer layers. GPT-4 reportedly has roughly 120 of these layers stacked on top of each other.
In each layer, every token looks at every other token and asks: “How relevant are you to me?“ This is attention.
When processing “Explain quantum entanglement like I am five“:
- The token “five“ attends strongly to “like“, “I”, and “am“ (they form a phrase)
- The token “Explain“ attends to “quantum“ and “entanglement“ (what to explain)
- The token “entanglement“ attends to “quantum“ (modifier relationship)
These attention patterns are learned and not programmed. During training, the model discovered that certain relationships between positions matter for predicting the next word.
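The "how relevant are you to me?" question has a precise form: scaled dot-product attention. Here is a minimal numpy sketch with toy sizes (real models also learn separate Q/K/V projection matrices per head, which this omits for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Each token scores every other token, normalizes the scores,
    # then takes a weighted mix of the value vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # [tokens x tokens] relevance scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d = 10, 64                    # toy sizes, not GPT-4's
x = rng.standard_normal((n_tokens, d))

# A real layer computes Q = xWq, K = xWk, V = xWv with learned weights;
# we reuse x directly to keep the sketch short.
out, w = attention(x, x, x)
print(out.shape, w.shape)  # (10, 64) (10, 10)
```

The weights matrix `w` is exactly the attention pattern described above: row i tells you how much token i attends to every other token.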
Each Layer Refines the Representation
Layer 1-20: Basic syntax, local relationships
Layer 40-60: Semantic understanding, entity recognition
Layer 80-120: Abstract reasoning, task understanding
By the final layer, the vector for each token has been transformed. It no longer represents just that token. Now, it represents that token in the context of everything else.
The vector “five“ now encodes: "This is the word 'five' being used to indicate a simplification level for an explanation about quantum entanglement requested by the user."
All of that, compressed into 1536 numbers.
Step 4: Predicting the Next Token (150-200ms)
After 120 layers of transformation, we reach the output.
The model takes the final vector (for the last token in our prompt) and projects it back to vocabulary space: a vector of ~200,000 numbers, one for each possible next token.
"The" → 0.0002
"Quantum" → 0.0034
"Imagine" → 0.1847 ← High probability
"Sure" → 0.0923
"Let" → 0.0412
...
The model samples from this distribution (with some temperature-based randomness) and picks: “Imagine“
That token gets appended to the sequence. Then the whole process repeats but now with “Imagine“ included in the context.
A 100-token response means running this entire pipeline 100 times. That’s why longer responses take longer: compute scales with the number of tokens generated.
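Temperature-based sampling can be sketched in a few lines. The token names and logit values below are invented to mirror the probability table above, not taken from any real model:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    # Temperature < 1 sharpens the distribution (more deterministic);
    # temperature > 1 flattens it (more random).
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # softmax, numerically stable
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

# Hypothetical logits for five candidate tokens, echoing the list above.
tokens = ["The", "Quantum", "Imagine", "Sure", "Let"]
logits = np.array([-2.0, -1.2, 3.1, 2.4, 1.6])
idx = sample_next_token(logits, rng=np.random.default_rng(0))
print(tokens[idx])
```

At temperature 0 this collapses to argmax ("Imagine" every time); higher temperatures let lower-probability tokens like "Sure" through occasionally.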
Numbers That Should Blow Your Mind
In that 200ms first-token response:
| What | Scale |
|---|---|
| Tokens processed | ~10 |
| Embedding lookups | ~10 × 1536 = ~15,360 values |
| Attention computations | ~120 layers × 10² token pairs × 96 heads ≈ ~1.15 million |
| Total parameters touched | ~1.8 trillion (GPT-4 estimate) |
| Floating point operations | ~10¹⁴ (100 trillion) |
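These are back-of-envelope numbers, easy to reproduce (the layer and head counts are public GPT-4 estimates, not confirmed figures):

```python
# Reproducing the rough estimates above.
tokens = 10
embed_dim = 1536
layers, heads = 120, 96          # widely cited GPT-4 estimates, unconfirmed

embedding_values = tokens * embed_dim          # one vector per token
attention_scores = layers * tokens**2 * heads  # every pair, every head, every layer

print(embedding_values)   # 15360
print(attention_scores)   # 1152000
```

Note how the attention term grows with the *square* of the token count: double the prompt length and the pairwise work quadruples.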
All of that happens before we see the first word. And then it happens again. And again… for every single token in the response.
Why This Matters for Us
Understanding this pipeline changes how we work with LLMs:
1. Token efficiency is cost efficiency
Every token in our prompt and response costs compute. Verbose prompts are slow and expensive. “Explain X simply“ beats “I would like you to please explain X in a simple way“.
2. Context window limits are hard limits
You must have heard about the 128k context window. That’s 128k tokens, not characters. Our 50-page document might tokenize to 80k tokens, leaving only 48k tokens for the conversation.
3. Early tokens matter more
Due to how causal attention works, the model “sees“ our entire prompt, but earlier tokens get to influence the representation of every token that follows them. So, we should put the most important instructions first.
4. The model doesn’t “remember“ anything
Each API call is stateless. The model reprocesses our entire conversation history every time. There is no persistent memory, just increasingly long context windows.
This is all in this week’s newsletter. So, the next time you hit enter in your favourite LLM, remember that you are launching a spacecraft through 1536-dimensional space.
If this helped you understand LLMs better, share it with someone who thinks they are just “statistics”.
— Anirudh

