Hey everyone, welcome to the twenty-fourth issue of The Main Thread.

Continuing our commitment to understanding AI engineering deeply by peeling back the layers and discussing the patterns around it, this issue covers Context Window Management and how to avoid paying more for worse results.

Introduction

Let’s say your LLM provider supports a 128K context window. That doesn’t mean you should use all 128K tokens. I know this sounds like I am telling you not to use the thing you are paying for, but here is what nobody mentions: longer context doesn’t mean better results. Often, it means mediocre or even worse results at a higher cost.

I see engineers on X or Reddit, and even some AI companies, boast about how they stuff their entire codebase into a prompt. What they fail to mention is that their model “forgot” the critical instructions buried in the middle.

These teams often shoot themselves in the foot with production bills because they thought “more context = more accurate”. I have personally worked with RAG systems that retrieved 50 relevant chunks but performed worse than systems that retrieved only 5.

Context window management is one of those problems that seems trivial until we are in production. Then it becomes the difference between an AI feature that delights users and one that frustrates them.

Let’s fix that.

The “Lost In The Middle” Problem

Before we talk about solutions, we need to understand why this matters.

In 2023, researchers at Stanford published a paper that should have been a wake-up call for every AI engineer. The task they gave language models was simple: find a specific fact buried somewhere in a long context. The models performed well when the fact was at the beginning or at the end. But accuracy dropped by 20-30% when the fact was in the middle.

This is the “lost in the middle” problem, and it affects every major LLM.

Position in context:     Beginning    Middle    End
Recall accuracy:         ~95%         ~65%      ~90%

This happens because attention mechanisms tend to produce U-shaped recall curves: strong at the edges and weak in the center. In other words, language models don’t “see” all tokens equally.
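The experiment setup described above can be sketched in a few lines. This is a minimal illustration, not the researchers’ actual harness: `build_probe_context`, the filler documents, and the needle sentence are all hypothetical names and data made up for this example. It only builds the prompt context; sending it to a model and scoring the answer is left out.

```python
def build_probe_context(filler_docs: list[str], needle: str, depth: float) -> str:
    """Place `needle` among `filler_docs` at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Index at which the needle is inserted among the filler documents
    position = round(depth * len(filler_docs))
    docs = filler_docs[:position] + [needle] + filler_docs[position:]
    return "\n\n".join(docs)


# Hypothetical data: ten filler documents and one fact to retrieve
filler = [f"Filler document {i}." for i in range(10)]
needle = "The access code is 7421."

for depth in (0.0, 0.5, 1.0):
    context = build_probe_context(filler, needle, depth)
    offset = context.index(needle) / len(context)
    print(f"depth={depth:.1f} -> needle starts at {offset:.0%} of the context")
```

Running the same question against contexts built at several depths is one way to check whether a given model shows the U-shaped curve on your own data.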

What does this mean practically? If we stuff 100K tokens into our prompt, there’s a high chance that the model will effectively ignore 30-40% of them. We end up paying for tokens that do not contribute to the answer.
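To put a rough number on that, here is a back-of-the-envelope sketch. The price is a hypothetical placeholder, not any provider’s real rate, and `wasted_spend` is a helper name invented for this example; plug in your own pricing and ignored fraction.

```python
def wasted_spend(
    prompt_tokens: int, ignored_fraction: float, price_per_million: float
) -> tuple[float, float]:
    """Return (total prompt cost, cost of the tokens that likely contribute nothing)."""
    total = prompt_tokens / 1_000_000 * price_per_million
    return total, total * ignored_fraction


# HYPOTHETICAL price: 3.00 currency units per million input tokens.
# 0.35 is the midpoint of the 30-40% range mentioned above.
total, wasted = wasted_spend(100_000, 0.35, 3.00)
print(f"per request: total={total:.2f}, wasted={wasted:.3f}")
print(f"per 10,000 requests: wasted={wasted * 10_000:.0f}")
```

The per-request figure looks small, which is exactly why this tends to go unnoticed until the request volume grows.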

Three Chunking Strategies

Most AI applications need to break documents into chunks before embedding them or including them in prompts. How we chunk dramatically affects retrieval quality.

1. Fixed-Size Chunking

This is the simplest approach: we split text into chunks of N tokens (or characters) with optional overlap.

def fixed_size_chunk(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """
    Split text into fixed-size chunks with overlap.

    Args:
        text: Input text to chunk
        chunk_size: Target size per chunk (in characters)
        overlap: Characters to overlap between chunks

    Returns:
        List of text chunks
    """
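    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]

        # Try to break at a sentence boundary, unless this is the final chunk
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.8:  # Don't break too early
                end = start + last_period + 1
                chunk = text[start:end]

        chunks.append(chunk.strip())

        # Stop once the text is consumed; otherwise the loop would emit a
        # redundant tail chunk made only of overlap
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # Always move forward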

    return [c for c in chunks if c]  # Filter empty chunks

Pros

Simple
Predictable chunk sizes
Easy to estimate token counts

Cons

Breaks semantic units arbitrarily
A paragraph about one concept might be split across two chunks
Each chunk’s embedding may capture an incomplete idea

When to use?

Quick prototyping
Uniform content like logs or structured data
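The function above counts characters. If you would rather size chunks in tokens, the same sliding-window idea applies. The sketch below is an illustration under a loud assumption: whitespace-separated words stand in for model tokens, which real tokenizers do not match exactly. `token_window_chunk` and its parameters are names made up for this example; in practice you would swap `text.split()` for your model’s tokenizer.

```python
def token_window_chunk(
    text: str, chunk_tokens: int = 128, overlap_tokens: int = 16
) -> list[str]:
    """Split text into windows of `chunk_tokens` tokens that overlap by `overlap_tokens`."""
    if not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("overlap_tokens must be non-negative and smaller than chunk_tokens")

    # ASSUMPTION: whitespace-separated words approximate real model tokens
    tokens = text.split()
    step = chunk_tokens - overlap_tokens

    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(" ".join(window))
        if start + chunk_tokens >= len(tokens):
            break  # The last window already reached the end of the text
    return chunks


sample = " ".join(f"word{i}" for i in range(300))
for i, chunk in enumerate(token_window_chunk(sample)):
    print(i, len(chunk.split()), "tokens")
```

Because every window except possibly the last has exactly `chunk_tokens` tokens, estimating the prompt size for k retrieved chunks is a simple multiplication.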

2. Semantic Chunking

In this technique, we split at natural boundaries of language such as paragraphs, sections, or sentences. This strategy respects the document structure.

import re

def semantic_chunk(text: str, max_chunk_size: int = 1000) -> list[str]:
    """
    Split text at semantic boundaries (paragraphs, sections).

    Args:
        text: Input text to chunk
        max_chunk_size: Maximum characters per chunk

    Returns:
        List of semantically coherent chunks
    """
    # Split by double newlines (paragraphs) or headers
    sections = re.split(r'\n\n+|(?=^#{1,3}\s)', text, flags=re.MULTILINE)
    sections = [s.strip() for s in sections if s.strip()]

    chunks = []
    current_chunk = ""

    for section in sections:
        # If section alone exceeds max, split by sentences
        if len(section) > max_chunk_size:
            sentences = re.split(r'(?<=[.!?])\s+', section)
            for sentence in sentences:
                if len(current_chunk) + len(sentence) > max_chunk_size:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    current_chunk = sentence
                else:
                    current_chunk += " " + sentence
        # Otherwise, accumulate sections
        elif len(current_chunk) + len(section) > max_chunk_size:
            chunks.append(current_chunk.strip())
            current_chunk = section
        else:
            current_chunk += "\n\n" + section

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

Pros

Preserves meaning within chunks
Better embeddings because each chunk represents a complete idea

Cons

Variable chunk sizes
Some chunks might be tiny (a single heading), others large (a long paragraph)

When to use?

Documentation, articles or any prose content where semantic coherence matters
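One way to soften the tiny-chunk problem listed under Cons is a post-processing pass that folds undersized chunks into their neighbor. This is a sketch under assumptions, not part of the chunker above: `merge_small_chunks`, `min_size`, and `max_size` are hypothetical names and thresholds you would tune for your own content.

```python
def merge_small_chunks(
    chunks: list[str], min_size: int = 200, max_size: int = 1000
) -> list[str]:
    """Fold chunks shorter than `min_size` into the chunk that follows them."""
    merged = []
    pending = ""  # Small chunk(s) waiting to be attached to the next chunk

    for chunk in chunks:
        candidate = f"{pending}\n\n{chunk}" if pending else chunk
        if pending and len(candidate) > max_size:
            # Merging would overflow the limit, so emit the pending text alone
            merged.append(pending)
            candidate = chunk
        if len(candidate) < min_size:
            pending = candidate
        else:
            merged.append(candidate)
            pending = ""

    if pending:
        # Trailing small chunk: attach it to the previous chunk when it fits
        if merged and len(merged[-1]) + 2 + len(pending) <= max_size:
            merged[-1] = f"{merged[-1]}\n\n{pending}"
        else:
            merged.append(pending)
    return merged


chunks = ["# Setup", "Install the package and run the init command. " * 5, "Done."]
for chunk in merge_small_chunks(chunks, min_size=50):
    print(repr(chunk[:40]), len(chunk))
```

A heading merged into the paragraph below it usually embeds better than the heading alone, since the heading adds context to the paragraph instead of standing as a near-empty vector.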

3. Recursive Chunking

This is a more sophisticated approach: we try to split at large boundaries first (sections), then medium (paragraphs), then small (sentences), until the chunks are the right size.

def recursive_chunk(
    text: str,
    chunk_size: int = 1000,
    separators: list[str] = None
) -> list[str]:
    """
    Recursively split text, trying larger separators first.

    Args:
        text: Input text to chunk
        chunk_size: Target maximum chunk size
        separators: Ordered list of separators to try

    Returns:
        List of chunks
    """
    if separators is None:
        separators = [
            "\n## ",      # Markdown h2
            "\n### ",     # Markdown h3
            "\n\n",       # Paragraphs
            "\n",         # Lines
            ". ",         # Sentences
            " ",          # Words
        ]

    # Base case: text is small enough
    if len(text) <= chunk_size:
        return [text.strip()] if text.strip() else []

    # Try each separator
    for separator in separators:
        if separator in text:
            parts = text.split(separator)

            chunks = []
            current = ""

            for part in parts:
                candidate = current + separator + part if current else part

                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current.strip())
                    # Recursively chunk the part if it's too large
                    if len(part) > chunk_size:
                        chunks.extend(recursive_chunk(part, chunk_size, separators[separators.index(separator)+1:]))
                        current = ""  # Reset so already-flushed text is not emitted twice
                    else:
                        current = part

            if current.strip():
                chunks.append(current.strip())

            return chunks

    # No separator worked, force split
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

Pros

Best of both worlds: respects semantics while maintaining chunk size bounds
Adapts to document structure automatically

Cons

More complex
Behavior depends on separator ordering

When to use?

Production RAG systems where retrieval quality matters
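Before settling on a strategy, it helps to look at the chunk sizes each one actually produces on your documents. The helper below is a small sketch for that comparison; `chunk_stats` is a name invented here, and it measures characters, matching the chunkers in this issue.

```python
from statistics import mean


def chunk_stats(chunks: list[str], chunk_size: int) -> dict[str, float]:
    """Summarize chunk lengths and count chunks that exceed `chunk_size`."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "oversized": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        # Chunks above the target size point at a separator list that ran out
        "oversized": sum(1 for n in lengths if n > chunk_size),
    }


# Hypothetical chunker output, just to show the shape of the summary
example_chunks = ["Intro paragraph.", "A much longer section body " * 10, "End."]
print(chunk_stats(example_chunks, chunk_size=200))
```

Running this on the output of each chunker, with the same documents and the same `chunk_size`, makes the trade-off between predictable sizes and semantic boundaries visible before any embedding is computed.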


Context Compression: Fit More Signal, Less Noise

Compression strategies help when we need information from a large document but can’t fit it all in context.
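As one illustration of the idea, and only as an assumption about what such a strategy can look like, here is a naive extractive compressor: it keeps the sentences that share the most words with the query until a character budget is used up. `compress_for_query` and `budget_chars` are hypothetical names, and plain word overlap is a crude stand-in for a real relevance score such as embedding similarity.

```python
import re


def compress_for_query(document: str, query: str, budget_chars: int = 500) -> str:
    """Keep the sentences most related to `query`, within `budget_chars` characters."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', document) if s.strip()]
    query_terms = set(re.findall(r'\w+', query.lower()))

    def score(sentence: str) -> int:
        # Number of distinct query words that also appear in the sentence
        return len(query_terms & set(re.findall(r'\w+', sentence.lower())))

    # Rank sentence indices by overlap, highest first; ties keep document order
    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))

    kept, used = [], 0
    for i in ranked:
        if score(sentences[i]) == 0:
            break  # Remaining sentences share no words with the query
        cost = len(sentences[i]) + 1  # +1 for the joining space
        if used + cost <= budget_chars:
            kept.append(i)
            used += cost

    # Restore the original order so the compressed text still reads naturally
    return " ".join(sentences[i] for i in sorted(kept))


doc = (
    "Billing runs nightly. The cache expires after ten minutes. "
    "Cache keys include the tenant id. Lunch is at noon."
)
print(compress_for_query(doc, "How long until the cache expires?"))
```

Common words like “the” inflate the overlap score here, so a real version would at least filter stop words before ranking.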
