Hey everyone, welcome to the twenty-second issue of The Main Thread.
Last week was eventful: Andrej Karpathy dropped a banger where he implemented a micro GPT in just about 200 lines of code. Pure Python magic. No external libraries/dependencies.
I saw this as a great learning opportunity and implemented my own in pure Java (again, without any external dependencies). If you are interested, you can find it here: microgpt-java.
Onto today’s topic now.
The Problem
Let’s say we have a simple job that processes 5000 documents through Claude Opus 4.5, extracts some structured data, and saves it to the DB. It’s a simple job, and I am sure you have worked with dozens like it. But this time, something is different.
We run this job. About 10 minutes in, we start hitting rate limits. Normal enough. The retry logic kicks in. Also normal.
What isn’t normal: the retry logic has no jitter.
Every failed request waits for exactly one second, then gets retried. Every. Single. One. At the same time. The synchronized retries create waves of traffic that keep triggering more rate limits, which create more retries, which create more waves.
Eventually we have to kill the job and fix the code. The 15 lines we add take us down a rabbit hole that I want to share with you.
These patterns apply to every LLM application, and getting them wrong is painfully easy.
The Pattern That Could Have Saved Us: Exponential Backoff With Jitter
Here’s what we should know: when we retry at fixed intervals, we create synchronized retry storms. A thousand clients hitting a rate limit at the same moment will all wait 1 second, all retry together, all fail together, all wait 2 seconds, all retry together… you see where it goes.
Jitter fixes this by adding randomness:
```python
import random

def get_retry_delay(attempt: int, base_delay: float = 1.0) -> float:
    # 1s, 2s, 4s, 8s...
    max_delay = base_delay * (2 ** attempt)
    # Random point in that range ("full jitter")
    return random.uniform(0, max_delay)
```

That random.uniform is doing the heavy lifting. Instead of everyone retrying at t=1.0, requests retry at t=0.2, t=0.7, t=0.3, t=0.9… The synchronized spikes disappear.
This is one of those patterns that seems obvious in retrospect but isn’t obvious when we are writing the code at 11 PM trying to ship a feature.
What Engineers Don’t Know About Rate Limits
Here’s something that surprised me: LLM providers enforce two rate limits simultaneously. OpenAI limits both requests per minute (RPM) and tokens per minute (TPM). Anthropic does the same.
We can be well under our request limit but hitting our token limit because we are sending large prompts.
The API actually tells us exactly where we stand:

```text
x-ratelimit-remaining-requests: 499
x-ratelimit-remaining-tokens: 12456
x-ratelimit-reset-tokens: 8s
```

These headers are in every response. Most client libraries ignore them. I have ignored them. Production code shouldn’t.
Reading these headers and throttling before we hit a 429 is the difference between graceful degradation and the retry storm we started with.
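Here is a minimal sketch of what header-aware throttling can look like. The header names match what OpenAI returns; the helper name and the plain-dict interface are my own assumptions, and the reset-value parsing is simplified (real values can combine units, e.g. "6m0s"):

```python
def throttle_if_needed(headers: dict, min_remaining: int = 5) -> float:
    """Return how many seconds to sleep before the next request.

    `headers` is assumed to be the response headers as a plain dict;
    adapt the lookup to your HTTP client.
    """
    remaining = int(headers.get("x-ratelimit-remaining-requests", min_remaining + 1))
    if remaining > min_remaining:
        return 0.0  # plenty of headroom, no need to wait
    # Reset values look like "8s" or "320ms"; parse simple cases only.
    reset = headers.get("x-ratelimit-reset-requests", "1s")
    if reset.endswith("ms"):
        return float(reset[:-2]) / 1000.0
    return float(reset.rstrip("s"))
```

Call this after every response and sleep for whatever it returns; that way the slowdown happens a few requests before the 429, not after it.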
My Current Setup
To prevent such incidents, I rebuilt our rate limiting from scratch. Here’s the stack:
Layer 1: Token Bucket Smoothing
Instead of sending requests as fast as possible and handling failures, I now meter requests proactively:
```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # Tokens added per second
        self.capacity = capacity  # Burst allowance
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self):
        # Block until a token is available
        while self.tokens < 1:
            self._refill()
            time.sleep(0.01)
        self.tokens -= 1

    def _refill(self):
        elapsed = time.monotonic() - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = time.monotonic()
```

For our 60 RPM limit, I set rate=1.0 (one token per second) with capacity=10 (allowing short bursts). Requests naturally spread out instead of bunching.
Layer 2: Priority Queue
Not all requests are equal. A user waiting for a response shouldn’t be stuck behind a batch job. So I added priority levels:
```python
# Priority 1: Interactive users (process immediately)
# Priority 5: Normal operations
# Priority 10: Background batch jobs (can wait)

queue.submit(user_request, priority=1)
queue.submit(batch_job, priority=10)
```

When we are near capacity, batch jobs pause while users keep getting served. This was a game-changer for user experience during high-load periods.
Layer 3: Cost Circuit Breaker
After almost running up a $2,000 bill from a bug in development (caught it at $180, thankfully), I added hard spending limits:
```python
cost_limiter = CostAwareRateLimiter(daily_budget_usd=100.0)

def make_request(prompt):
    if not cost_limiter.check_budget():
        raise BudgetExceededError("Daily limit reached")
```

This has saved us twice now: once from a bug, once from an unexpected traffic spike.
The Mistake I See Everyone Making
Here’s what I notice when I review LLM application code: it treats rate limiting as an error handling problem.
It is not an error handling problem. It is a design problem.
If you are regularly hitting rate limits and retrying, your system is fundamentally designed wrong. The retries should be a safety net, not the primary flow control mechanism.
The goal is to never hit the 429 in the first place. Token bucket rate limiting achieves this by smoothing the requests to stay just under the limit. You use the full capacity without exceeding it.
Retries with backoff are for the edge cases: network blips, temporary overload, the 1% of situations where proactive limiting isn’t enough.
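Putting that together, the retry wrapper stays thin because it is only the safety net. A sketch, where `call` stands in for your actual API call and the bare `except Exception` should be narrowed to your client’s 429/timeout errors:

```python
import random
import time

def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry `call` with full-jitter exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # narrow this to rate-limit/transient errors in real code
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Full jitter: random delay in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

If the token bucket is doing its job, the except branch runs for the 1% of cases, not as the main loop.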
What To Do Differently Starting Today
If I were building an LLM application from scratch today, here’s the order I’d implement these:
1. Token bucket rate limiting: before writing any retry logic
2. Cost monitoring and alerts: before going to production
3. Exponential backoff with jitter: as the safety net
4. Priority queueing: when you have mixed workloads
Most tutorials start with retry logic because it’s the first thing that breaks. But proactive rate limiting prevents the breakage in the first place.
A Question For You
I am building out more content on LLM engineering patterns. What's causing you the most pain right now?
Prompt management and versioning?
Evaluation and testing?
Cost optimization?
Something else?
Reply to this email: I read every response and it directly shapes what I write about next.
Until next week,
Anirudh
P.S. If this was useful, forward it to someone on your team who's building with LLMs. It's the best way to help The Main Thread grow.

