Hey curious engineers, welcome to the twenty third issue of The Main Thread.

Wrong timeout settings cause more outages than bugs.

I know that sounds dramatic. But think about it: your code can be perfect, your logic flawless, your tests thorough, and your system still fails because someone set a timeout to 30 seconds when it should have been 3.

In my career, I have seen a 5-minute outage turn into a two-hour cascading failure because retry settings amplified a small problem into a catastrophe. As an on-call engineer, I have felt the pain directly, staring at dashboards showing “everything is timing out” with no idea which timeout was responsible. I have also debugged incidents where changing a single number in a config file fixed the issue.

To be honest, this stuff is not glamorous and it doesn’t get conference talks, yet it is the difference between a robust and a flaky system.

Let’s fix that.

Four Timeout Types

When someone says “the timeout is 10 seconds”, they could mean four different things. Each has different failure modes, and confusing them causes real problems.

1. Connection Timeout

This is how long you wait to establish a connection (TCP handshake). If the server is down, overloaded, or unreachable, you will encounter this timeout.

# (connect_timeout, read_timeout)
requests.get(url, timeout=(3.0, 10.0))  
# The first value is connection timeout

Typical values are 1-5 seconds. Connection establishment should be fast. If the server can’t accept your connection in 3 seconds, something is seriously wrong.

A common mistake is setting this too high. A 30-second connection timeout means your thread sits blocked for 30 seconds before finding that the server is unreachable. If you multiply this by hundreds of requests, you will exhaust your thread pool.
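The thread-pool math here is just Little’s law: average in-flight requests ≈ arrival rate × wait time. A quick sketch (the request rate and pool size are assumed numbers, not from the original):

```python
def concurrent_requests(arrival_rate_per_s: float, wait_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x wait time."""
    return arrival_rate_per_s * wait_s

# Assumed numbers: 10 req/s to an unreachable dependency.
print(concurrent_requests(10, 30.0))  # 300.0 blocked threads with a 30s connect timeout
print(concurrent_requests(10, 3.0))   # 30.0 with a 3s connect timeout
```

With a typical 200-thread pool, the 30-second setting alone is enough to exhaust it.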

2. Read Timeout

This is how long you wait for the server to send data once the connection is established. This covers the time from “request sent” to “first byte received” and between subsequent chunks.

# The second value is read timeout
requests.get(url, timeout=(3.0, 10.0))

Its typical values depend on what you are calling. A cache lookup should respond in 50ms. A complex DB query might need 5 seconds. An ML inference call might need 30 seconds.

A common mistake I have seen engineers make is using the same read timeout for everything. Your cache client and your report-generation service have wildly different performance characteristics. You must treat them differently.
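One way to enforce that is a per-dependency timeout table. A minimal sketch; the service names and values are assumptions, derived from each dependency’s measured latency plus headroom:

```python
# Per-dependency (connect, read) timeouts. These names and numbers are
# illustrative -- derive yours from each dependency's measured p99 latency.
TIMEOUTS = {
    "cache":   (1.0, 0.2),   # cache lookups should answer in ~50ms
    "db":      (2.0, 5.0),   # complex queries need more read time
    "reports": (2.0, 30.0),  # report generation is legitimately slow
}

def timeout_for(service: str) -> tuple:
    """Look up the (connect, read) pair to pass as requests' timeout=."""
    return TIMEOUTS[service]

# e.g. requests.get(url, timeout=timeout_for("cache"))
```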

3. Write Timeout

This timeout controls how long to wait while sending data to the server. It is less common but comes into the picture for large uploads or slow networks.

Naturally, it’s proportional to the size of the payload. A small JSON body doesn’t need much time but a large file upload needs significantly more.
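A rough sizing rule is a fixed floor plus a payload-proportional term. The constants here are assumptions (a pessimistic client upload rate), not recommendations:

```python
def write_timeout(payload_bytes: int,
                  min_seconds: float = 5.0,
                  assumed_bps: float = 1_000_000) -> float:
    """
    Scale the write timeout with payload size.
    assumed_bps is a pessimistic floor for upload throughput -- an
    assumption; measure your real traffic before picking a number.
    """
    return min_seconds + payload_bytes / assumed_bps

print(write_timeout(2_000))        # small JSON body: ~5s
print(write_timeout(50_000_000))   # 50 MB upload: 55s
```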

4. Overall Deadline

This is the total time allowed for an entire operation, including retries. This is the most important timeout and the most commonly forgotten.

# Without a deadline: each retry gets the full timeout
# 3 retries × 10s timeout = 30s total (worst case)

# With a deadline: total operation time is bounded
deadline = time.time() + 5.0  # 5 seconds from now
while time.time() < deadline:
    try:
        # make_request is a placeholder for your actual network call
        return make_request(timeout=min(2.0, deadline - time.time()))
    except TimeoutError:
        continue
raise DeadlineExceeded()  # app-defined exception

A common mistake is configuring per-request timeouts without an overall deadline. Your users don’t give a damn that each retry was 10 seconds; they give a damn that they waited 40 seconds before seeing an error.

Anatomy of a Timeout Incident

Let me walk you through how timeout misconfiguration causes real outages.

Setup

Service A calls Service B, which calls Service C. Each service has a 30-second timeout configured. Service C is actually a DB that normally responds in 50ms.

Trigger

Service C gets slow. A bad query plan, disk pressure, whatever. Response times go from 50ms to 35 seconds.

What Happens

  1. Service C takes 35 seconds to respond (exceeding the 30-second timeout)

  2. Service B times out, returns error to Service A

  3. Service A’s request takes 30+ seconds, exhausting its thread pool

  4. Service A cannot accept new requests: it’s now “down”

  5. Users see Service A as the problem

  6. On-call engineer debugs Service A, finds nothing wrong with its code

  7. 45 minutes later, someone checks Service C.

Cascade

  • Service C was slow: 1 service affected

  • Service B and A had long timeouts: 2 more services affected

  • These long timeouts caused thread-pool exhaustion, and users faced an outage

  • The on-call engineer investigated the wrong service first, which extended the incident

Fix

Service A’s timeout to Service B should be 3 seconds, not 30. If the response is not coming in 3 seconds, fail fast and show an error to the user. It is foolish to let one slow dependency take down every damn thing.

Exponential Backoff With Jitter

When a request fails, the instinct is to retry immediately. This is almost always wrong.

If the server is overloaded, retrying immediately piles more load on an already struggling system. Imagine a thousand clients all retrying at the same instant: you get a “thundering herd” that makes recovery harder.

Exponential backoff is a technique that increases wait time between retries:

Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds

This approach gives the server breathing room to recover.

But there is a problem: if all clients use the same backoff schedule, they will still retry in sync, just less frequently. A thousand clients all waiting “2 seconds” still produce a spike at t=2s.

Jitter adds randomness to break the synchronization:

import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """
    Retry a function with exponential backoff and full jitter.

    Args:
        func: Function to call (should raise exception on failure)
        max_retries: Maximum retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay cap

    Returns:
        Result of successful function call

    Raises:
        Last exception if all retries exhausted
    """
    last_exception = None

    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            last_exception = e

            if attempt == max_retries - 1:
                raise

            # Exponential backoff with full jitter
            max_wait = min(base_delay * (2 ** attempt), max_delay)
            actual_wait = random.uniform(0, max_wait)

            time.sleep(actual_wait)

    raise last_exception

Why “Full Jitter”?

We pick a random value between 0 and the current backoff cap (which doubles each attempt, up to max_delay). This spreads retries uniformly across the backoff window. Some clients retry quickly, some wait longer, and the synchronized spikes disappear.

Let’s see the math:

Attempt 1: uniform(0, 1)   → average 0.5s
Attempt 2: uniform(0, 2)   → average 1.0s
Attempt 3: uniform(0, 4)   → average 2.0s
Attempt 4: uniform(0, 8)   → average 4.0s

Each client follows a different random path. The aggregate retry traffic smooths out instead of spiking.
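That smoothing is easy to verify in a toy simulation (the fleet size and backoff window here are assumed numbers):

```python
import random
from collections import Counter

random.seed(7)
CLIENTS = 1000  # assumed fleet size

# Fixed backoff: every client retries at exactly t=2s.
fixed = [2.0] * CLIENTS
# Full jitter: each client retries at uniform(0, 2) seconds.
jittered = [random.uniform(0, 2.0) for _ in range(CLIENTS)]

def peak_per_100ms(times):
    """Worst-case number of retries landing in any single 100ms bucket."""
    buckets = Counter(int(t * 10) for t in times)
    return max(buckets.values())

print(peak_per_100ms(fixed))     # all 1000 retries hit the same instant
print(peak_per_100ms(jittered))  # far smaller -- spread across ~20 buckets
```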

Deadline Propagation

Here’s a scenario that catches even experienced engineers off-guard.

Your API has a 5-second deadline. It calls three services sequentially:

  • Service A: 2-second timeout

  • Service B: 2-second timeout

  • Service C: 2-second timeout

Total possible time: 6 seconds. But your deadline is 5 seconds.

If Service A is slow (takes 1.9s), you have burned almost 2 seconds. Service B and C now have to complete in 3.1 seconds combined. But each still thinks it has 2 seconds.

Deadline Propagation passes the remaining time budget to each downstream call:

import time
from contextlib import contextmanager

@contextmanager
def deadline_context(deadline_timestamp):
    """Context manager that tracks remaining deadline."""
    ctx = {'deadline': deadline_timestamp}
    yield ctx

def call_with_deadline(ctx, service_func, min_timeout=0.1):
    """Call a service respecting the propagated deadline."""
    remaining = ctx['deadline'] - time.time()

    if remaining <= 0:
        raise DeadlineExceeded("No time remaining")

    timeout = max(remaining, min_timeout)
    return service_func(timeout=timeout)

# Usage
def handle_request():
    deadline = time.time() + 5.0  # 5 seconds from now

    with deadline_context(deadline) as ctx:
        # Each call gets remaining time, not fixed timeout
        result_a = call_with_deadline(ctx, service_a.call)
        result_b = call_with_deadline(ctx, service_b.call)
        result_c = call_with_deadline(ctx, service_c.call)

        return combine(result_a, result_b, result_c)

Now, if Service A takes 1.9 seconds, Service B and C share the remaining 3.1 seconds. If Service A takes 4 seconds, Service B and C get a total of 1 second which might mean failing fast instead of trying.

gRPC does this automatically with its deadline propagation. When you set a deadline on a gRPC call, it is passed in headers to downstream services, and they can read the remaining time.

If you are using REST, you need to implement this yourself (often via headers like X-Deadline or X-Request-Timeout).
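A minimal sketch of what that could look like, using the X-Request-Timeout name mentioned above. The value format and default are assumptions, since there is no standard; caller and callee must agree on the convention:

```python
import time

def deadline_headers(deadline_ts: float) -> dict:
    """Encode the remaining budget (in seconds) for a downstream REST call."""
    remaining = deadline_ts - time.time()
    if remaining <= 0:
        raise TimeoutError("deadline already exceeded")
    return {"X-Request-Timeout": f"{remaining:.3f}"}

def parse_deadline(headers: dict, default_s: float = 10.0) -> float:
    """Downstream side: turn the header back into an absolute deadline."""
    budget = float(headers.get("X-Request-Timeout", default_s))
    return time.time() + budget
```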

Preventing Retry Amplification

Retries seem harmless. The request failed; try again. But in distributed systems, retries multiply.

Consider a call chain: A → B → C → D.

  • D fails 50% of requests

  • C retries failed D calls (2 attempts each)

  • B retries failed C calls (2 attempts each)

  • A retries failed B calls (2 attempts each)

If every layer retries twice, a single user request can generate:

User → A: 1 request
A → B: up to 2 requests
B → C: up to 4 requests
C → D: up to 8 requests

One user request generates eight requests to D. If you have 1000 concurrent users, D receives 8000 requests. This is called retry amplification, and it turns minor problems into major outages.
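The worst case is exponential in chain depth, which a one-liner makes concrete:

```python
def worst_case_requests(attempts_per_hop: int, depth: int) -> int:
    """
    Worst-case requests reaching the service `depth` hops down the chain,
    when every hop makes up to `attempts_per_hop` attempts (1 try + retries).
    """
    return attempts_per_hop ** depth

print(worst_case_requests(2, 3))  # the A -> B -> C -> D chain above: 8 requests to D
print(worst_case_requests(3, 3))  # add one more retry per hop: 27
```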

Solution: Retry Budget

Instead of “retry up to 2 times per request”, think “retry up to 10% of total requests”.

import threading
import time

class RetryBudget:
    """
    Limits retries to a percentage of total requests.

    If we are making 1000 requests/second, and budget is 0.1 (10%), we allow at most 100 retries/second.
    """

    def __init__(self, budget_ratio=0.1, window_seconds=10):
        self.budget_ratio = budget_ratio
        self.window_seconds = window_seconds
        self.requests = []
        self.retries = []
        self.lock = threading.Lock()

    def record_request(self):
        """Call this for every request (including retries)."""
        with self.lock:
            now = time.time()
            self.requests.append(now)
            self._cleanup(now)

    def can_retry(self) -> bool:
        """Returns True if retry budget allows another retry."""
        with self.lock:
            now = time.time()
            self._cleanup(now)

            total_requests = len(self.requests)
            total_retries = len(self.retries)

            if total_requests == 0:
                return True

            current_ratio = total_retries / total_requests
            return current_ratio < self.budget_ratio

    def record_retry(self):
        """Call this when performing a retry."""
        with self.lock:
            self.retries.append(time.time())

    def _cleanup(self, now):
        """Remove old entries outside the window."""
        cutoff = now - self.window_seconds
        self.requests = [t for t in self.requests if t > cutoff]
        self.retries = [t for t in self.retries if t > cutoff]

# Usage
retry_budget = RetryBudget(budget_ratio=0.1)

def make_request_with_budget(func):
    retry_budget.record_request()

    try:
        return func()
    except RetryableError:
        if retry_budget.can_retry():
            retry_budget.record_retry()
            retry_budget.record_request()
            return func()  # One retry
        else:
            raise  # Budget exhausted, fail fast

Now under heavy failure conditions, retries are capped at 10% of traffic instead of multiplying uncontrollably.

Circuit Breakers

Retries assume that failure is transient: try again and it might work. But what if the downstream is completely down? Retrying just wastes resources and delays the inevitable failure.

Circuit Breakers detect sustained failures and “trip”, rejecting requests immediately without even attempting the call.

import time
import threading
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation, requests flow through
    OPEN = "open"          # Failing, reject requests immediately
    HALF_OPEN = "half_open"  # Testing if service recovered

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold=5,      # Failures before opening
        recovery_timeout=30.0,    # Seconds before trying again
        success_threshold=2       # Successes to close again
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = threading.Lock()

    def call(self, func):
        """Execute function through circuit breaker."""
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit breaker is open")

        try:
            result = func()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            raise

    def _should_attempt_reset(self):
        """Check if enough time passed to try again."""
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.recovery_timeout

    def _record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    self.success_count = 0

    def _record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()

            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
                self.success_count = 0
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

The circuit breaker has three states:

1. CLOSED

This is normal operation. Requests go through and failures are counted.

2. OPEN

Too many failures occurred. Requests are rejected immediately with CircuitOpenError. This is “fail fast”: don’t waste time on a request that will probably fail.

3. HALF_OPEN

After the recovery timeout, we allow a few requests through. If they succeed, close the circuit. If they fail, reopen it.

This approach protects both your service (don’t exhaust resources on doomed requests) and the downstream service (don’t pile more load on a struggling system).
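The state transitions are easiest to see with a condensed breaker driven by a fake clock. This is a stripped-down sketch with assumed thresholds (2 failures to open, 1 probe success to close), not the full class above:

```python
class TinyBreaker:
    """Condensed breaker: 2 failures open it, 1 probe success closes it."""
    def __init__(self, recovery=30.0):
        self.state = "closed"
        self.failures = 0
        self.opened_at = None
        self.recovery = recovery

    def before_call(self, now):
        if self.state == "open":
            if now - self.opened_at >= self.recovery:
                self.state = "half_open"  # let one probe request through
            else:
                raise RuntimeError("circuit open, failing fast")

    def on_failure(self, now):
        self.failures += 1
        if self.state == "half_open" or self.failures >= 2:
            self.state = "open"
            self.opened_at = now

    def on_success(self):
        if self.state == "half_open":
            self.state = "closed"
            self.failures = 0

# Walk the lifecycle with a fake clock instead of real sleeps:
b = TinyBreaker()
b.on_failure(now=0)
b.on_failure(now=1)      # second failure trips the breaker
assert b.state == "open"
b.before_call(now=31)    # recovery window elapsed -> probe allowed
assert b.state == "half_open"
b.on_success()           # probe succeeded -> back to normal
assert b.state == "closed"
```

Driving the breaker with an explicit `now` also makes it trivially unit-testable, which the wall-clock version above is not.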

Putting It All Together: A Production-Ready Client

Below is how these patterns combine in a real HTTP client:

import time
import random
import requests

# App-defined exceptions used below (CircuitBreaker and RetryBudget are the
# classes defined earlier in this article)
class DeadlineExceeded(Exception): pass
class RetryBudgetExhausted(Exception): pass
class CircuitOpenError(Exception): pass

class ResilientClient:
    def __init__(
        self,
        base_url,
        connect_timeout=3.0,
        read_timeout=10.0,
        max_retries=3,
        retry_budget_ratio=0.1,
        circuit_failure_threshold=5,
        circuit_recovery_timeout=30.0
    ):
        self.base_url = base_url
        self.connect_timeout = connect_timeout
        self.read_timeout = read_timeout
        self.max_retries = max_retries

        self.circuit_breaker = CircuitBreaker(
            failure_threshold=circuit_failure_threshold,
            recovery_timeout=circuit_recovery_timeout
        )
        self.retry_budget = RetryBudget(budget_ratio=retry_budget_ratio)

    def get(self, path, deadline=None, **kwargs):
        """
        Make a GET request with full resilience patterns.

        Args:
            path: URL path to request
            deadline: Absolute timestamp by which request must complete
            **kwargs: Additional arguments to requests.get
        """
        deadline = deadline or (time.time() + self.read_timeout + 5)
        url = f"{self.base_url}{path}"

        def make_request():
            remaining = deadline - time.time()
            if remaining <= 0:
                raise DeadlineExceeded("Request deadline exceeded")

            timeout = (
                self.connect_timeout,
                min(self.read_timeout, remaining)
            )

            response = requests.get(url, timeout=timeout, **kwargs)
            response.raise_for_status()
            return response

        # Execute with retry logic; the circuit breaker is applied per attempt
        return self._execute_with_retry(make_request, deadline)

    def _execute_with_retry(self, func, deadline):
        """Execute function with retry logic."""
        last_exception = None

        for attempt in range(self.max_retries):
            self.retry_budget.record_request()

            try:
                return self.circuit_breaker.call(func)

            except CircuitOpenError:
                raise  # Don't retry if circuit is open

            except (requests.Timeout, requests.ConnectionError) as e:
                last_exception = e

                # Check if we should retry
                if attempt == self.max_retries - 1:
                    raise

                if not self.retry_budget.can_retry():
                    raise RetryBudgetExhausted("Retry budget exhausted")

                if time.time() >= deadline:
                    raise DeadlineExceeded("No time for retry")

                # Backoff with jitter
                self.retry_budget.record_retry()
                max_wait = min(1.0 * (2 ** attempt), deadline - time.time())
                if max_wait > 0:
                    time.sleep(random.uniform(0, max_wait))

            except requests.HTTPError as e:
                # Don't retry client errors (4xx)
                if 400 <= e.response.status_code < 500:
                    raise
                last_exception = e
                # Retry server errors (5xx)

        raise last_exception

Features of this client:

  • Uses appropriate timeouts for connection vs read

  • Propagates deadlines to bound total operation time

  • Retries with exponential backoff and jitter

  • Respects retry budgets to prevent amplification

  • Uses circuit breaker to fail fast on sustained failures

  • Distinguishes retryable errors (timeouts, 5xx) from non-retryable (4xx)

Checklist: Get Your Timeouts Right

Before deploying any service that makes network calls, use this checklist:

Timeouts

  • Connection timeout is short (1-5 seconds)

  • Read timeout matches the actual performance of the downstream service

  • Overall deadline bounds total user-facing latency

  • Timeouts are configured per-dependency, not globally

Retries

  • Using exponential backoff (not immediate retry)

  • Using jitter (not synchronized backoff)

  • Retry budget caps total retry ratio

  • Only retrying idempotent operations (or using idempotency keys)

  • Not retrying client errors (4xx)

Circuit Breakers

  • Circuit breaker on external dependencies

  • Appropriate failure threshold (not too sensitive, not too tolerant)

  • Recovery timeout gives downstream time to recover

  • Monitoring/alerting when circuit opens

Observability

  • Logging timeout values in timeout errors

  • Metrics on retry rate and circuit breaker state

  • Dashboards showing p99 latency per dependency

Takeaway

Distributed systems fail; it’s their fundamental property. Networks are unreliable, services get overloaded, and dependencies go down.

Your job is to handle this failure gracefully. Short timeouts mean you find out fast. Exponential backoff with jitter means you don’t make problems worse. Deadlines mean users get errors in seconds, not minutes. Retry budgets mean one failing service doesn’t cascade everywhere. Circuit breakers mean you stop trying when it’s hopeless.

This is not complicated at all. All the concepts you need to know fit in one article. But the difference between systems that implement these patterns and systems that don’t is the difference between a 5-minute incident and a multi-hour outage.

Get your timeouts right. Your on-call rotation will thank you.

What’s the worst timeout-related incident you have seen? I collect these stories because they teach more than any architecture diagram.

Hit reply and tell me. I read every response.

If this was useful, forward it to your team. These patterns should be muscle memory for every engineer building distributed systems. And if you want more deep dives like this, subscribe to The Main Thread, one practical engineering essay per week.

Keep reading