Hey curious engineers, welcome to the twenty-third issue of The Main Thread.
Wrong timeout settings cause more outages than bugs.
I know that sounds dramatic. But think about it: your code can be perfect, your logic flawless, your tests thorough, and your system still fails because someone set a timeout to 30 seconds when it should have been 3.
In my career, I have seen a 5-minute outage turn into a two-hour cascading failure because retry settings amplified a small problem into a catastrophe. As an on-call engineer, I have felt the pain directly, staring at dashboards showing “everything is timing out” with no idea which timeout was responsible. I have also debugged incidents where changing a single number in a config file fixed the issue.
To be honest, this stuff is not glamorous and it doesn’t get conference talks, yet it is the difference between a robust and a flaky system.
Let’s fix that.
Four Timeout Types
When someone says “the timeout is 10 seconds”, they could mean four different things. Each has different failure modes, and confusing them causes real problems.
1. Connection Timeout
This is how long you wait to establish a connection (TCP handshake). If the server is down, overloaded, or unreachable, you will encounter this timeout.
# (connect_timeout, read_timeout)
requests.get(url, timeout=(3.0, 10.0))
# The first value is the connection timeout

The typical value is 1-5 seconds. Connection establishment should be fast. If the server can’t accept your connection in 3 seconds, something is seriously wrong.
A common mistake is setting this too high. A 30-second connection timeout means your thread sits blocked for 30 seconds before discovering that the server is unreachable. Multiply that by hundreds of concurrent requests and you exhaust your thread pool.
2. Read Timeout
This is how long you wait for the server to send data once the connection is established. This covers the time from “request sent“ to the “first byte received“ and between subsequent chunks.
# The second value is the read timeout
requests.get(url, timeout=(3.0, 10.0))

Typical values depend on what you are calling. A cache lookup should respond in 50ms. A complex DB query might need 5 seconds. An ML inference call might need 30 seconds.
A common mistake I have seen engineers make is using the same read timeout for everything. Your cache client and your report-generation service have wildly different performance characteristics. You must treat them differently.
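One way to enforce per-dependency timeouts is a simple lookup table instead of a global constant. A minimal sketch (the dependency names and numbers here are illustrative, not from any real system):

```python
# Illustrative per-dependency timeouts: (connect_timeout, read_timeout)
TIMEOUTS = {
    "cache": (0.5, 0.05),        # cache lookups should be near-instant
    "database": (1.0, 5.0),      # complex queries may legitimately take seconds
    "ml_inference": (2.0, 30.0)  # model calls are slow by nature
}

def timeout_for(dependency, default=(3.0, 10.0)):
    """Look up a dependency-specific timeout instead of a global default."""
    return TIMEOUTS.get(dependency, default)

# Usage with requests:
# requests.get(url, timeout=timeout_for("cache"))
```

The point is that the timeout lives next to the dependency it describes, so a new dependency forces an explicit decision instead of inheriting a default that was tuned for something else.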
3. Write Timeout
This is how long you wait while sending data to the server. It is less common, but comes into the picture for large uploads or slow networks.
Naturally, it’s proportional to the size of the payload. A small JSON body doesn’t need much time, but a large file upload needs significantly more.
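One way to make that proportionality concrete is to derive the write timeout from the payload size and a worst-case bandwidth floor. A sketch, where the 1 Mbit/s floor (125,000 bytes/s), the base, and the cap are all illustrative assumptions:

```python
def upload_timeout(payload_bytes, min_bandwidth_bps=125_000, base=2.0, cap=120.0):
    """Scale the write timeout to the payload size.

    Assumes a worst-case usable bandwidth (125,000 bytes/s = 1 Mbit/s,
    an arbitrary illustrative floor) plus a fixed base for handshakes,
    bounded by a hard cap so one upload can never block forever.
    """
    transfer_time = payload_bytes / min_bandwidth_bps
    return min(base + transfer_time, cap)
```

A 1 KB JSON body gets essentially the base 2 seconds; a 100 MB upload gets the full cap.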
4. Overall Deadline
This is the total time allowed for an entire operation, including retries. This is the most important timeout and the most commonly forgotten.
# Without a deadline: each retry gets the full timeout
# 3 retries × 10s timeout = 30s total (worst case)

# With a deadline: total operation time is bounded
deadline = time.time() + 5.0  # 5 seconds from now
while time.time() < deadline:
    try:
        return make_request(timeout=min(2.0, deadline - time.time()))
    except TimeoutError:
        continue
raise DeadlineExceeded()

A common mistake is configuring per-request timeouts without an overall deadline. Your users don’t give a damn that each retry got 10 seconds; they give a damn that they waited 40 seconds before seeing an error.
Anatomy of a Timeout Incident
Let me walk you through how timeout misconfiguration causes real outages.
Setup
Service A calls Service B, which calls Service C. Each service has a 30-second timeout configured. Service C is actually a DB that normally responds in 50ms.
Trigger
Service C gets slow. A bad query plan, disk pressure, whatever. Response times go from 50ms to 35 seconds.
What Happens
Service C takes 35 seconds to respond (exceeding the 30-second timeout)
Service B times out, returns error to Service A
Service A’s request takes 30+ seconds, exhausting its thread pool
Service A cannot accept new requests: it’s now “down”
Users see Service A as the problem
On-call engineer debugs Service A, finds nothing wrong with its code
45 minutes later, someone checks Service C.
Cascade
Service C was slow: 1 service affected
Service B and A had long timeouts: 2 more services affected
These long timeouts caused thread pool exhaustion and users faced outage
On-call engineer investigated the wrong service first, extending the incident duration
Fix
Service A’s timeout to Service B should be 3 seconds, not 30. If the response doesn’t come within 3 seconds, fail fast and show an error to the user. It is foolish to let one slow dependency take down every damn thing.
Exponential Backoff With Jitter
When a request fails, the instinct is to retry immediately. This is almost always wrong.
If the server is overloaded, retrying immediately piles more load on an already struggling system. Imagine a thousand clients all retrying at the same instant: you get a “thundering herd” that makes recovery harder.
Exponential backoff is a technique that increases wait time between retries:
Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds

This approach gives the server breathing room to recover.
But there is a problem: if all clients use the same backoff schedule, they still retry in sync, just less frequently. A thousand clients all waiting “2 seconds” still produce a spike at t=2s.
Jitter adds randomness to break the synchronization:
import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """
    Retry a function with exponential backoff and full jitter.

    Args:
        func: Function to call (should raise an exception on failure)
        max_retries: Maximum retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay cap

    Returns:
        Result of the successful function call

    Raises:
        The last exception if all retries are exhausted
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter
            max_wait = min(base_delay * (2 ** attempt), max_delay)
            actual_wait = random.uniform(0, max_wait)
            time.sleep(actual_wait)

Why “Full Jitter“?
We pick a random value between 0 and max_delay . This spreads retries uniformly across the backoff window. Some clients retry quickly, some wait longer, and the synchronized spikes disappear.
Let’s see the math:
Attempt 1: uniform(0, 1) → average 0.5s
Attempt 2: uniform(0, 2) → average 1.0s
Attempt 3: uniform(0, 4) → average 2.0s
Attempt 4: uniform(0, 8) → average 4.0s

Each client follows a different random path. The aggregate retry traffic smooths out instead of spiking.
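A quick way to convince yourself of those averages is to sample full jitter many times and check the mean. A small sketch (the function name is mine; seeded so the run is reproducible):

```python
import random

def full_jitter_mean(base_delay, attempt, samples=100_000, seed=42):
    """Empirically estimate the average wait for one attempt under full jitter."""
    rng = random.Random(seed)
    max_wait = base_delay * (2 ** attempt)  # the exponential backoff window
    total = sum(rng.uniform(0, max_wait) for _ in range(samples))
    return total / samples

# For attempt 2 the window is uniform(0, 4), so the mean lands near 2.0 seconds.
```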
Deadline Propagation
Here’s a scenario that catches even experienced engineers off-guard.
Your API has a 5-second deadline. It calls three services sequentially:
Service A: 2-second timeout
Service B: 2-second timeout
Service C: 2-second timeout
Total possible time: 6 seconds. But your deadline is 5 seconds.
If Service A is slow (takes 1.9s), you have burned almost 2 seconds. Service B and C now have to complete in 3.1 seconds combined. But they still think that they have 2 seconds each.
Deadline Propagation passes the remaining time budget to each downstream call:
import time
from contextlib import contextmanager

@contextmanager
def deadline_context(deadline_timestamp):
    """Context manager that tracks the remaining deadline."""
    ctx = {'deadline': deadline_timestamp}
    yield ctx

def call_with_deadline(ctx, service_func, min_timeout=0.1):
    """Call a service respecting the propagated deadline."""
    remaining = ctx['deadline'] - time.time()
    if remaining <= 0:
        raise DeadlineExceeded("No time remaining")
    timeout = max(remaining, min_timeout)
    return service_func(timeout=timeout)

# Usage
def handle_request():
    deadline = time.time() + 5.0  # 5 seconds from now
    with deadline_context(deadline) as ctx:
        # Each call gets the remaining time, not a fixed timeout
        result_a = call_with_deadline(ctx, service_a.call)
        result_b = call_with_deadline(ctx, service_b.call)
        result_c = call_with_deadline(ctx, service_c.call)
        return combine(result_a, result_b, result_c)

Now, if Service A takes 1.9 seconds, Services B and C share the remaining 3.1 seconds. If Service A takes 4 seconds, Services B and C get a total of 1 second, which might mean failing fast instead of trying.
gRPC does this automatically with its deadline propagation. When you set a deadline on a gRPC call, it is passed in headers to downstream services, and they can read the remaining time.
If you are using REST, you need to implement this yourself (often via headers like X-Deadline or X-Request-Timeout).
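A minimal sketch of the header approach might look like this. There is no standard header for deadline propagation, so the X-Deadline name (carrying an absolute Unix timestamp) and both function names are illustrative assumptions:

```python
import time

def remaining_budget(headers, default=5.0):
    """Read an absolute Unix-timestamp deadline from an incoming X-Deadline
    header and return the remaining time budget in seconds."""
    raw = headers.get("X-Deadline")
    if raw is None:
        return default  # no deadline propagated; fall back to a local default
    return float(raw) - time.time()

def forward_headers(deadline_timestamp):
    """Propagate the same absolute deadline to the next hop downstream."""
    return {"X-Deadline": str(deadline_timestamp)}
```

Propagating an absolute timestamp (rather than a relative duration) means each hop can compute its own remaining budget without knowing how much time upstream hops consumed, though it does assume reasonably synchronized clocks.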
Preventing Retry Amplification
Retries seem harmless. The request failed; try again. But in distributed systems, retries multiply.
Consider a call chain: A → B → C → D.
D fails 50% of requests
C retries failed D calls (2 attempts each)
B retries failed C calls (2 attempts each)
A retries failed B calls (2 attempts each)
If every layer retries twice, a single user request can generate:
User → A: 1 request
A → B: up to 2 requests
B → C: up to 4 requests
C → D: up to 8 requests

One user request generates eight requests to D. If you have 1000 concurrent users, D receives 8000 requests. This is called retry amplification, and it turns minor problems into major outages.
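The arithmetic generalizes: the worst-case load on the deepest service is the product of the attempt counts at every layer above it. A tiny helper to make that explicit (the function name is mine):

```python
def worst_case_fanout(attempts_per_layer):
    """Worst-case number of requests reaching the deepest service:
    the product of per-layer attempt counts (1 original + retries each)."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

# A -> B -> C -> D with 2 attempts at each of the three hops:
# worst_case_fanout([2, 2, 2]) == 8
```

This is why "just 2 attempts" per layer is deceptive: amplification is exponential in the depth of the call chain, not linear.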
Solution: Retry Budget
Instead of “retry up to 2 times per request”, think “retry up to 10% of total requests”.
import threading
import time

class RetryBudget:
    """
    Limits retries to a percentage of total requests.

    If we are making 1000 requests/second and the budget is 0.1 (10%),
    we allow at most 100 retries/second.
    """
    def __init__(self, budget_ratio=0.1, window_seconds=10):
        self.budget_ratio = budget_ratio
        self.window_seconds = window_seconds
        self.requests = []
        self.retries = []
        self.lock = threading.Lock()

    def record_request(self):
        """Call this for every request (including retries)."""
        with self.lock:
            now = time.time()
            self.requests.append(now)
            self._cleanup(now)

    def can_retry(self) -> bool:
        """Returns True if the retry budget allows another retry."""
        with self.lock:
            now = time.time()
            self._cleanup(now)
            total_requests = len(self.requests)
            total_retries = len(self.retries)
            if total_requests == 0:
                return True
            current_ratio = total_retries / total_requests
            return current_ratio < self.budget_ratio

    def record_retry(self):
        """Call this when performing a retry."""
        with self.lock:
            self.retries.append(time.time())

    def _cleanup(self, now):
        """Remove old entries outside the window."""
        cutoff = now - self.window_seconds
        self.requests = [t for t in self.requests if t > cutoff]
        self.retries = [t for t in self.retries if t > cutoff]

# Usage
retry_budget = RetryBudget(budget_ratio=0.1)

def make_request_with_budget(func):
    retry_budget.record_request()
    try:
        return func()
    except RetryableError:
        if retry_budget.can_retry():
            retry_budget.record_retry()
            retry_budget.record_request()
            return func()  # One retry
        else:
            raise  # Budget exhausted, fail fast

Now, under heavy failure conditions, retries are capped at 10% of traffic instead of multiplying uncontrollably.
Circuit Breakers
Retries assume that failure is transient: try again and it might work. But what if the downstream is completely down? Retrying just wastes resources and delays the inevitable failure.
Circuit Breakers detect sustained failures and “trip”, rejecting requests immediately without even attempting the call.
import time
import threading
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation, requests flow through
    OPEN = "open"            # Failing, reject requests immediately
    HALF_OPEN = "half_open"  # Testing if the service recovered

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold=5,    # Failures before opening
        recovery_timeout=30.0,  # Seconds before trying again
        success_threshold=2     # Successes to close again
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = threading.Lock()

    def call(self, func):
        """Execute a function through the circuit breaker."""
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit breaker is open")
        try:
            result = func()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            raise

    def _should_attempt_reset(self):
        """Check if enough time has passed to try again."""
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.recovery_timeout

    def _record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    self.success_count = 0

    def _record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
                self.success_count = 0
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

The circuit breaker has three states:
1. CLOSED
This is normal operation. Requests go through and failures are counted.
2. OPEN
Too many failures occurred. Requests are rejected immediately with CircuitOpenError. This is “fail fast”: don’t waste time on a request that will probably fail.
3. HALF_OPEN
After the recovery timeout, we allow a few requests through. If they succeed, close the circuit. If they fail, reopen it.
This approach protects both your service (don’t exhaust resources on doomed requests) and the downstream service (don’t pile more load on a struggling system).
Putting It All Together: A Production-Ready Client
Below is how these patterns combine in a real HTTP client:
import time
import random
import requests

class ResilientClient:
    def __init__(
        self,
        base_url,
        connect_timeout=3.0,
        read_timeout=10.0,
        max_retries=3,
        retry_budget_ratio=0.1,
        circuit_failure_threshold=5,
        circuit_recovery_timeout=30.0
    ):
        self.base_url = base_url
        self.connect_timeout = connect_timeout
        self.read_timeout = read_timeout
        self.max_retries = max_retries
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=circuit_failure_threshold,
            recovery_timeout=circuit_recovery_timeout
        )
        self.retry_budget = RetryBudget(budget_ratio=retry_budget_ratio)

    def get(self, path, deadline=None, **kwargs):
        """
        Make a GET request with the full set of resilience patterns.

        Args:
            path: URL path to request
            deadline: Absolute timestamp by which the request must complete
            **kwargs: Additional arguments to requests.get
        """
        deadline = deadline or (time.time() + self.read_timeout + 5)
        url = f"{self.base_url}{path}"

        def make_request():
            remaining = deadline - time.time()
            if remaining <= 0:
                raise DeadlineExceeded("Request deadline exceeded")
            timeout = (
                self.connect_timeout,
                min(self.read_timeout, remaining)
            )
            response = requests.get(url, timeout=timeout, **kwargs)
            response.raise_for_status()
            return response

        return self._execute_with_retry(make_request, deadline)

    def _execute_with_retry(self, func, deadline):
        """Execute a function with retry logic, through the circuit breaker."""
        last_exception = None
        for attempt in range(self.max_retries):
            self.retry_budget.record_request()
            try:
                return self.circuit_breaker.call(func)
            except CircuitOpenError:
                raise  # Don't retry if the circuit is open
            except (requests.Timeout, requests.ConnectionError) as e:
                last_exception = e
                # Check if we should retry
                if attempt == self.max_retries - 1:
                    raise
                if not self.retry_budget.can_retry():
                    raise RetryBudgetExhausted("Retry budget exhausted")
                if time.time() >= deadline:
                    raise DeadlineExceeded("No time for retry")
                # Backoff with jitter
                self.retry_budget.record_retry()
                max_wait = min(1.0 * (2 ** attempt), deadline - time.time())
                if max_wait > 0:
                    time.sleep(random.uniform(0, max_wait))
            except requests.HTTPError as e:
                # Don't retry client errors (4xx)
                if 400 <= e.response.status_code < 500:
                    raise
                last_exception = e  # Retry server errors (5xx)
        raise last_exception

Features of this client:
Uses appropriate timeouts for connection vs read
Propagates deadlines to bound total operation time
Retries with exponential backoff and jitter
Respects retry budgets to prevent amplification
Uses circuit breaker to fail fast on sustained failures
Distinguishes retryable errors (timeouts, 5xx) from non-retryable (4xx)
Checklist: Get Your Timeouts Right
Before deploying any service that makes network calls, use this checklist:
Timeouts
Connection timeout is short (1-5 seconds)
Read timeout matches the actual performance of the downstream service
Overall deadline bounds total user-facing latency
Timeouts are configured per-dependency, not globally
Retries
Using exponential backoff (not immediate retry)
Using jitter (not synchronized backoff)
Retry budget caps total retry ratio
Only retrying idempotent operations (or using idempotency keys)
Not retrying client errors (4xx)
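On the idempotency-key point: one common pattern is to attach a client-generated key to a write so the server can deduplicate retried attempts of the same logical operation. A sketch (the exact header name varies by API; Stripe, for example, uses Idempotency-Key):

```python
import uuid

def idempotent_headers():
    """Generate one key per logical operation. Reuse the same key across
    retries of that operation so the server can deduplicate the write."""
    return {"Idempotency-Key": str(uuid.uuid4())}

# Generate the key ONCE, then reuse it on every retry:
# headers = idempotent_headers()
# for attempt in range(3):
#     requests.post(url, json=payload, headers=headers)  # same key each time
```

The critical detail is generating the key once per logical request, not once per attempt; a fresh key on each retry defeats the deduplication entirely.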
Circuit Breakers
Circuit breaker on external dependencies
Appropriate failure threshold (not too sensitive, not too tolerant)
Recovery timeout gives downstream time to recover
Monitoring/alerting when circuit opens
Observability
Logging timeout values in timeout errors
Metrics on retry rate and circuit breaker state
Dashboards showing p99 latency per dependency
Takeaway
Distributed systems fail; it’s their fundamental property. Networks are unreliable, services get overloaded, and dependencies go down.
Your job is to handle this failure gracefully. Short timeouts mean you find out fast. Exponential backoff with jitter means you don’t make problems worse. Deadlines mean users get errors in seconds, not minutes. Retry budgets mean one failing service doesn’t cascade everywhere. Circuit breakers mean you stop trying when it’s hopeless.
This is not complicated at all. All the concepts you need to know fit in one article. But the difference between systems that implement these patterns and systems that don’t is the difference between a 5-minute incident and a multi-hour outage.
Get your timeouts right. Your on-call rotation will thank you.
What’s the worst timeout-related incident you have seen? I collect these stories because they teach more than any architecture diagram.
Hit reply and tell me. I read every response.
If this was useful, forward it to your team. These patterns should be muscle memory for every engineer building distributed systems. And if you want more deep dives like this, subscribe to The Main Thread, one practical engineering essay per week.
