Hey curious engineers, welcome to the twenty-third issue of The Main Thread.
Wrong timeout settings cause more outages than bugs.
I know that sounds dramatic. But think about it: your code can be perfect, your logic flawless, your tests thorough, and your system still fails because someone set a timeout to 30 seconds when it should have been 3.
In my career, I have seen a 5-minute outage turn into a two-hour cascading failure because retry settings amplified a small problem into a catastrophe. As an on-call engineer, I have felt the pain directly, staring at dashboards showing “everything is timing out” with no idea which timeout was responsible. I have also debugged incidents where changing a single number in a config file fixed the issue.
To be honest, this stuff is not glamorous and it doesn’t get conference talks, yet it is the difference between a robust and a flaky system.
Let’s fix that.
Four Timeout Types
When someone says “the timeout is 10 seconds”, they could mean four different things. Each has different failure modes, and confusing them causes real problems.
1. Connection Timeout
This is how long you wait to establish a connection (TCP handshake). If the server is down, overloaded, or unreachable, you will encounter this timeout.
# (connect_timeout, read_timeout)
requests.get(url, timeout=(3.0, 10.0))
# The first value is the connection timeout

The typical value is 1-5 seconds. Connection establishment should be fast. If the server can’t accept your connection in 3 seconds, something is seriously wrong.
A common mistake is setting this too high. A 30-second connection timeout means your thread sits blocked for 30 seconds before discovering that the server is unreachable. Multiply that by hundreds of concurrent requests and you exhaust your thread pool.
2. Read Timeout
This is how long you wait for the server to send data once the connection is established. This covers the time from “request sent“ to the “first byte received“ and between subsequent chunks.
# The second value is the read timeout
requests.get(url, timeout=(3.0, 10.0))

Typical values depend on what you are calling. A cache lookup should respond in 50ms. A complex DB query might need 5 seconds. An ML inference call might need 30 seconds.
A common mistake I have seen engineers make is using the same read timeout for everything. Your cache client and your report-generation service have wildly different performance characteristics. You must treat them differently.
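One way to enforce per-dependency timeouts is a simple lookup table instead of a global constant. A minimal sketch (the dependency names and numbers here are illustrative, not from any real system):

```python
# Illustrative per-dependency timeouts: (connect_timeout, read_timeout)
TIMEOUTS = {
    "cache": (0.5, 0.05),        # cache lookups should be near-instant
    "database": (1.0, 5.0),      # complex queries may legitimately take seconds
    "ml_inference": (2.0, 30.0)  # model calls are slow by nature
}

def timeout_for(dependency, default=(3.0, 10.0)):
    """Look up a dependency-specific timeout instead of a global default."""
    return TIMEOUTS.get(dependency, default)

# Usage with requests:
# requests.get(url, timeout=timeout_for("cache"))
```

The point is that the timeout lives next to the dependency it describes, so a new dependency forces an explicit decision instead of inheriting a default that was tuned for something else.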
3. Write Timeout
This is how long you wait while sending data to the server. It is less common, but comes into the picture for large uploads or slow networks.
Naturally, it’s proportional to the size of the payload. A small JSON body doesn’t need much time, but a large file upload needs significantly more.
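One way to make that proportionality concrete is to derive the write timeout from the payload size and a worst-case bandwidth floor. A sketch, where the 1 Mbit/s floor (125,000 bytes/s), the base, and the cap are all illustrative assumptions:

```python
def upload_timeout(payload_bytes, min_bandwidth_bps=125_000, base=2.0, cap=120.0):
    """Scale the write timeout to the payload size.

    Assumes a worst-case usable bandwidth (125,000 bytes/s = 1 Mbit/s,
    an arbitrary illustrative floor) plus a fixed base for handshakes,
    bounded by a hard cap so one upload can never block forever.
    """
    transfer_time = payload_bytes / min_bandwidth_bps
    return min(base + transfer_time, cap)
```

A 1 KB JSON body gets essentially the base 2 seconds; a 100 MB upload gets the full cap.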
4. Overall Deadline
This is the total time allowed for an entire operation, including retries. This is the most important timeout and the most commonly forgotten.
# Without a deadline: each retry gets the full timeout
# 3 retries × 10s timeout = 30s total (worst case)

# With a deadline: total operation time is bounded
deadline = time.time() + 5.0  # 5 seconds from now
while time.time() < deadline:
    try:
        return make_request(timeout=min(2.0, deadline - time.time()))
    except TimeoutError:
        continue
raise DeadlineExceeded()

A common mistake is configuring per-request timeouts without an overall deadline. Your users don’t give a damn that each retry got 10 seconds; they give a damn that they waited 40 seconds before seeing an error.
Anatomy of a Timeout Incident
Let me walk you through how timeout misconfiguration causes real outages.
Setup
Service A calls Service B, which calls Service C. Each service has a 30-second timeout configured. Service C is actually a DB that normally responds in 50ms.
Trigger
Service C gets slow. A bad query plan, disk pressure, whatever. Response times go from 50ms to 35 seconds.
What Happens
Service C takes 35 seconds to respond (exceeding the 30-second timeout)
Service B times out, returns error to Service A
Service A’s request takes 30+ seconds, exhausting its thread pool
Service A cannot accept new requests: it’s now “down”
Users see Service A as the problem
On-call engineer debugs Service A, finds nothing wrong with its code
45 minutes later, someone checks Service C.
Cascade
Service C was slow: 1 service affected
Service B and A had long timeouts: 2 more services affected
These long timeouts caused thread pool exhaustion and users faced outage
On-call engineer investigated the wrong service first, extending the incident duration
Fix
Service A’s timeout to Service B should be 3 seconds, not 30. If the response doesn’t come within 3 seconds, fail fast and show an error to the user. It is foolish to let one slow dependency take down every damn thing.
Exponential Backoff With Jitter
When a request fails, the instinct is to retry immediately. This is almost always wrong.
If the server is overloaded, retrying immediately piles more load on an already struggling system. Imagine a thousand clients all retrying at the same instant: you get a “thundering herd” that makes recovery harder.
Exponential backoff is a technique that increases wait time between retries:
Attempt 1: wait 1 second
Attempt 2: wait 2 seconds
Attempt 3: wait 4 seconds
Attempt 4: wait 8 seconds

This approach gives the server breathing room to recover.
But there is a problem: if all clients use the same backoff schedule, they still retry in sync, just less frequently. A thousand clients all waiting “2 seconds” still produce a spike at t=2s.
Jitter adds randomness to break the synchronization:
import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0):
    """
    Retry a function with exponential backoff and full jitter.

    Args:
        func: Function to call (should raise an exception on failure)
        max_retries: Maximum retry attempts
        base_delay: Initial delay in seconds
        max_delay: Maximum delay cap

    Returns:
        Result of the successful function call

    Raises:
        The last exception if all retries are exhausted
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter
            max_wait = min(base_delay * (2 ** attempt), max_delay)
            actual_wait = random.uniform(0, max_wait)
            time.sleep(actual_wait)

Why “Full Jitter“?
We pick a random value between 0 and max_delay . This spreads retries uniformly across the backoff window. Some clients retry quickly, some wait longer, and the synchronized spikes disappear.
Let’s see the math:
Attempt 1: uniform(0, 1) → average 0.5s
Attempt 2: uniform(0, 2) → average 1.0s
Attempt 3: uniform(0, 4) → average 2.0s
Attempt 4: uniform(0, 8) → average 4.0s

Each client follows a different random path. The aggregate retry traffic smooths out instead of spiking.
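A quick way to convince yourself of those averages is to sample full jitter many times and check the mean. A small sketch (the function name is mine; seeded so the run is reproducible):

```python
import random

def full_jitter_mean(base_delay, attempt, samples=100_000, seed=42):
    """Empirically estimate the average wait for one attempt under full jitter."""
    rng = random.Random(seed)
    max_wait = base_delay * (2 ** attempt)  # the exponential backoff window
    total = sum(rng.uniform(0, max_wait) for _ in range(samples))
    return total / samples

# For attempt 2 the window is uniform(0, 4), so the mean lands near 2.0 seconds.
```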
Deadline Propagation
Here’s a scenario that catches even experienced engineers off-guard.
Your API has a 5-second deadline. It calls three services sequentially:
Service A: 2-second timeout
Service B: 2-second timeout
Service C: 2-second timeout
Total possible time: 6 seconds. But your deadline is 5 seconds.
If Service A is slow (takes 1.9s), you have burned almost 2 seconds. Service B and C now have to complete in 3.1 seconds combined. But they still think that they have 2 seconds each.
Deadline Propagation passes the remaining time budget to each downstream call:
import time
from contextlib import contextmanager

@contextmanager
def deadline_context(deadline_timestamp):
    """Context manager that tracks the remaining deadline."""
    ctx = {'deadline': deadline_timestamp}
    yield ctx

def call_with_deadline(ctx, service_func, min_timeout=0.1):
    """Call a service respecting the propagated deadline."""
    remaining = ctx['deadline'] - time.time()
    if remaining <= 0:
        raise DeadlineExceeded("No time remaining")
    timeout = max(remaining, min_timeout)
    return service_func(timeout=timeout)

# Usage
def handle_request():
    deadline = time.time() + 5.0  # 5 seconds from now
    with deadline_context(deadline) as ctx:
        # Each call gets the remaining time, not a fixed timeout
        result_a = call_with_deadline(ctx, service_a.call)
        result_b = call_with_deadline(ctx, service_b.call)
        result_c = call_with_deadline(ctx, service_c.call)
        return combine(result_a, result_b, result_c)

Now, if Service A takes 1.9 seconds, Services B and C share the remaining 3.1 seconds. If Service A takes 4 seconds, Services B and C get a total of 1 second, which might mean failing fast instead of trying.
gRPC does this automatically with its deadline propagation. When you set a deadline on a gRPC call, it is passed in headers to downstream services, and they can read the remaining time.
If you are using REST, you need to implement this yourself (often via headers like X-Deadline or X-Request-Timeout).
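A minimal sketch of the header approach might look like this. There is no standard header for deadline propagation, so the X-Deadline name (carrying an absolute Unix timestamp) and both function names are illustrative assumptions:

```python
import time

def remaining_budget(headers, default=5.0):
    """Read an absolute Unix-timestamp deadline from an incoming X-Deadline
    header and return the remaining time budget in seconds."""
    raw = headers.get("X-Deadline")
    if raw is None:
        return default  # no deadline propagated; fall back to a local default
    return float(raw) - time.time()

def forward_headers(deadline_timestamp):
    """Propagate the same absolute deadline to the next hop downstream."""
    return {"X-Deadline": str(deadline_timestamp)}
```

Propagating an absolute timestamp (rather than a relative duration) means each hop can compute its own remaining budget without knowing how much time upstream hops consumed, though it does assume reasonably synchronized clocks.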
Preventing Retry Amplification
Retries seem harmless. The request failed; try again. But in distributed systems, retries multiply.
Consider a call chain: A → B → C → D.
D fails 50% of requests
C retries failed D calls (2 attempts each)
B retries failed C calls (2 attempts each)
A retries failed B calls (2 attempts each)
If every layer retries twice, a single user request can generate:
User → A: 1 request
A → B: up to 2 requests
B → C: up to 4 requests
C → D: up to 8 requests

One user request generates eight requests to D. If you have 1000 concurrent users, D receives 8000 requests. This is called retry amplification, and it turns minor problems into major outages.
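The arithmetic generalizes: the worst-case load on the deepest service is the product of the attempt counts at every layer above it. A tiny helper to make that explicit (the function name is mine):

```python
def worst_case_fanout(attempts_per_layer):
    """Worst-case number of requests reaching the deepest service:
    the product of per-layer attempt counts (1 original + retries each)."""
    total = 1
    for attempts in attempts_per_layer:
        total *= attempts
    return total

# A -> B -> C -> D with 2 attempts at each of the three hops:
# worst_case_fanout([2, 2, 2]) == 8
```

This is why "just 2 attempts" per layer is deceptive: amplification is exponential in the depth of the call chain, not linear.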
Solution: Retry Budget
Instead of “retry up to 2 times per request”, think “retry up to 10% of total requests”.
import threading
import time

class RetryBudget:
    """
    Limits retries to a percentage of total requests.

    If we are making 1000 requests/second and the budget is 0.1 (10%),
    we allow at most 100 retries/second.
    """
    def __init__(self, budget_ratio=0.1, window_seconds=10):
        self.budget_ratio = budget_ratio
        self.window_seconds = window_seconds
        self.requests = []
        self.retries = []
        self.lock = threading.Lock()

    def record_request(self):
        """Call this for every request (including retries)."""
        with self.lock:
            now = time.time()
            self.requests.append(now)
            self._cleanup(now)

    def can_retry(self) -> bool:
        """Returns True if the retry budget allows another retry."""
        with self.lock:
            now = time.time()
            self._cleanup(now)
            total_requests = len(self.requests)
            total_retries = len(self.retries)
            if total_requests == 0:
                return True
            current_ratio = total_retries / total_requests
            return current_ratio < self.budget_ratio

    def record_retry(self):
        """Call this when performing a retry."""
        with self.lock:
            self.retries.append(time.time())

    def _cleanup(self, now):
        """Remove old entries outside the window."""
        cutoff = now - self.window_seconds
        self.requests = [t for t in self.requests if t > cutoff]
        self.retries = [t for t in self.retries if t > cutoff]

# Usage
retry_budget = RetryBudget(budget_ratio=0.1)

def make_request_with_budget(func):
    retry_budget.record_request()
    try:
        return func()
    except RetryableError:
        if retry_budget.can_retry():
            retry_budget.record_retry()
            retry_budget.record_request()
            return func()  # One retry
        else:
            raise  # Budget exhausted, fail fast

Now, under heavy failure conditions, retries are capped at 10% of traffic instead of multiplying uncontrollably.
Circuit Breakers
Retries assume that failure is transient: try again and it might work. But what if the downstream is completely down? Retrying just wastes resources and delays the inevitable failure.
Circuit Breakers detect sustained failures and “trip”, rejecting requests immediately without even attempting the call.
import time
import threading
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation, requests flow through
    OPEN = "open"            # Failing, reject requests immediately
    HALF_OPEN = "half_open"  # Testing if the service recovered

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold=5,    # Failures before opening
        recovery_timeout=30.0,  # Seconds before trying again
        success_threshold=2     # Successes to close again
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = threading.Lock()

    def call(self, func):
        """Execute a function through the circuit breaker."""
        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit breaker is open")
        try:
            result = func()
            self._record_success()
            return result
        except Exception:
            self._record_failure()
            raise

    def _should_attempt_reset(self):
        """Check if enough time has passed to try again."""
        if self.last_failure_time is None:
            return True
        return time.time() - self.last_failure_time >= self.recovery_timeout

    def _record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    self.success_count = 0

    def _record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
                self.success_count = 0
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

The circuit breaker has three states:
1. CLOSED
This is normal operation. Requests go through and failures are counted.
2. OPEN
Too many failures occurred. Requests are rejected immediately with CircuitOpenError. This is “fail fast”: don’t waste time on a request that will probably fail.
3. HALF_OPEN
After the recovery timeout, we allow a few requests through. If they succeed, close the circuit. If they fail, reopen it.
This approach protects both your service (don’t exhaust resources on doomed requests) and the downstream service (don’t pile more load on a struggling system).
Putting It All Together: A Production-Ready Client
Below is how these patterns combine in a real HTTP client:
import time
import random
import requests

class ResilientClient:
    def __init__(
        self,
        base_url,
        connect_timeout=3.0,
        read_timeout=10.0,
        max_retries=3,
        retry_budget_ratio=0.1,
        circuit_failure_threshold=5,
        circuit_recovery_timeout=30.0
    ):
        self.base_url = base_url
        self.connect_timeout = connect_timeout
        self.read_timeout = read_timeout
        self.max_retries = max_retries
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=circuit_failure_threshold,
            recovery_timeout=circuit_recovery_timeout
        )
        self.retry_budget = RetryBudget(budget_ratio=retry_budget_ratio)

    def get(self, path, deadline=None, **kwargs):
        """
        Make a GET request with the full set of resilience patterns.

        Args:
            path: URL path to request
            deadline: Absolute timestamp by which the request must complete
            **kwargs: Additional arguments to requests.get
        """
        deadline = deadline or (time.time() + self.read_timeout + 5)
        url = f"{self.base_url}{path}"

        def make_request():
            remaining = deadline - time.time()
            if remaining <= 0:
                raise DeadlineExceeded("Request deadline exceeded")
            timeout = (
                self.connect_timeout,
                min(self.read_timeout, remaining)
            )
            response = requests.get(url, timeout=timeout, **kwargs)
            response.raise_for_status()
            return response

        return self._execute_with_retry(make_request, deadline)

    def _execute_with_retry(self, func, deadline):
        """Execute a function with retry logic, through the circuit breaker."""
        last_exception = None
        for attempt in range(self.max_retries):
            self.retry_budget.record_request()
            try:
                return self.circuit_breaker.call(func)
            except CircuitOpenError:
                raise  # Don't retry if the circuit is open
            except (requests.Timeout, requests.ConnectionError) as e:
                last_exception = e
                # Check if we should retry
                if attempt == self.max_retries - 1:
                    raise
                if not self.retry_budget.can_retry():
                    raise RetryBudgetExhausted("Retry budget exhausted")
                if time.time() >= deadline:
                    raise DeadlineExceeded("No time for retry")
                # Backoff with jitter
                self.retry_budget.record_retry()
                max_wait = min(1.0 * (2 ** attempt), deadline - time.time())
                if max_wait > 0:
                    time.sleep(random.uniform(0, max_wait))
            except requests.HTTPError as e:
                # Don't retry client errors (4xx)
                if 400 <= e.response.status_code < 500:
                    raise
                last_exception = e  # Retry server errors (5xx)
        raise last_exception

Features of this client:
Uses appropriate timeouts for connection vs read
Propagates deadlines to bound total operation time
Retries with exponential backoff and jitter
Respects retry budgets to prevent amplification
Uses circuit breaker to fail fast on sustained failures
Distinguishes retryable errors (timeouts, 5xx) from non-retryable (4xx)
Checklist: Get Your Timeouts Right
Before deploying any service that makes network calls, use this checklist:
Timeouts
Connection timeout is short (1-5 seconds)
Read timeout matches the actual performance of the downstream service
Overall deadline bounds total user-facing latency
Timeouts are configured per-dependency, not globally
Retries
Using exponential backoff (not immediate retry)
Using jitter (not synchronized backoff)
Retry budget caps total retry ratio
Only retrying idempotent operations (or using idempotency keys)
Not retrying client errors (4xx)
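On the idempotency-key point: one common pattern is to attach a client-generated key to a write so the server can deduplicate retried attempts of the same logical operation. A sketch (the exact header name varies by API; Stripe, for example, uses Idempotency-Key):

```python
import uuid

def idempotent_headers():
    """Generate one key per logical operation. Reuse the same key across
    retries of that operation so the server can deduplicate the write."""
    return {"Idempotency-Key": str(uuid.uuid4())}

# Generate the key ONCE, then reuse it on every retry:
# headers = idempotent_headers()
# for attempt in range(3):
#     requests.post(url, json=payload, headers=headers)  # same key each time
```

The critical detail is generating the key once per logical request, not once per attempt; a fresh key on each retry defeats the deduplication entirely.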
Circuit Breakers
Circuit breaker on external dependencies
Appropriate failure threshold (not too sensitive, not too tolerant)
Recovery timeout gives downstream time to recover
Monitoring/alerting when circuit opens
Observability
Logging timeout values in timeout errors
Metrics on retry rate and circuit breaker state
Dashboards showing p99 latency per dependency
Takeaway
Distributed systems fail; it’s their fundamental property. Networks are unreliable, services get overloaded, and dependencies go down.
Your job is to handle this failure gracefully. Short timeouts mean you find out fast. Exponential backoff with jitter means you don’t make problems worse. Deadlines mean users get errors in seconds, not minutes. Retry budgets mean one failing service doesn’t cascade everywhere. Circuit breakers mean you stop trying when it’s hopeless.
This is not complicated at all. All the concepts you need to know fit in one article. But the difference between systems that implement these patterns and systems that don’t is the difference between a 5-minute incident and a multi-hour outage.
Get your timeouts right. Your on-call rotation will thank you.
What’s the worst timeout-related incident you have seen? I collect these stories because they teach more than any architecture diagram.
Hit reply and tell me. I read every response.
If this was useful, forward it to your team. These patterns should be muscle memory for every engineer building distributed systems. And if you want more deep dives like this, subscribe to The Main Thread, one practical engineering essay per week.
