Every system has weak points. Every dependency can fail. The discipline of building reliable systems isn't about preventing failure; it's about choosing how it spreads when it happens.
"We have 99.9% availability" sounds wonderful. Marketing departments love three nines. Now multiply: your service depends on 5 other services, each also at 99.9%. If every request needs all five and they fail independently, your effective availability is 0.999^5 ≈ 99.5%. That sounds close to 99.9% until you convert: 99.9% is about 8.8 hours of downtime per year, while 99.5% is about 44 hours per year. Five times worse, just from a few innocent-looking dependencies.
Each extra nine costs roughly 10× more than the previous one. Getting from three nines to four nines might mean spending the engineering budget twice over on automation, redundancy, and runbooks. Getting from four to five is where companies hire entire SRE teams. You don't decide to be five-nines; you build a system that can support it for a specific reason. Most consumer apps target three nines. Most banks target four. Telephone networks target five and very few others do.
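To make the availability arithmetic above concrete, here is a minimal JavaScript sketch. It assumes every dependency is required on each request and that failures are independent, which real systems only approximate. The function names are illustrative and not taken from any library.

```javascript
// Availability of a request path that needs every dependency to be up,
// assuming failures are independent (a simplifying assumption).
function serialAvailability(perDependency, count) {
  return Math.pow(perDependency, count);
}

// Expected downtime in hours over a 365-day year.
function downtimeHoursPerYear(availability) {
  const HOURS_PER_YEAR = 365 * 24; // 8760
  return (1 - availability) * HOURS_PER_YEAR;
}

const combined = serialAvailability(0.999, 5);
console.log(combined.toFixed(4));                       // 0.9950
console.log(downtimeHoursPerYear(0.999).toFixed(1));    // 8.8
console.log(downtimeHoursPerYear(combined).toFixed(1)); // 43.7
```

Changing the count or the per-dependency figure shows how quickly a long dependency chain erodes the headline number.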
So what makes one system three-nine reliable and another two-nine? Three things, in order of importance: (1) eliminate single points of failure, (2) fail gracefully when things break, and (3) don't make your own failures worse by retrying badly. The next three sections walk through each.
A single point of failure (SPOF) is any component whose death takes the whole system down. The single database. The one load balancer. The shared message queue. The cron host. Finding SPOFs is a discipline — you walk through every component and ask "if this dies, what happens?" If the answer is "everything stops," it's a SPOF. The fix is almost always the same: have more than one of it.
The list of SPOFs to hunt down is long: load balancers (run multiple, with DNS failover), databases (primary + replica, M.09 again), caches (Redis Sentinel / Cluster), cron jobs (use a scheduler with leader election), deployment hosts (don't deploy from one workstation), and DNS itself (multiple nameservers). The work is rarely glamorous but it's where the actual nines come from. Every component, twice.
One subtle SPOF that catches teams: the entire region or availability zone. Running 5 web servers in us-east-1a is great until that AZ has a power event. Spread across 2 or 3 AZs and your service survives a whole data center going dark. Multi-region is harder (latency, consistency tradeoffs) but for top-tier reliability it's the same logic taken further.
Most production outages come not from your code crashing but from someone else's code crashing: a downstream service, a third-party API, a database under load. If your service handles those failures gracefully, an "outage" looks like "degraded experience for one feature." If it doesn't, the same trigger takes down your whole site. Three patterns handle the large majority of these cases. Learn them once, use them forever.
Every network call gets a deadline. In pseudocode, fetch(url, timeout: 5s), and it is not optional. Many HTTP clients default to no timeout at all, which is wrong. Always set explicit timeouts shorter than the latency target in your own SLO; otherwise a slow downstream service holds your threads hostage indefinitely.
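One way to enforce such a deadline in JavaScript is sketched below. The helper name withTimeout and the demo calls are hypothetical. The sketch uses the standard AbortController, so an operation that supports cancellation (fetch accepts a `signal` option, for example) can stop its work when the deadline passes.

```javascript
// Runs `operation(signal)` and rejects if it has not settled within `ms`.
// The AbortSignal lets the operation cancel its underlying work; it can be
// forwarded to fetch as `{ signal }`, for instance.
async function withTimeout(operation, ms) {
  const controller = new AbortController();
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error(`timed out after ${ms} ms`));
    }, ms);
  });
  try {
    return await Promise.race([operation(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // don't leave the timer running once we have settled
  }
}

// Hypothetical downstream calls, used only for this demo.
const slowCall = () => new Promise((resolve) => setTimeout(() => resolve("late"), 200));
const fastCall = async () => "ok";

withTimeout(fastCall, 50).then((value) => console.log(value));          // ok
withTimeout(slowCall, 50).catch((err) => console.log(err.message));     // timed out after 50 ms
```

The caller gets an answer within the deadline either way, so a slow downstream can no longer pin a worker indefinitely.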
Transient errors (network blip, brief overload) often clear in milliseconds. A single retry recovers most of them. But retries need backoff (wait longer between attempts) and jitter (randomize the wait) — otherwise all your clients retry simultaneously and crush the recovering service. We'll see that in §05.
When a downstream is clearly failing, stop trying it for a while. After N failures, the circuit "opens" — subsequent calls fail fast instead of waiting for timeout. After a cooldown, try one call (half-open). If it works, close the circuit. This is what the lab below will show.
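The state machine described above (closed, open, half-open) can be sketched in a few lines of JavaScript. This is a simplified illustration, not a production library: the class name, option names, and default values are assumptions, and it does not limit how many trial calls run concurrently in the half-open state.

```javascript
// Minimal circuit breaker: CLOSED -> OPEN after `threshold` consecutive
// failures, OPEN -> HALF_OPEN after `cooldownMs`, then a trial call decides.
class CircuitBreaker {
  constructor({ threshold = 3, cooldownMs = 1000, now = () => Date.now() } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, handy for tests
    this.state = "CLOSED";
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(operation) {
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: failing fast");
      }
      this.state = "HALF_OPEN"; // cooldown elapsed: let a trial call through
    }
    try {
      const result = await operation();
      this.state = "CLOSED"; // any success closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}

// Demo with a fake clock and a downstream that always fails.
let t = 0;
const breaker = new CircuitBreaker({ threshold: 2, cooldownMs: 100, now: () => t });
const failing = async () => { throw new Error("downstream error"); };

(async () => {
  for (let i = 0; i < 3; i++) {
    // Logs "downstream error" twice, then "circuit open: failing fast".
    await breaker.call(failing).catch((e) => console.log(e.message));
  }
  t = 150; // the cooldown has passed
  console.log(await breaker.call(async () => "recovered")); // recovered
  console.log(breaker.state); // CLOSED
})();
```

The third call never reaches the downstream at all, which is the point: during sustained failure the caller gets an immediate error instead of waiting out another timeout.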
The mental model: timeouts protect you from slow failures, retries help with brief failures, circuit breakers protect you from sustained failures. They compose. You set a timeout (so calls don't hang), wrap the call in a retry-with-backoff (for transient errors), and put a circuit breaker around the whole thing (so sustained failure of the downstream doesn't drag you down). All three layers, every external call.
Below: an API server with a thread pool of 8, handling requests that need either the database (left) or the payment service (right). Hit Kill payment and watch what happens without protection: payment calls hang waiting for a response, threads stay busy, and eventually nothing else can be served either. Then toggle Enable protection: timeouts, circuit breaker, fallback. Same failure, totally different system behavior.
Baseline. Requests flow through the API to DB (70% of traffic) or Payment (30%). All 8 threads cycle quickly. Click "Kill payment" to inject a failure — then watch the system behavior shift dramatically depending on whether protection is enabled.
Retries seem obviously helpful — try again if it failed. But naive retries are how brief outages become long ones. The pattern: a downstream service hiccups; thousands of clients all retry immediately; the recovering service gets 3× its normal load right when it's weakest; it crashes again. This is called a retry storm or thundering herd, and it's one of the most common ways teams accidentally extend their own outages.
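// Naive retry (anti-pattern): no delay between attempts and no attempt limit.
async function naiveRetry(call) {
  while (true) {
    const result = await call();
    if (result.ok) return result;
    // try again immediately
  }
}

// Better: a bounded number of attempts with exponential backoff and jitter.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const random = (min, max) => min + Math.random() * (max - min);

async function retryWithBackoff(call, maxAttempts = 5) {
  let delay = 100; // ms
  for (let i = 0; i < maxAttempts; i++) {
    const r = await call();
    if (r.ok) return r;
    if (i === maxAttempts - 1) break; // no point sleeping after the last attempt
    await sleep(delay + random(0, delay)); // base delay plus up to 100% jitter
    delay *= 2; // base delays: 100, 200, 400, 800
  }
  throw new Error("gave up");
}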
The jitter detail matters enormously. Without it, if 1000 clients all fail at time T, they'd all wait exactly 100ms and retry at T+100, causing another simultaneous wave. With jitter, they retry across a spread of 100-200ms, smoothing the load. This is the same idea we touched on with DNS TTLs — randomization prevents synchronized stampedes.
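The effect is easy to see in a small simulation. The sketch below uses the same first-retry delay as the backoff loop above (a 100 ms base plus up to 100% jitter) and counts how many of 1000 clients land in each 20 ms window. The function names and bucket size are arbitrary choices for illustration.

```javascript
// First-retry delay for one client: the base delay alone, or the base
// delay plus up to 100% random jitter.
function firstRetryDelay(baseMs, useJitter, rand = Math.random) {
  return useJitter ? baseMs + rand() * baseMs : baseMs;
}

// Counts how many of `clients` retries land in each 20 ms bucket.
function retryHistogram(clients, baseMs, useJitter) {
  const buckets = new Map();
  for (let i = 0; i < clients; i++) {
    const delay = firstRetryDelay(baseMs, useJitter);
    const bucket = Math.floor(delay / 20) * 20;
    buckets.set(bucket, (buckets.get(bucket) || 0) + 1);
  }
  return buckets;
}

// Without jitter: one bucket at 100 ms holding all 1000 retries.
console.log(retryHistogram(1000, 100, false));
// With jitter: roughly 200 retries in each bucket from 100 to 180 ms.
console.log(retryHistogram(1000, 100, true));
```

The total work is the same in both runs; jitter only changes when it arrives, which is what gives the recovering service room to breathe.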
An operation is idempotent if running it twice has the same effect as running it once. GET /user/42 is idempotent (reading the same user twice gives the same answer). POST /charge is not idempotent — charging twice means charging twice. You can only safely retry idempotent operations.
For operations that aren't naturally idempotent (payments, signups, sending emails), the standard fix is an idempotency key: the client generates a unique ID and includes it; the server records "I've processed this key already" so duplicate requests return the original result instead of re-running. Stripe's API is the canonical example. Always design retry-prone operations to be idempotent.
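A minimal sketch of the server side of that idea follows. The handler, the key format, and the in-memory Map are all hypothetical stand-ins; this is not Stripe's actual API, and a real service would persist keys in a durable store and guard against two requests with the same key arriving at once.

```javascript
// In-memory stand-in for the server's record of processed keys.
// A real service would persist this, typically with an expiry policy.
const processed = new Map();
let chargesMade = 0;

// Hypothetical handler for POST /charge.
function handleCharge(idempotencyKey, amountCents) {
  if (processed.has(idempotencyKey)) {
    return processed.get(idempotencyKey); // duplicate: replay the original result
  }
  chargesMade += 1; // the side effect that must not be repeated
  const result = { chargeId: `ch_${chargesMade}`, amountCents };
  processed.set(idempotencyKey, result);
  return result;
}

const first = handleCharge("order-1001", 2500);
const retry = handleCharge("order-1001", 2500); // client retried after a timeout
console.log(first.chargeId === retry.chargeId, chargesMade); // true 1
```

Because the client chooses the key before the first attempt, it can retry as often as it needs to and the customer is still charged once.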
The complete picture for any unreliable downstream call: timeout (so it can't hang) + retry with backoff + jitter (limited to a few attempts, e.g. 3, and only for idempotent ops) + circuit breaker (stop trying during sustained failure). Each layer compounds. Build this once as a library or middleware; reuse it on every external call. Your future on-call rotations get noticeably calmer.
You'll meet these in every postmortem, every chaos-engineering exercise, every architecture review involving the word "uptime."
Phase C is yours. You can scale, observe, and survive. Now we go build — the next four modules apply everything to real systems.
The reliability mindset shows up everywhere from this point on.
Every layer, twice. Databases, caches, load balancers, AZs. Without N+1 redundancy, every reliability pattern after is decorative.
Three patterns, every external call. Timeouts so calls don't hang. Retries with backoff for transient errors. Circuit breakers for sustained failure.
Naive retries turn brief outages into long ones. Backoff + jitter + idempotency keys are how you retry without making things worse.