Reliability Engineering — SLAs, SLOs, Error Budgets, Failover

§ 01 — The math of nines

99% sounds great
until you do the math.

"We have 99.9% availability" sounds wonderful. Marketing departments love three nines. Now multiply: your service depends on 5 other services, each also at 99.9%. If a failure in any one of them takes you down, and they fail independently, your effective availability is at best 0.999⁵ ≈ 99.5%. That sounds close to 99.9% until you convert: 99.9% is 8.76 hours of downtime per year; 99.5% is roughly 44 hours per year. Five times worse, just from a few innocent-looking dependencies.

// AVAILABILITY % → ACTUAL DOWNTIME PER YEAR

90%        36.5 days
99%        3.65 days
99.9%      8.76 hours
99.99%     52.6 min
99.999%    5.26 min
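The compounding and the conversion to downtime can be checked with a few lines of arithmetic. This is a minimal sketch, assuming every dependency must be up for the service to be up and that failures are independent; the function names are illustrative, not from any library.

```javascript
const HOURS_PER_YEAR = 365 * 24; // 8760, ignoring leap years

// Availability of a chain where every link must be up.
// Assumes the links fail independently of each other.
function serialAvailability(availabilities) {
  return availabilities.reduce((product, a) => product * a, 1);
}

// Expected downtime in hours per year for an availability between 0 and 1.
function downtimeHoursPerYear(availability) {
  return (1 - availability) * HOURS_PER_YEAR;
}

const single = 0.999;
const chain = serialAvailability(Array(5).fill(0.999));

console.log("one service, hours/year:", downtimeHoursPerYear(single).toFixed(2));
console.log("five in series, availability:", (chain * 100).toFixed(2) + "%");
console.log("five in series, hours/year:", downtimeHoursPerYear(chain).toFixed(1));
```

Passing different values into `serialAvailability` shows how quickly one weak dependency dominates the product.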

Each extra nine costs roughly 10× more than the previous one. Getting from three nines to four nines might mean spending the engineering budget twice over on automation, redundancy, and runbooks. Getting from four to five is where companies hire entire SRE teams. You don't decide to be five-nines; you build a system that can support it for a specific reason. Most consumer apps target three nines. Most banks target four. Telephone networks target five and very few others do.

Reliability isn't a feature you add. It's a property that emerges from how you handle failure at every layer.

So what makes one system three-nine reliable and another two-nine? Three things, in order of importance: (1) eliminate single points of failure, (2) fail gracefully when things break, and (3) don't make your own failures worse by retrying badly. The next three sections walk through each.

§ 02 — Eliminating single points of failure

If one box can
break everything.

A single point of failure (SPOF) is any component whose death takes the whole system down. The single database. The one load balancer. The shared message queue. The cron host. Finding SPOFs is a discipline — you walk through every component and ask "if this dies, what happens?" If the answer is "everything stops," it's a SPOF. The fix is almost always the same: have more than one of it.

// SPOF vs REDUNDANCY · SAME ARCHITECTURE, TWO FUTURES

// SINGLE POINT OF FAILURE

One database, one fate

Three healthy app servers, but they all depend on one DB. DB fails → everything fails. The horizontal scaling at the app tier achieved nothing for reliability because the single point of failure sits one layer down.

// REDUNDANT · N+1

Primary + replica, automatic failover

DB has a hot standby. If primary dies, traffic shifts to replica in seconds. The pattern: N+1. Need 3 servers for capacity? Run 4. Need 1 database? Run 2 with replication. Cost: a bit more money. Benefit: an outage stops being an existential event.
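Why a second copy helps so much can be sketched numerically. This is a rough illustration that assumes the copies fail independently and that failover itself is instant and never fails; real systems fall short of both, so treat the result as an upper bound. The function name is illustrative.

```javascript
const MINUTES_PER_YEAR = 365 * 24 * 60;

// Availability of N redundant copies where any single copy being up is enough.
// Assumption: independent failures and perfect, instant failover.
function redundantAvailability(single, copies) {
  return 1 - Math.pow(1 - single, copies);
}

const oneDb = redundantAvailability(0.999, 1);
const twoDbs = redundantAvailability(0.999, 2);

console.log("1 database, minutes down/year:", ((1 - oneDb) * MINUTES_PER_YEAR).toFixed(1));
console.log("2 databases, minutes down/year:", ((1 - twoDbs) * MINUTES_PER_YEAR).toFixed(2));
```

Correlated failures (same rack, same AZ, same bad deploy) break the independence assumption, which is why where the second copy runs matters as much as having it.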

The list of SPOFs to hunt down is long: load balancers (run multiple, with DNS failover), databases (primary + replica, M.09 again), caches (Redis Sentinel / Cluster), cron jobs (use a scheduler with leader election), deployment hosts (don't deploy from one workstation), and DNS itself (multiple nameservers). The work is rarely glamorous but it's where the actual nines come from. Every component, twice.

One subtle SPOF that catches teams: the entire region or availability zone. Running 5 web servers in us-east-1a is great until that AZ has a power event. Spread across 2 or 3 AZs and your service survives a whole data center going dark. Multi-region is harder (latency, consistency tradeoffs) but for top-tier reliability it's the same logic taken further.

§ 03 — Three patterns for graceful failure

When something else
breaks, don't hang.

Most production outages come not from your code crashing but from something else's code crashing — a downstream service, a third-party API, a database under load. If your service handles those failures gracefully, an "outage" looks like "degraded experience for one feature." If it doesn't, the same trigger takes down your whole site. Three patterns handle 90% of this. Learn them once, use them forever.

// PATTERN 1

Timeouts

"Never wait forever"

Every network call gets a deadline. fetch(url, timeout: 5s) is pseudocode, but the deadline is not optional. Many HTTP clients default to no timeout at all, which is wrong for production. Always set explicit timeouts shorter than your own latency SLO; otherwise a slow downstream service holds your threads hostage indefinitely.
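One way to give any promise-returning call a deadline is sketched below. `withTimeout` and `TimeoutError` are hypothetical helper names, not part of any library. Note that this version only stops waiting; it does not cancel the underlying work. For `fetch` specifically, passing `signal: AbortSignal.timeout(5000)` in the options aborts the request itself in runtimes that support it.

```javascript
class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = "TimeoutError";
  }
}

// Reject with TimeoutError if `work` has not settled within `ms` milliseconds.
function withTimeout(work, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(`timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Usage: a call that would take 200 ms is cut off at 50 ms.
const slowCall = new Promise((resolve) => setTimeout(() => resolve("late"), 200));
withTimeout(slowCall, 50)
  .then((value) => console.log("got", value))
  .catch((err) => console.log(err.name, err.message));
```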

// PATTERN 2

Retries

"Try again, but smarter"

Transient errors (network blip, brief overload) often clear in milliseconds. A single retry recovers most of them. But retries need backoff (wait longer between attempts) and jitter (randomize the wait) — otherwise all your clients retry simultaneously and crush the recovering service. We'll see that in §05.

// PATTERN 3

Circuit breakers

"Stop calling what's broken"

When a downstream is clearly failing, stop trying it for a while. After N failures, the circuit "opens" — subsequent calls fail fast instead of waiting for timeout. After a cooldown, try one call (half-open). If it works, close the circuit. This is what the lab below will show.
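The three states described above can be sketched as a small class. This is a simplified illustration, not a production breaker: it counts consecutive failures, and in the half-open state it does not limit concurrent trial calls. The class name, option names, and default values are all assumptions; `now` is injectable so the cooldown can be tested without waiting.

```javascript
class CircuitOpenError extends Error {}

class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.state = "closed";
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        // Still cooling down: fail fast without touching the downstream.
        throw new CircuitOpenError("circuit open, failing fast");
      }
      this.state = "half-open"; // cooldown over: let a trial call through
    }
    try {
      const result = await fn();
      this.state = "closed"; // success closes the circuit and resets the count
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial call, or too many consecutive failures, opens the circuit.
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A caller would create one breaker per downstream and route every call through `breaker.call(() => doRequest())`, treating `CircuitOpenError` as the cue to serve a fallback.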

The mental model: timeouts protect you from slow failures, retries help with brief failures, circuit breakers protect you from sustained failures. They compose. You set a timeout (so calls don't hang), wrap the call in a retry-with-backoff (for transient errors), and put a circuit breaker around the whole thing (so sustained failure of the downstream doesn't drag you down). All three layers, every external call.

§ 04 — The cascading failure simulator · interactive lab

Now watch
one death become many.

Below: an API server with a thread pool of 8, handling requests that need either the database (left) or the payment service (right). Hit Kill payment and watch what happens without protection: payment calls hang, threads stay busy, and eventually nothing else can be served either. Then toggle Enable protection: timeouts, circuit breaker, fallback. Same failure, totally different system behavior.

CASCADE.SIM // m.16 lab

[Interactive simulator. Live metrics shown: success rate, threads busy (out of 8), total served, total failed, and circuit breaker state. Request rates per second are shown for DB-only and payment requests.]

// VERDICT

All systems healthy

Baseline. Requests flow through the API to DB (70% of traffic) or Payment (30%). All 8 threads cycle quickly. Click "Kill payment" to inject a failure — then watch the system behavior shift dramatically depending on whether protection is enabled.
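The thread-pool exhaustion can be estimated without the simulator using Little's law: average busy threads equals arrival rate times how long each request holds a thread. The sketch below uses the pool size of 8 and the 70/30 traffic split from the description; the request rate and all latencies are assumed numbers chosen only for illustration.

```javascript
const POOL_SIZE = 8; // from the lab description

// Little's law: busy threads = sum over request types of (rate x hold time).
function busyThreads(requestsPerSec, mix) {
  return mix.reduce(
    (total, { share, holdSeconds }) => total + requestsPerSec * share * holdSeconds,
    0
  );
}

const RATE = 10; // requests per second (assumed)
const db = { share: 0.7, holdSeconds: 0.02 }; // DB-only requests, 20 ms (assumed)

const scenarios = {
  healthy: busyThreads(RATE, [db, { share: 0.3, holdSeconds: 0.05 }]),
  paymentHangs: busyThreads(RATE, [db, { share: 0.3, holdSeconds: 30 }]), // no timeout, 30 s hang (assumed)
  oneSecondTimeout: busyThreads(RATE, [db, { share: 0.3, holdSeconds: 1 }]),
  breakerOpen: busyThreads(RATE, [db, { share: 0.3, holdSeconds: 0.001 }]), // fail fast
};

for (const [name, needed] of Object.entries(scenarios)) {
  const verdict = needed > POOL_SIZE ? "pool exhausted" : "fits in pool";
  console.log(`${name}: needs ~${needed.toFixed(2)} threads, ${verdict}`);
}
```

Under these assumed numbers, the hung payment calls alone demand far more threads than the pool has, which is why DB-only requests also stop being served.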

§ 05 — Retry storms & idempotency

Retries are great
until they're not.

Retries seem obviously helpful — try again if it failed. But naive retries are how brief outages become long ones. The pattern: a downstream service hiccups; thousands of clients all retry immediately; the recovering service gets 3× its normal load right when it's weakest; it crashes again. This is called a retry storm or thundering herd, and it's one of the most common ways teams accidentally extend their own outages.

// TWO WAYS TO RETRY · ONE IS A BAD IDEA

// NAIVE — DON'T DO THIS

Retry immediately, forever

while (true) {
  const result = await call();
  if (result.ok) return result;
  // try again immediately
}

When 10,000 clients run this against a struggling service, they all hammer it at full rate. The service can't catch up; the retries crowd out anyone else; an outage that should have been 5 seconds becomes 5 hours. Retries amplify failures.

// SMART — DO THIS

Exponential backoff + jitter

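const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const MAX_ATTEMPTS = 5;

let delay = 100; // ms; doubles after each failed attempt
for (let i = 0; i < MAX_ATTEMPTS; i++) {
  const r = await call();
  if (r.ok) return r;
  if (i === MAX_ATTEMPTS - 1) break; // last attempt failed: give up without sleeping
  // wait `delay` plus a random 0..delay ms of jitter
  await sleep(delay + Math.random() * delay);
  delay *= 2;  // 100, 200, 400, 800
}
throw new Error("gave up");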

Wait 100ms (plus 0-100ms random), then 200ms (plus 0-200ms), then 400ms... The exponential growth reduces load fast; the jitter spreads retries so clients don't all attempt at the same moment. Cap retries at 3-5 — past that, give up.

The jitter detail matters enormously. Without it, if 1000 clients all fail at time T, they'd all wait exactly 100ms and retry at T+100, causing another simultaneous wave. With jitter, they retry across a spread of 100-200ms, smoothing the load. This is the same idea we touched on with DNS TTLs — randomization prevents synchronized stampedes.
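The effect of jitter on the first retry wave can be seen in a small simulation. This sketch uses the 1000 clients and 100ms base delay from the paragraph above; the 10ms bucket width and the function name are assumptions made for illustration.

```javascript
// Group each client's first retry time into fixed-width buckets and
// return the size of the busiest bucket (the peak the service must absorb).
function peakRetriesPerBucket(clients, baseDelayMs, useJitter, bucketMs = 10) {
  const buckets = new Map();
  for (let i = 0; i < clients; i++) {
    // Same scheme as the loop above: base delay plus a random 0..base of jitter.
    const jitter = useJitter ? Math.random() * baseDelayMs : 0;
    const bucket = Math.floor((baseDelayMs + jitter) / bucketMs);
    buckets.set(bucket, (buckets.get(bucket) || 0) + 1);
  }
  return Math.max(...buckets.values());
}

const withoutJitter = peakRetriesPerBucket(1000, 100, false);
const withJitter = peakRetriesPerBucket(1000, 100, true);

console.log("peak retries in one 10ms window, no jitter:", withoutJitter);
console.log("peak retries in one 10ms window, with jitter:", withJitter);
```

Without jitter every client lands in the same window; with jitter the same 1000 retries are spread over roughly ten windows.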

// THE PREREQUISITE FOR ANY RETRY

Idempotency

An operation is idempotent if running it twice has the same effect as running it once. GET /user/42 is idempotent (reading the same user twice changes nothing on the server). POST /charge is not idempotent: charging twice means charging twice. You can only safely retry idempotent operations.

For operations that aren't naturally idempotent (payments, signups, sending emails), the standard fix is an idempotency key: the client generates a unique ID and includes it; the server records "I've processed this key already" so duplicate requests return the original result instead of re-running. Stripe's API is the canonical example. Always design retry-prone operations to be idempotent.
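The server side of an idempotency key can be sketched in a few lines. This is a toy, in-memory illustration and not how any particular payment provider implements it: a real service would keep the keys in a durable store, expire them, and guard against two requests with the same key arriving concurrently. The names `charge`, `processed`, and the key format are hypothetical.

```javascript
// In-memory stand-ins for a durable store.
const processed = new Map(); // idempotency key -> original result
let chargesMade = 0;

function charge(idempotencyKey, amountCents) {
  if (processed.has(idempotencyKey)) {
    // Duplicate request: replay the first result instead of charging again.
    return processed.get(idempotencyKey);
  }
  chargesMade += 1; // the side effect happens only once per key
  const result = { chargeId: `ch_${chargesMade}`, amountCents };
  processed.set(idempotencyKey, result);
  return result;
}

// The client generates the key once and reuses it on every retry.
const first = charge("order-42-attempt", 1999);
const retry = charge("order-42-attempt", 1999);

console.log(first === retry, chargesMade); // true 1
```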

The complete picture for any unreliable downstream call: timeout (so it can't hang) + retry with backoff + jitter (limited to 3-5 attempts, only for idempotent ops) + circuit breaker (stop trying when failure is sustained). Each layer compounds. Build this once as a library or middleware; reuse it on every external call. Your future on-call rotations get noticeably calmer.
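How the layers nest can be sketched as one wrapper. This is a compact illustration under several assumptions: all names and defaults are hypothetical, the breaker counts one failure per call that exhausts its retries, the trial call after cooldown is allowed its normal retries, and the caller is responsible for only passing idempotent operations.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Layer 1: stop waiting after `ms` (does not cancel the underlying work).
function withTimeout(work, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timeout after ${ms}ms`)), ms);
  });
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Breaker state shared by all calls to one downstream.
function createBreaker({ threshold = 3, cooldownMs = 5000 } = {}) {
  return { threshold, cooldownMs, failures: 0, openedAt: null };
}

async function resilientCall(fn, breaker, options = {}) {
  const { timeoutMs = 1000, attempts = 3, baseDelayMs = 100 } = options;

  // Layer 3: while the circuit is open and cooling down, fail fast.
  const open = breaker.openedAt !== null;
  if (open && Date.now() - breaker.openedAt < breaker.cooldownMs) {
    throw new Error("circuit open: failing fast");
  }

  // Layer 2: bounded retries with exponential backoff and jitter.
  let delay = baseDelayMs;
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await withTimeout(fn(), timeoutMs);
      breaker.failures = 0; // success closes the circuit
      breaker.openedAt = null;
      return result;
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await sleep(delay + Math.random() * delay);
        delay *= 2;
      }
    }
  }

  // All attempts failed: record it and open the circuit at the threshold.
  breaker.failures += 1;
  if (breaker.failures >= breaker.threshold) breaker.openedAt = Date.now();
  throw lastError;
}
```

A caller would create one breaker per downstream with `createBreaker()` and pass it to every `resilientCall(() => doRequest(), breaker)` for that downstream.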

§ 06 — Eight words for the reliability layer

Vocabulary,
for the long night.

You'll meet these in every postmortem, every chaos-engineering exercise, every architecture review involving the word "uptime."

SPOF

/spɒf/

"Single Point of Failure." Any component whose loss takes the system down. The first thing reliability work hunts for. Fix: redundancy (N+1).

Circuit Breaker

/ˈsɜːkɪt ˌbreɪkə/

A wrapper around a flaky downstream call. After N failures it "opens" and short-circuits subsequent calls to fail fast. States: closed (normal), open (failing), half-open (testing recovery).

Exponential Backoff

/ɪkspəˈnɛnʃəl/

A retry pattern where each subsequent wait doubles: 100ms, 200ms, 400ms... Reduces load on a struggling service while still recovering from transient errors.

Jitter

/ˈdʒɪtə/

Randomization added to retry delays so clients don't all retry at exactly the same moment. Prevents synchronized retry storms.

Idempotency

/ˌaɪdɛmˈpoʊtənsi/

A property where running an operation twice has the same effect as once. The prerequisite for safe retries. Implemented via deterministic semantics or an idempotency key.

Bulkhead

/ˈbʌlkhɛd/

Isolating resources (thread pools, connection pools) per dependency so one slow downstream can't exhaust shared resources and bring down the whole service.

Graceful Degradation

/ˈɡreɪsfəl/

When some component fails, the system serves a reduced experience instead of crashing entirely. Recommendations down? Show popular items. Payment down? Save the cart.

Chaos Engineering

/ˈkeɪɒs/

Deliberately injecting failures in production (or staging) to verify your reliability patterns actually work. Netflix's Chaos Monkey is the famous example.

§ 07 — Knowledge check

Five questions.
Mind the cascade.

Last quiz of Phase C. Click an answer; explanation drops in instantly.


Hardened.

Phase C is yours. You can scale, observe, and survive. Now we go build — the next four modules apply everything to real systems.

§ 08 — The recap

Three ideas to
carry forward.

The reliability mindset shows up everywhere from this point on.

i

Eliminate SPOFs first

Every layer, twice. Databases, caches, load balancers, AZs. Without N+1 redundancy, every reliability pattern after is decorative.

ii

Timeout, retry, circuit-break

Three patterns, every external call. Timeouts so calls don't hang. Retries with backoff for transient errors. Circuit breakers for sustained failure.

iii

Don't amplify failures

Naive retries turn brief outages into long ones. Backoff + jitter + idempotency keys are how you retry without making things worse.


M.17 — Build a
URL shortener.
