You have ten servers. Each client only knows about one address. Something has to bridge those two facts — and decide, ten thousand times a second, which server handles each request. That something is a load balancer, and it's the quiet hero of every modern system.
Module 09 ended with you scaling horizontally: instead of one giant server, you've got 25 modest ones, all running the same code. Excellent. But there's a problem you haven't solved. The client only knows one thing: api.systemdesigntutorial.com. That's a single hostname pointing — historically — to a single server. With 25 servers, what does the address point to now? All of them? None of them? Whichever one's least busy?
The honest answer is: the client neither knows nor cares. Whoever owns api.systemdesigntutorial.com needs to put something in the middle that does the deciding — a small piece of software whose entire job is to receive incoming requests and spray them across the pool of servers behind it. Without this piece, horizontal scaling doesn't actually work. That's the load balancer.
A load balancer is a piece of software (or a piece of hardware, or — most commonly today — a managed cloud service) that sits between clients and a pool of servers. From the outside, it presents itself as a single address. From the inside, it knows about all your servers and forwards each request to one of them, according to some algorithm.
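That single picture is most of what you need. The LB does three things on every single request, ten thousand times a second: it receives the request at the one public address, picks a server from the pool, and forwards the request to it.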
And there's a fourth thing — running quietly in the background — that turns the LB from a switchboard into something actually useful: health checks. Every few seconds, the LB pings each server with GET /health. If a server stops responding, the LB removes it from the pool and stops sending traffic. Suddenly, "server crashed" becomes a non-event instead of an outage. We'll come back to that in §05.
Every load balancer comes with a menu of algorithms. The picks share the same goal, spreading the work evenly, but they go about it differently, and the differences matter when servers are uneven, slow, or dying.
The simplest algorithm. Keep a counter. For each new request, send it to server[counter % N], then increment. Cycle through forever. Stupid simple, surprisingly effective when servers are uniform.
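The counter-and-modulo idea above can be sketched in a few lines of Python. The class name, method name, and server labels here are illustrative assumptions, not part of any real load balancer's API.

```python
class RoundRobin:
    """Cycle through the servers in order using a single counter."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.counter = 0

    def pick(self):
        # server[counter % N], then increment
        server = self.servers[self.counter % len(self.servers)]
        self.counter += 1
        return server


rr = RoundRobin(["s1", "s2", "s3"])  # hypothetical server labels
picks = [rr.pick() for _ in range(7)]
print(picks)  # ['s1', 's2', 's3', 's1', 's2', 's3', 's1']
```

Note that the picker keeps no information about the servers themselves, which is why it cannot react when one of them slows down.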
Track the number of active connections on each server. New request? Send it to whichever server has the fewest in-flight. Naturally avoids slow or struggling servers because their connection count piles up.
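A minimal sketch of that bookkeeping, assuming the balancer is told when a request starts and when it finishes. The `acquire` and `release` names are hypothetical, and ties are broken by the order in which servers were registered.

```python
class LeastConnections:
    """Send each new request to the server with the fewest in-flight requests."""

    def __init__(self, servers):
        # in-flight request count per server
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # min() returns the first server with the lowest count
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # call this when the request on `server` completes
        self.active[server] -= 1


lc = LeastConnections(["s1", "s2", "s3"])
a = lc.acquire()  # s1 (all tied at 0)
b = lc.acquire()  # s2
c = lc.acquire()  # s3
lc.release(b)     # the request on s2 finishes first
d = lc.acquire()  # s2 again, since it now has the fewest in flight
```

A slow server releases its connections later, so its count stays high and it is picked less often.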
Hash the client's IP address; pick a server based on the result. As long as the pool doesn't change, the same client always hits the same server. Useful when you need session affinity, such as a cart sitting in memory on one specific server.
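One way to sketch this in Python is to hash the address with `hashlib` and take the result modulo the pool size. The function name and the choice of SHA-256 are assumptions made for illustration; Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import hashlib


def pick_by_ip(client_ip, servers):
    """Map a client IP to a server deterministically."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # interpret the first 8 bytes as an integer, then reduce to an index
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["s1", "s2", "s3", "s4"]  # hypothetical pool
first = pick_by_ip("203.0.113.7", servers)
second = pick_by_ip("203.0.113.7", servers)
print(first == second)  # True: same client, same server
```

With plain modulo hashing, adding or removing a server changes `len(servers)` and remaps many clients at once, which is the flexibility this algorithm gives up.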
Round robin, but each server has a weight (its relative capacity). A server weighted 3 gets three turns for every one turn that a server weighted 1 gets. Useful when your fleet isn't homogeneous, for example when mixing 4-vCPU and 16-vCPU instances.
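The simplest way to sketch weighting is to repeat each server in the rotation as many times as its weight. The function name, server labels, and weights below are assumptions; real implementations usually interleave the turns rather than sending a burst to one server.

```python
import itertools


def weighted_cycle(weights):
    """Yield servers forever, each appearing `weight` times per cycle."""
    expanded = [
        server
        for server, weight in weights.items()
        for _ in range(weight)
    ]
    return itertools.cycle(expanded)


turns = weighted_cycle({"big": 3, "small": 1})  # hypothetical fleet
first_eight = [next(turns) for _ in range(8)]
print(first_eight)
# ['big', 'big', 'big', 'small', 'big', 'big', 'big', 'small']
```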
In practice, many production setups choose Least Connections or a variant of it, such as "Power of Two Choices": pick two servers at random and send the request to whichever has fewer connections, which gets close to full Least Connections with very little state. Round robin is the textbook example, and often the out-of-the-box default, but production setups want something that reacts to actual load.
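The "Power of Two Choices" idea is short enough to sketch directly. The function name and the shape of the `active` mapping (server label to in-flight count) are assumptions for illustration.

```python
import random


def pick_two_choices(active, rng=random):
    """Sample two distinct servers and return the less loaded one."""
    first, second = rng.sample(list(active), 2)
    return first if active[first] <= active[second] else second


# hypothetical in-flight counts; s2 is struggling
active = {"s1": 4, "s2": 9, "s3": 1}
choice = pick_two_choices(active)
print(choice)  # never "s2": it loses any pairing against s1 or s3
```

Only two counters are read per request, so there is no need to scan or sort the whole pool.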
Below: a live load balancer with 5 servers behind it. Pick an algorithm and hit Start. Watch how requests fan out. Then use the toggles to kill a server or make one slow — and see which algorithms adapt, which ones don't.
Requests cycle through servers 1 → 2 → 3 → 4 → 5 → 1 → … Perfectly even distribution. Now try the disrupt buttons — kill a server or make one slow, and watch which algorithms adapt. Round Robin won't notice. Least Connections will.
The simulator made one thing visible: when a server dies, the LB needs to stop sending traffic to it. The mechanism behind that is mundane but essential — and the other thing nobody mentions about LBs is that they themselves are a single point of failure unless you specifically design around it.
Every few seconds, the LB sends a small request to each server — typically GET /health. If the server returns 200 OK, it stays in the pool. If it returns an error, times out, or doesn't respond at all, the LB marks it unhealthy and stops routing to it.
Apps should expose a real /health endpoint that returns 200 only when they can actually serve traffic: the database is reachable, there is memory free, and the process isn't shutting down. A bad health check is worse than none: a server that says "healthy" while broken means the LB keeps sending requests into the void.
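As a rough sketch of such an endpoint, the snippet below uses only Python's standard `http.server`. The `database_reachable` check is a hypothetical placeholder, and returning 503 for "not ready" is an assumption; the text above only requires that 200 be reserved for servers that can take traffic.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def database_reachable():
    """Hypothetical placeholder; a real app would ping its database here."""
    return True


# name -> zero-argument function returning True when that dependency is fine
CHECKS = {"database": database_reachable}


def health_status(checks, shutting_down=False):
    """Return (status_code, message); 200 only if every check passes."""
    if shutting_down:
        return 503, "shutting down"
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        return 503, "failing: " + ", ".join(failed)
    return 200, "ok"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, message = health_status(CHECKS)
        body = message.encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocks forever; call it from the app's entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping the decision in a plain function makes it easy to test without starting a server.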
Modern LBs typically check every 5–10 seconds and require a few consecutive failed checks before marking a server down, so that a brief blip doesn't cause flapping.
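The probe and the anti-flapping rule can be sketched as follows. The thresholds (3 failures to go down, 2 successes to come back), the 2-second timeout, and all names are assumptions; in particular, the recovery threshold is not described in the text above and is included only to make the sketch complete.

```python
import urllib.request


def probe(url, timeout=2.0):
    """Return True if GET `url` answers 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # covers connection errors, timeouts, and HTTP error statuses
        return False


class HealthTracker:
    """Flip a server's state only after several consecutive results."""

    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = True
        self.failures = 0
        self.successes = 0

    def record(self, ok):
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.rise_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fail_threshold:
                self.healthy = False
        return self.healthy


tracker = HealthTracker()
results = [True, False, False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in results]
print(states)
# [True, True, True, True, True, True, False, False, True]
```

In a real loop, each `record` call would be fed by `probe("http://<server>/health")` once per check interval.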
You put a load balancer in front of N servers to remove the single-point-of-failure problem. But now the load balancer is the single point of failure. If the LB dies, every server behind it becomes unreachable. Congratulations: you've moved the SPOF, not removed it.
The fix: run multiple load balancers. Two or more LBs, each capable of forwarding to the same pool. Then use DNS-level routing (or anycast IP, or VIP failover) so that the hostname resolves to whichever LB is currently alive. Managed cloud LBs (AWS ALB, Cloudflare, Google Cloud LB) handle this for you transparently — they run as fleets across availability zones.
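A toy model of the failover decision, assuming something outside the LBs can tell which ones are alive. In practice this logic lives in DNS, anycast routing, or a VIP failover mechanism rather than in application code, and every name below is hypothetical.

```python
def resolve_lb(load_balancers, is_alive):
    """Return the first load balancer that passes its liveness check."""
    for lb in load_balancers:
        if is_alive(lb):
            return lb
    raise RuntimeError("no load balancer is alive")


lbs = ["lb-a", "lb-b"]                # hypothetical LB labels
alive = {"lb-a": True, "lb-b": True}  # stand-in for real liveness checks

before = resolve_lb(lbs, alive.get)   # lb-a
alive["lb-a"] = False                 # lb-a crashes
after = resolve_lb(lbs, alive.get)    # lb-b takes over
print(before, after)  # lb-a lb-b
```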
The general rule: if removing one thing takes down your whole site, that thing needs a buddy. Apply recursively.
Put together, these two ideas — health checks below the LB, redundancy above it — turn a single load balancer into a self-healing layer. A server crashes? The LB stops using it within seconds. An LB crashes? DNS shifts traffic to the surviving one. The whole system absorbs failures that, on a single-server setup, would have been outages. That's what "highly available" really means: nothing is special enough that its death breaks the system.
You'll meet these in every architecture diagram from here on. Get comfortable with them.
Health check: a small request the LB sends to each server every few seconds (typically GET /health) to confirm it's alive. Failed probes remove the server from the rotation.
You see what the LB is doing now. Onward to DNS — the layer in front of the load balancer.
The LB is the most invisible part of modern systems — and the one whose absence breaks everything.
Clients hit one hostname. Behind it, the LB picks a server per request and forwards. The cluster behind the curtain is invisible.
Round robin is fine when everything's healthy. Least Connections adapts when servers slow down. IP hash sacrifices flexibility for affinity.
Below the LB: probe each server, drop the dead ones. Above the LB: run multiple LBs with DNS or VIP failover. Then nothing is special.