One server isn't enough. Now what? You get two answers — and they take your system in completely different directions. The choice you make today shapes the next five years of architecture.
Back in Module 02, you cranked up the clients on a single server and watched the queue overflow. That was the universe telling you something: every single machine has a ceiling. CPU runs out. Memory fills up. Network port saturates. Disk I/O bottlenecks. Once you hit any one of these, your perfectly designed system stops responding. The question isn't whether you'll hit that wall — it's what you do when you do.
The fundamental insight is this: computers don't degrade gradually. They saturate and break. A server at 80% CPU runs fine. A server at 99% CPU is on the edge. A server pinned at 100% receives requests faster than it can serve them, and to its clients it looks unreachable. The degradation isn't smooth: latency climbs steeply in the last few percent of utilization. So when traffic grows, and it usually does, you need a plan. Two plans exist. They lead to different futures.
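One way to put numbers on that cliff is a textbook single-server queue (the M/M/1 model). This is an assumption brought in for illustration, not something this module uses: requests arrive at random, one worker serves them, and mean response time is the service time divided by (1 - utilization). The 10 ms service time below is hypothetical.

```python
def mean_response_time(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: S / (1 - rho).

    Assumption: a single worker with random (Poisson) arrivals. Real servers
    are more complicated, so this is a sketch of the curve's shape only.
    """
    if not 0 <= utilization < 1:
        # At 100% utilization the queue grows without bound.
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


SERVICE_TIME_S = 0.010  # hypothetical: 10 ms of work per request

for rho in (0.50, 0.80, 0.95, 0.99):
    latency_ms = mean_response_time(SERVICE_TIME_S, rho) * 1000
    print(f"{rho:.0%} busy -> about {latency_ms:.0f} ms per request")
```

Under these assumptions, response time is 5x the bare service time at 80% utilization and 100x at 99%. Real servers have multiple cores and other bottlenecks, so treat the shape of the curve, not the exact values, as the takeaway.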
There are exactly two ways to handle more load. You either get a bigger machine (scale up, also called vertical scaling): more CPU, more RAM, faster disk on the same server. Or you get more machines (scale out, also called horizontal scaling): distribute work across many smaller servers. The distinction sounds minor. It isn't. The two options lead to completely different architectures.
The right move is usually obvious in hindsight. Early-stage products scale up: it's cheaper, simpler, and you don't know yet whether your product will need anything else. Mature products scale out: traffic is high enough that no single machine could ever hold it, and you've earned the engineering investment by then. The art is knowing when to make the switch — usually before you have to.
Horizontal scaling has a giant hidden requirement: each request must be independent. If your code remembers things between requests — sessions in memory, locks, in-process counters — those memories live on one specific server. Send the next request to a different server, and the memory is gone. This is the difference between stateless and stateful systems, and it determines what scales easily and what doesn't.
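The failure mode is easy to reproduce in a few lines. The sketch below is a toy, not a real web framework: `StatefulServer`, the server names, and the round-robin routing are all hypothetical stand-ins for an app tier behind a load balancer.

```python
import itertools


class StatefulServer:
    """Hypothetical app server that keeps sessions in its own process memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: dict = {}  # exists only inside this one server

    def login(self, user: str) -> str:
        self.sessions[user] = {"cart": []}
        return f"{self.name}: logged in {user}"

    def add_to_cart(self, user: str, item: str) -> str:
        session = self.sessions.get(user)
        if session is None:
            # This server never saw the login, so it has nothing to remember.
            return f"{self.name}: no session for {user}"
        session["cart"].append(item)
        return f"{self.name}: cart={session['cart']}"


# Two servers, with requests alternating between them (round-robin).
servers = [StatefulServer("app-1"), StatefulServer("app-2")]
round_robin = itertools.cycle(servers)

first = next(round_robin).login("ada")                  # handled by app-1
second = next(round_robin).add_to_cart("ada", "book")   # handled by app-2

print(first)
print(second)
```

The login lands on `app-1`, the next request lands on `app-2`, and `app-2` has no record of the user. Nothing is broken in either server; the state simply lives in the wrong place.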
The lesson: horizontal scaling rewards stateless design. Push all the "remembering" out of your app servers — into databases, caches, and tokens — and your app tier becomes trivial to scale. Keep it in memory and you've trapped yourself on one box. This is why every "12-factor app" guide preaches statelessness. It's also why scaling databases is a separate, harder problem we'll touch on later.
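As one illustration of the "tokens" option, the sketch below moves the session into a signed token that the client sends with every request. It is a minimal example built on Python's standard `hmac` module, not a production scheme and not a real JWT: the `SECRET` value, the function names, and the claim fields are assumptions, and there is no expiry or key rotation.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Assumption: every app server gets the same secret from configuration.
SECRET = b"demo-secret-shared-by-all-app-servers"


def issue_token(claims: dict) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature checks out, otherwise None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    # Compare as bytes, in constant time, so tampered input cannot raise.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token({"user": "ada", "cart": ["book"]})
print(verify_token(token))
```

`verify_token` needs only the token and the shared secret, so any server in the tier can handle any request. The trade-off is that the state now travels with the client, which is why larger or sensitive state usually goes to a database or cache instead.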
Below: a side-by-side view of both strategies under the same traffic. Pick a load level — light to massive — and watch the vertical box swell, the horizontal grid multiply. Then read the numbers. By the time you hit 1M req/s, only one of these paths is still viable.
At 1K req/s, both strategies are viable. Vertical is slightly cheaper (one small box vs two), but horizontal already buys you redundancy. Most teams start here with vertical scaling — it's the simpler path until you have a reason to change.
For small traffic, vertical scaling is cheaper. One $50 box beats two $30 boxes. But the cost curve for vertical scaling turns sharply non-linear at the top end: near the largest tiers, each step up can roughly double the price for only slightly more performance. A 16x bigger server rarely costs just 16x more; at the high end it can cost 50x or 100x more. Cloud providers know exactly how valuable that top-tier box is, and price it accordingly.
Look at where the lines diverge. Up to about 10K req/s they're nearly identical. At 100K, vertical is starting to balloon — you're buying expensive big-iron tiers. At 500K, vertical is hitting six figures monthly. At 1M req/s, vertical literally can't get there — no single off-the-shelf machine can do it. Horizontal stays roughly linear: 2× the traffic ≈ 2× the boxes ≈ 2× the cost.
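The shape of those two curves can be sketched with a toy cost model. Every number below is hypothetical and chosen only to mirror the description above (the $30 and $50 boxes come from the text; the per-box capacity and the vertical tier table are invented for illustration and are not real cloud prices).

```python
import math
from typing import Optional

# Hypothetical values, for illustration only.
SMALL_BOX_RPS = 1_000    # assumed capacity of one small box, in req/s
SMALL_BOX_PRICE = 30     # monthly price of one small box
MIN_BOXES = 2            # keep at least two boxes for redundancy

# (max req/s, monthly price) for each vertical tier; invented values.
VERTICAL_TIERS = [
    (2_000, 50),
    (10_000, 300),
    (100_000, 8_000),
    (500_000, 120_000),
]


def horizontal_cost(rps: int) -> int:
    """Cost grows in proportion to traffic: more req/s means more boxes."""
    boxes = max(MIN_BOXES, math.ceil(rps / SMALL_BOX_RPS))
    return boxes * SMALL_BOX_PRICE


def vertical_cost(rps: int) -> Optional[int]:
    """Price of the smallest single machine that fits, or None if none does."""
    for capacity, price in VERTICAL_TIERS:
        if rps <= capacity:
            return price
    return None  # past the biggest tier: the wall


for rps in (1_000, 10_000, 100_000, 500_000, 1_000_000):
    print(rps, "req/s | vertical:", vertical_cost(rps), "| horizontal:", horizontal_cost(rps))
```

With these made-up inputs, the two costs match at 10K req/s, vertical pulls away by 100K, and at 1M req/s `vertical_cost` returns `None` because no tier in the table is large enough. Swap in real prices and measured per-box throughput before using a model like this for planning.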
This is why every system at internet scale is horizontal. Not because vertical is wrong — for small teams it's often right. But because the wall is real, and pretending it isn't is the most expensive way to discover it. Most mature systems use both: scale up while you can, then scale out when you must. Knowing where that line is — that's the engineering judgment.
You'll see these in every capacity-planning meeting from here on. Learn them.
N replicas = traffic divided by N, plus redundancy. Read replicas vs write replicas behave very differently.
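A rough sketch of that arithmetic, assuming traffic that any replica can serve and that spreads evenly. The function names, the traffic figure, and the per-replica capacity are hypothetical.

```python
import math


def replicas_needed(total_rps: float, per_replica_rps: float, spare: int = 1) -> int:
    """Replicas required to carry total_rps even with `spare` replicas down."""
    return math.ceil(total_rps / per_replica_rps) + spare


def load_per_replica(total_rps: float, replicas: int, failed: int = 0) -> float:
    """Even split of traffic across the replicas that are still healthy."""
    healthy = replicas - failed
    if healthy <= 0:
        raise ValueError("no healthy replicas left")
    return total_rps / healthy


# Hypothetical workload: 10,000 req/s, each replica comfortable at 2,500 req/s.
n = replicas_needed(10_000, 2_500)
print(n, "replicas")
print(load_per_replica(10_000, n), "req/s each when all are healthy")
print(load_per_replica(10_000, n, failed=1), "req/s each with one replica down")
```

The even split fits reads, which any replica can answer. In a typical single-primary setup, writes still go to one node, so adding replicas does not divide write load the same way.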
You see the two futures clearly. Next stop: load balancing — how horizontal scaling actually distributes the work.
This module shapes how every scaling conversation goes for the rest of your career.
Vertical scaling: simple but bounded. Horizontal scaling: complex but practically infinite. Pick based on where you are on the curve.
Move sessions, locks, and counters out of your app servers. Make every request self-contained. Then horizontal becomes trivial.
Start vertical. Switch to horizontal when the price curve forces you. Databases are a separate, harder problem, and we'll come back to that.