You've built a distributed system. Now something breaks at 3am. Where do you even start? The difference between "five minutes to recover" and "two hours of panic" is whether your system can show you what it's doing — while it's doing it.
It's the on-call alert: "p99 latency above 1 second for 5 minutes." You squint at your phone, half-awake. The website looks normal. The error rate looks normal. But the latency tells a story — some users are getting service that's 10× slower than usual. Now: where do you start? Your answer depends entirely on what your system shows you when you ask it.
That gap (two hours vs. six minutes, panic vs. procedure) isn't about being smarter at 3am. It's about whether your system can tell you what it's doing. When something breaks (and it will), the only thing that matters is how fast you can answer three questions: What is broken? Why? When did it start? Observability is the discipline of building systems that can answer those questions cheaply, repeatedly, and at 3am.
The industry has settled on three complementary signals that, together, make a system observable. They're often called the three pillars: logs, metrics, and traces. They overlap a little, but each one is the right tool for a different question. A good production setup has all three; a great one has them connected so you can pivot from one to the other in a few clicks.
Logs: text records of individual events. One log line per request, per error, per significant action. High detail, hard to summarize. Use logs when you need to know exactly what one specific request did.
Metrics: numbers measured over time. Counters, gauges, histograms. Low detail, easy to summarize and graph. Use metrics to see overall system health and detect anomalies.
Traces: a request's journey across services. A trace shows timing for each hop: API → auth → DB → cache. The right tool for distributed systems where one request touches many components.
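To make the three metric types named above concrete, here is a minimal Python sketch. The classes are hand-rolled for illustration (the names `Counter`, `Gauge`, `Histogram` and the bucket bounds are assumptions, not any particular library's API): a counter only goes up, a gauge holds the latest value, and a histogram counts observations into buckets.

```python
from bisect import bisect_left


class Counter:
    """Monotonically increasing count, e.g. total requests served."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """A value that can go up or down, e.g. current queue depth."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, e.g. request latency in seconds."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches everything above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1


requests_total = Counter()
queue_depth = Gauge()
latency = Histogram([0.1, 0.5, 1.0])  # illustrative bucket bounds, in seconds

for seconds in [0.05, 0.2, 0.3, 2.0]:  # made-up sample latencies
    requests_total.inc()
    latency.observe(seconds)
queue_depth.set(7)
```

The histogram is what makes percentiles cheap to graph: it stores a handful of bucket counts instead of every individual observation.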
The mnemonic that sticks: metrics tell you something is wrong, logs tell you what specifically went wrong, traces tell you where in the system it went wrong. You start with metrics on a dashboard (the broadest signal: "p99 just spiked"), drill into traces for a slow request (the path: "the DB call took 2 seconds"), then drop into logs for the specific cause ("connection_refused at 03:09:14"). Each tool answers a different layer of the question.
This module focuses on the first two — logs and metrics — because they're what every team needs from day one, and they're 80% of the value. Distributed tracing is a deeper topic we touch on in the intermediate track. For now, lock in the difference between the two essentials.
Two things separate logs that help from logs that don't. First: levels, a tag on each log line indicating how loud the event is, so you can filter the firehose. Second: structure, meaning logs that are machine-readable instead of human-readable prose. Both sound obvious. Both are skipped in a large share of new codebases. Both come back to bite you the first time production breaks.
The levels let you control verbosity per environment (DEBUG locally, INFO in production) and per service (verbose for the one you're debugging, quiet for the rest). But what really transforms log usefulness is making them structured — emitting JSON instead of free-text strings. Compare:
User alice tried to log in at 12:42pm but it failed because of a bad password (her 3rd attempt from IP 203.0.113.42)
{
  "ts": "2024-01-15T12:42:33Z",
  "level": "warn",
  "msg": "login_failed",
  "user": "alice",
  "reason": "bad_password",
  "attempt_count": 3,
  "ip": "203.0.113.42"
}
With the structured version, a query like level=warn AND msg=login_failed AND attempt_count >= 3 finds every problematic login in milliseconds.
Structured logging is the single highest-leverage thing you can do for observability. Log analytics tools such as Splunk, Datadog, Loki, and CloudWatch work best with structured input and offer powerful queries on top. With unstructured logs you're forever fighting regex; with structured logs you're filtering by field. The cost is one extra import and a different log function. The payoff is your future self at 3am.
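As a rough illustration of both ideas (levels and structure), here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class, the `auth_demo` logger name, the `fields` key, and the second sample user are assumptions made up for this example. Note that Python names the level `warning` rather than `warn`.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    converter = time.gmtime  # emit timestamps in UTC

    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} become top-level keys.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("auth_demo")  # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)  # INFO in production; DEBUG locally
log.propagate = False

# Dropped: DEBUG is below the configured INFO level.
log.debug("cache_lookup", extra={"fields": {"user": "alice"}})
log.warning("login_failed", extra={"fields": {
    "user": "alice", "reason": "bad_password",
    "attempt_count": 3, "ip": "203.0.113.42"}})
log.warning("login_failed", extra={"fields": {
    "user": "bob", "reason": "bad_password",
    "attempt_count": 1, "ip": "198.51.100.7"}})

# Filtering by field instead of by regex.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
suspicious = [
    e for e in events
    if e["level"] == "warning"
    and e["msg"] == "login_failed"
    and e["attempt_count"] >= 3
]
```

Here `suspicious` ends up holding only the third-attempt event for alice. In a real deployment the filtering would happen in the log backend rather than in application code; the list comprehension only shows why field-based queries are simpler than pattern matching on prose.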
Google's SRE team distilled the problem to four signals that, monitored together, tell you almost everything about any service. They're called the four golden signals, and they're a great default starting point when you build a new dashboard. If you can only graph four things about a service, graph these.
Latency: how long requests take. Track percentiles, not averages: p50 shows the typical user, p99 reveals the suffering minority. avg(latency) hides outliers; the p99 finds them. This was the focus of M.08.
Traffic: how much demand the system is seeing. Requests per second, or bytes per second for streaming. Spikes hint at viral content; dips hint at upstream failures. Often the first signal that something external just happened to you.
Errors: the rate of failed requests. Be careful what counts: HTTP 500s yes, but also timeouts, wrong-but-200 responses, and silent failures. An error rate going from 0.1% to 1% means ten times as many failing requests, not a small change.
Saturation: how "full" the system is (CPU usage, memory usage, queue depth, connection-pool utilization). It tells you how much headroom is left. Latency typically starts rising before saturation hits 100%, so watch saturation as the early warning.
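The four signals above can be computed from very little raw data. The sketch below uses a made-up sample of 100 requests in a 10-second window and a hypothetical connection pool of 20; the numbers, the `percentile` helper, and the choice to count only 5xx statuses as errors are all assumptions for illustration. It also shows why the average hides what the p99 reveals.

```python
# Hypothetical sample: (latency_seconds, http_status) for one 10-second window.
requests = [(0.08, 200)] * 95 + [(0.12, 200)] * 3 + [(2.5, 200), (3.0, 500)]
window_seconds = 10

# Hypothetical connection pool state, used as the saturation measure.
pool_in_use, pool_size = 18, 20


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling of pct% of the sample count avoids float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


latencies = [lat for lat, _ in requests]

# Latency: the average looks almost healthy, the p99 does not.
avg_latency = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

# Traffic: requests per second over the window.
traffic_rps = len(requests) / window_seconds

# Errors: here only 5xx responses count, a simplification of the text above.
error_rate = sum(1 for _, status in requests if status >= 500) / len(requests)

# Saturation: how full the connection pool is.
saturation = pool_in_use / pool_size
```

In this sample the average latency is about 0.13 seconds and the p50 is 0.08 seconds, while the p99 is 2.5 seconds: two slow requests out of a hundred barely move the average but dominate the tail.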
Different communities have different acronyms — Google's is the four golden signals; Brendan Gregg's USE method covers Utilization, Saturation, Errors; Tom Wilkie's RED method does Rate, Errors, Duration. They all approximate the same idea: watch demand, watch failure, watch how the system is responding. Pick one framework and use it consistently across all your services. The standardization is more valuable than which framework you chose.
One final practical point: monitor what your users feel, not just what your machines feel. CPU at 80% might be fine if requests are still fast. CPU at 30% might be a disaster if requests are timing out somewhere downstream. The latency, error rate, and traffic numbers reflect user experience; CPU and memory reflect machine state. Both matter, but user-facing signals are the ones that should page someone.
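One way to express "page on what users feel" is a rule like the alert from the opening scenario: p99 latency above 1 second for 5 minutes. The sketch below is a simplified assumption of how such a rule could be evaluated over per-minute p99 values; the function name `should_page` and the sample series are hypothetical, and real alerting systems have their own rule syntax.

```python
def should_page(p99_per_minute, threshold_seconds=1.0, sustained_minutes=5):
    """Page only if the user-facing p99 stayed above the threshold for the whole window."""
    recent = p99_per_minute[-sustained_minutes:]
    return (
        len(recent) == sustained_minutes
        and all(value > threshold_seconds for value in recent)
    )


# A brief spike that recovers on its own: no page.
brief_spike = [0.2, 0.3, 1.8, 0.3, 0.2, 0.3]

# Five consecutive minutes above one second: page someone.
sustained = [0.3, 1.2, 1.4, 1.1, 1.6, 1.3]

page_for_spike = should_page(brief_spike)
page_for_sustained = should_page(sustained)
```

Requiring the condition to hold for the full window trades a few minutes of detection delay for far fewer false pages. Machine-level signals such as CPU can still feed dashboards without waking anyone up.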
You can read the signals now. Reliability is next — the discipline of making sure they show as little drama as possible.
Observability isn't a tool; it's a discipline. These three ideas are the foundation.
Metrics say something is wrong. Logs say what. Traces say where. Each one is the right tool for a different layer of the question.
JSON logs beat prose logs every time. The five-minute switch unlocks a decade of fast incident response. Just do it on day one.
Latency, traffic, errors, saturation. Graph these for every service and you've already got 80% of the observability anyone needs.