Module 15 / 20 · Phase C — Scale & Reliability · 40 min

Logging
& metrics.

You've built a distributed system. Now something breaks at 3am. Where do you even start? The difference between "five minutes to recover" and "two hours of panic" is whether your system can show you what it's doing — while it's doing it.

// What you'll know by the end

  • The three pillars of observability
  • Logs vs metrics vs traces — when to use each
  • Why structured logging changes everything
  • The four golden signals for any service
§ 01 — 3:14 AM

Your phone
buzzes.

It's the on-call alert: "p99 latency above 1 second for 5 minutes." You squint at your phone, half-awake. The website looks normal. The error rate looks normal. But the latency tells a story — some users are getting service that's 10× slower than usual. Now: where do you start? Your answer depends entirely on what your system shows you when you ask it.

// THE 3AM PAGE · TWO VERY DIFFERENT NIGHTS
// WITHOUT OBSERVABILITY
Two hours of panic
3:14 Alert fires. Open laptop.
3:16 SSH into a server. No metrics dashboard.
3:22 Tail logs. Text-only, 50 lines/sec, no filters.
3:35 Run top. CPU high. Why?
4:10 Notice a recent deploy. Coincidence?
5:05 Roll back blind. It works. Still don't know why.
// WITH OBSERVABILITY
Six minutes to fix
3:14 Alert fires. Open laptop.
3:15 Dashboard: p99 spike at 3:09. Error rate steady.
3:16 "Slow endpoint" panel: /api/orders.
3:17 Logs filtered: service=orders level=warn
3:18 "Slow query, duration=1200ms". Found it.
3:20 Roll back. Latency back to 30ms. Back to sleep.

That gap — two hours vs six minutes, panic vs procedure — isn't about being smarter at 3am. It's about whether your system can tell you what it's doing. When something breaks (and it will), the only thing that matters is how fast you can answer three questions: What is broken? Why? When did it start? Observability is the discipline of building systems that can answer those questions cheaply, repeatedly, and at 3am.

§ 02 — The three pillars

Three windows
into your system.

The industry has settled on three complementary signals that, together, make a system observable. They're often called the three pillars: logs, metrics, and traces. They overlap a little, but each one is the right tool for a different question. A good production setup has all three; a great one has them connected so you can pivot from one to the other in a few clicks.

// LOGS · METRICS · TRACES — THE OBSERVABILITY TRIAD

// PILLAR 1

Logs

"What happened?"

Text records of individual events. One log line per request, per error, per significant action. High detail, hard to summarize. Use logs when you need to know exactly what one specific request did.

12:42:01 INFO user_login user_id=42 ip=1.2.3.4 result=success
// PILLAR 2

Metrics

"How often? How much?"

Numbers measured over time. Counters, gauges, histograms. Low detail, easy to summarize and graph. Use metrics to see overall system health and detect anomalies.

http_requests_total{status="500"} = 1247
request_duration_p99 = 240ms
// PILLAR 3

Traces

"How did this flow?"

A request's journey across services. Shows timing for each hop: API → auth → DB → cache. The right tool for distributed systems where one request touches many components.

[15ms] api-gateway → [3ms] auth → [180ms] orders-db

The mental model that sticks: metrics tell you something is wrong, logs tell you what specifically went wrong, traces tell you where in the system it went wrong. You start with metrics on a dashboard (the broadest signal: "p99 just spiked"), drill into traces for a slow request (the path: "the DB call took 2 seconds"), then drop into logs for the specific cause ("connection_refused at 03:09:14"). Each tool answers a different layer of the question.
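To make the "traces tell you where" step concrete, here is a minimal sketch that finds the slowest hop in one request. It assumes a trace is just a flat list of hops with durations; real tracing systems record nested spans with start and end times. The `Span` class and `slowest_span` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One hop of a request: which service handled it and how long it took."""
    service: str
    duration_ms: float


def slowest_span(spans: list[Span]) -> Span:
    """Return the hop that consumed the most time in this trace."""
    return max(spans, key=lambda s: s.duration_ms)


# Hypothetical trace, mirroring the hops in the Pillar 3 example.
trace = [
    Span("api-gateway", 15),
    Span("auth", 3),
    Span("orders-db", 180),
]

worst = slowest_span(trace)
total = sum(s.duration_ms for s in trace)
print(f"{worst.service} took {worst.duration_ms:.0f}ms of {total:.0f}ms")
# prints "orders-db took 180ms of 198ms"
```

Once the trace points at orders-db, you know which service's logs to open next.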

Metrics tell you something is wrong. Logs tell you what. Traces tell you where.

This module focuses on the first two, logs and metrics, because they're what every team needs from day one and they deliver most of the value. Distributed tracing is a deeper topic we touch on in the intermediate track. For now, lock in the difference between the two essentials.

§ 03 — Log levels & the case for structure

Not every log
is equal.

Two things separate logs that help from logs that don't. First: levels, a tag on each log line indicating how loud the event is, so you can filter the firehose. Second: structure, which means emitting logs as machine-readable fields instead of human-readable prose. Both sound obvious. Both are routinely skipped in new codebases. Both come back to bite you the first time production breaks.

// LOG LEVELS · FROM "WHATEVER" TO "EVERYTHING IS ON FIRE"

DEBUG
Development noise. Variable contents, intermediate steps, "I'm here." Off in production by default because the volume would overwhelm log storage. Useful for tracking down specific bugs.
INFO
Normal events worth knowing about. User logged in, request completed, deploy started. The default "this happened" log. Most production logs are INFO.
WARN
Something concerning but not yet broken. A slow query, a deprecated API being called, a retry that succeeded after failing once. Worth investigating but doesn't page anyone.
ERROR
A failure. Request errored out, query timed out, external API returned 500. Something is broken for at least one user. Often triggers alerts when the rate spikes.
FATAL
The process is dying. Can't connect to the database, out of memory, unrecoverable corruption. Wake someone up immediately. Usually rare.
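The levels above map closely onto Python's standard `logging` module, which calls the top level CRITICAL rather than FATAL. The sketch below sets a different threshold per service; `configure_logger` and the service names are hypothetical, chosen only for illustration.

```python
import logging


def configure_logger(service: str, level_name: str) -> logging.Logger:
    """Create (or reuse) a per-service logger that drops events below level_name."""
    logger = logging.getLogger(service)
    logger.setLevel(level_name.upper())
    if not logger.handlers:  # avoid attaching a second handler on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# Verbose for the service you're debugging, quiet for the rest.
orders = configure_logger("orders", "DEBUG")
billing = configure_logger("billing", "INFO")

orders.debug("cart contents: %s", {"sku": "A1"})     # emitted: orders is at DEBUG
billing.debug("intermediate step reached")           # dropped: billing is at INFO
billing.info("request completed")
billing.warning("slow query, duration=%dms", 1200)
billing.error("payment API returned 500")
billing.critical("cannot connect to database")       # Python's name for FATAL
```

In practice the level name would usually come from configuration or an environment variable, so you can turn up verbosity without a code change.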

The levels let you control verbosity per environment (DEBUG locally, INFO in production) and per service (verbose for the one you're debugging, quiet for the rest). But what really transforms log usefulness is making them structured — emitting JSON instead of free-text strings. Compare:

// UNSTRUCTURED · HUMAN-FRIENDLY ONLY
User alice tried to log in at 12:42pm but it failed because of a bad password (her 3rd attempt from IP 203.0.113.42)
Looks fine until you want to count failed logins per user or find all IPs with 3+ failures. You'd need to parse English with regex. Painful, unreliable, slow.
// STRUCTURED · MACHINE-QUERYABLE
{
  "ts": "2024-01-15T12:42:33Z",
  "level": "warn",
  "msg": "login_failed",
  "user": "alice",
  "reason": "bad_password",
  "attempt_count": 3,
  "ip": "203.0.113.42"
}
Same information. Now queryable: level=warn AND msg=login_failed AND attempt_count >= 3 finds every problematic login in milliseconds.
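As a rough illustration of why fields beat prose, the query above can be written as a plain filter with no regex at all. This sketch assumes logs arrive as one JSON object per line; the sample events other than alice's are made up for the example.

```python
import json

# Hypothetical log lines, one JSON object per line, using the field
# names from the structured example above.
raw_lines = [
    '{"ts": "2024-01-15T12:42:33Z", "level": "warn", "msg": "login_failed", '
    '"user": "alice", "reason": "bad_password", "attempt_count": 3, "ip": "203.0.113.42"}',
    '{"ts": "2024-01-15T12:42:40Z", "level": "info", "msg": "login_ok", '
    '"user": "bob", "ip": "198.51.100.7"}',
    '{"ts": "2024-01-15T12:43:02Z", "level": "warn", "msg": "login_failed", '
    '"user": "carol", "reason": "bad_password", "attempt_count": 1, "ip": "192.0.2.10"}',
]


def repeated_login_failures(lines, min_attempts=3):
    """Yield events where level=warn AND msg=login_failed AND attempt_count >= min_attempts."""
    for line in lines:
        event = json.loads(line)
        if (
            event.get("level") == "warn"
            and event.get("msg") == "login_failed"
            and event.get("attempt_count", 0) >= min_attempts
        ):
            yield event


suspicious = list(repeated_login_failures(raw_lines))
print([(e["user"], e["ip"]) for e in suspicious])
# prints [('alice', '203.0.113.42')]
```

A log analytics tool runs the same kind of field comparison, only against an index instead of a Python list.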

Structured logging is the single highest-leverage thing you can do for observability. Log analytics tools such as Splunk, Datadog, Loki, and CloudWatch are built to index and query fields, so they work best on structured input. With unstructured logs you're forever fighting regex; with structured logs you're filtering by field. The cost is usually one extra import and a different log function. The payoff goes to your future self at 3am.
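Here is one way the switch might look using only Python's standard library: a small formatter that renders every record as a single JSON line. `JsonFormatter` and the `fields` key are hypothetical names for this sketch, not part of the `logging` module. Note that Python names the level "warning", so the output says `"level": "warning"` rather than `"warn"`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # "fields" is a key this sketch passes through `extra=`; the logging
        # module copies it onto the record as an attribute.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The message is a stable event name; the details travel as fields.
logger.warning(
    "login_failed",
    extra={"fields": {"user": "alice", "reason": "bad_password",
                      "attempt_count": 3, "ip": "203.0.113.42"}},
)
```

The design choice worth copying is the split between a fixed event name in `msg` and variable data in fields, which keeps every event of the same kind queryable with the same filter.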

§ 04 — Live dashboard · interactive lab

Now feel the
signals.

Below: a live production dashboard. Metrics tick. Logs stream. Everything updates in real time. Click an incident button to inject something into the system — a slow database, an error spike, a traffic surge. Then read the signals: which one shows you what just happened? Each incident tells a different story across the four panels.

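PROD_DASHBOARD.SIM // m.15 lab
[Interactive dashboard: incident-injection buttons, four metric tiles, latency chart]
// BASELINE READINGS
Requests / sec: 50/s
p50 latency: 30ms
p99 latency: 95ms
Error rate: 0.2%
// LATENCY · last 60s · p50, p95, p99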
// what to look for
Baseline: all clear
Everything healthy. p50 around 30ms, p99 around 90ms, error rate near zero, traffic steady at ~50 req/s. Inject an incident to see how each signal responds differently.
// STRUCTURED LOG STREAM · live
§ 05 — The four golden signals

What to watch,
everywhere.

Google's SRE team distilled the problem to four signals that, monitored together, tell you almost everything about any service. They're called the four golden signals, and they're a great default starting point when you build a new dashboard. If you can only graph four things about a service, graph these.

// SIGNAL 1

Latency

How long requests take. Track percentiles, not averages — p50 shows the typical user, p99 reveals the suffering minority. avg(latency) hides outliers; the p99 finds them. This was the focus of M.08.

// SIGNAL 2

Traffic

How much demand the system is seeing. Requests per second. Bytes per second for streaming. Spikes hint at viral content; dips hint at upstream failures. Often the first signal that something external just happened to you.

// SIGNAL 3

Errors

The rate of failed requests. Be careful what counts: HTTP 500s yes, but also slow timeouts, wrong-but-200 responses, and silent failures. An error rate going from 0.1% to 1% means ten times as many users are hitting failures; that is an incident, not a small change.

// SIGNAL 4

Saturation

How "full" the system is — CPU usage, memory usage, queue depth, connection-pool utilization. Tells you how much headroom is left. Latency rises before saturation hits 100%, so watch saturation as the early warning.
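To see the four signals side by side, here is a small sketch that computes them from one window of request records. The `Request` type, the `golden_signals` helper, and the connection-pool numbers are all assumptions made for illustration; a real service would read these from its metrics library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, pool_in_use, pool_size):
    """Summarize one time window into latency, traffic, errors, and saturation."""
    latencies = [r.latency_ms for r in requests]
    return {
        "latency_avg_ms": sum(latencies) / len(latencies),
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(not r.ok for r in requests) / len(requests),
        "saturation": pool_in_use / pool_size,
    }


# Hypothetical 10-second window: 98 fast requests and 2 slow failures.
window = [Request(30, True)] * 98 + [Request(1200, False)] * 2
signals = golden_signals(window, window_s=10, pool_in_use=18, pool_size=20)

for name, value in signals.items():
    print(f"{name}: {value}")
# The average is 53.4ms and the p50 is 30ms, both of which look healthy.
# The p99 is 1200ms, which is the number that exposes the slow requests.
```

The same window also shows a 2% error rate and a connection pool at 90%, so three of the four signals point at trouble that the average latency hides.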

Different communities have different frameworks: Google's is the four golden signals; Brendan Gregg's USE method covers Utilization, Saturation, Errors (aimed at resources such as CPUs and disks); Tom Wilkie's RED method covers Rate, Errors, Duration (aimed at request-driven services). They all approximate the same idea: watch demand, watch failure, watch how the system is responding. Pick one framework and use it consistently across all your services. The standardization is more valuable than which framework you chose.

One final practical point: monitor what your users feel, not just what your machines feel. CPU at 80% might be fine if requests are still fast. CPU at 30% might be a disaster if requests are timing out somewhere downstream. The latency, error rate, and traffic numbers reflect user experience; CPU and memory reflect machine state. Both matter, but user-facing signals are the ones that should page someone.
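A paging rule built on a user-facing signal might look like the sketch below, modeled on the "p99 latency above 1 second for 5 minutes" alert from the opening story. It assumes one p99 sample per minute; `should_page` and its parameters are hypothetical names, not any alerting tool's API.

```python
def should_page(p99_samples_ms, threshold_ms=1000, sustained_points=5):
    """Page only if the most recent `sustained_points` samples all exceed the threshold.

    With one sample per minute, 5 points is roughly "for 5 minutes".
    Requiring a sustained breach keeps a single slow minute from waking anyone.
    """
    if len(p99_samples_ms) < sustained_points:
        return False
    recent = p99_samples_ms[-sustained_points:]
    return all(sample > threshold_ms for sample in recent)


# Hypothetical per-minute p99 readings in milliseconds.
brief_blip = [95, 90, 1400, 92, 95, 91, 93]
sustained = [95, 90, 1200, 1300, 1250, 1400, 1100]

print(should_page(brief_blip))  # prints False
print(should_page(sustained))   # prints True
```

Notice that CPU never appears in the rule. Machine-state metrics stay on the dashboard for diagnosis, while the page itself is driven by what users experience.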

§ 06 — Eight words for the obs layer

Vocabulary,
for the 3am page.

You'll see these in every incident report, every SLO conversation, every dashboard you build. Get fluent.

Observability
/əbˌzɜːvəˈbɪlɪti/
The property of a system that lets you understand what it's doing from the outside. Built from logs, metrics, and traces. "Obs" for short.
SLO / SLA
/ɛs ɛl oʊ/
"Service Level Objective" — internal target ("99.9% of requests under 200ms"). "Service Level Agreement" — promise to customers, often with money attached.
Counter
/ˈkaʊntə/
A metric that only goes up — request counts, errors, bytes sent. Reset only on process restart. Used to derive rates by differentiating over time.
Gauge
/ɡeɪdʒ/
A metric that goes up and down — current CPU usage, memory in use, queue depth. Snapshot of a value at a point in time.
Histogram
/ˈhɪstəɡræm/
A metric that tracks the distribution of values, usually latency, by counting samples into buckets. Lets you estimate percentiles (p50, p95, p99) without storing every individual sample.
Cardinality
/ˌkɑːdɪˈnælɪti/
The number of unique label/tag combinations on a metric. Beware high cardinality (user_id as a label) — it can crush your metrics store.
Alert
/əˈlɜːt/
A condition on a metric that, when met, notifies someone (Slack, PagerDuty, phone). Should be tied to user impact, not arbitrary thresholds.
Dashboard
/ˈdæʃbɔːd/
A page of correlated graphs and stats. Grafana, Datadog, CloudWatch. One per service, focused on the four golden signals. Bookmarked by your future self at 3am.
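Three of the entries above (counter, histogram, cardinality) are easier to remember with a little arithmetic attached. The helpers below are hypothetical sketches of the ideas, not any metrics library's API, and the bucket bounds and label values are invented for the example.

```python
def rate_per_second(count_then, count_now, elapsed_s):
    """Derive a rate from two readings of a counter.

    A lower second reading means the process restarted and the counter
    reset, so the new reading is treated as the increase since restart.
    """
    increase = count_now - count_then if count_now >= count_then else count_now
    return increase / elapsed_s


def histogram_percentile(bucket_bounds_ms, bucket_counts, pct):
    """Estimate a percentile as the upper bound of the bucket that contains it.

    bucket_counts[i] counts samples at or below bucket_bounds_ms[i] and
    above the previous bound (per-bucket, not cumulative).
    """
    target = pct * sum(bucket_counts) / 100
    seen = 0
    for bound, count in zip(bucket_bounds_ms, bucket_counts):
        seen += count
        if seen >= target:
            return bound
    return bucket_bounds_ms[-1]


def max_series(label_values):
    """Cardinality upper bound: one series per combination of label values."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total


print(rate_per_second(1200, 1500, 60))   # prints 5.0 (requests per second)

bounds = [50, 100, 250, 500, 1000]       # bucket upper bounds in ms
counts = [900, 80, 15, 4, 1]             # 1000 samples in total
print(histogram_percentile(bounds, counts, 50))  # prints 50
print(histogram_percentile(bounds, counts, 99))  # prints 250

labels = {"status": ["200", "404", "500"], "endpoint": ["/api/orders", "/api/users"]}
print(max_series(labels))                # prints 6
```

The histogram answer is only as precise as its buckets: the p99 here is "somewhere at or below 250ms", which is the price of not storing every sample. The cardinality product also shows why a `user_id` label is dangerous, since it multiplies the series count by the number of users.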
§ 07 — Knowledge check

Five questions.
Mind the signals.

Test the observability intuition. Click an answer; explanation appears immediately.


Observed.

You can read the signals now. Reliability is next — the discipline of making sure they show as little drama as possible.

§ 08 — The recap

Three ideas to
carry forward.

Observability isn't a tool; it's a discipline. These three ideas are the foundation.

i

Logs + metrics + traces

Metrics say something is wrong. Logs say what. Traces say where. Each one is the right tool for a different layer of the question.

ii

Structure matters

JSON logs beat prose logs every time. The five-minute switch unlocks a decade of fast incident response. Just do it on day one.

iii

Four golden signals

Latency, traffic, errors, saturation. Graph these for every service and you've already covered most of the observability a typical service needs.

↓ UP NEXT

M.16 — Reliability:
SPOFs, retries,
circuit breakers.

You can now see when something breaks. Time to stop it from breaking in the first place — or at least from breaking everything when one piece fails. The patterns that keep systems standing up under stress.
