Module 06 / 20 · Phase B — Data & Storage · 40 min

Indexes & the
B-Tree secret.

How does a database find one row out of ten million in under a millisecond? The answer is a 50-year-old data structure quietly running half the internet. Crack open the B-tree and SQL stops looking like magic.

// What you'll know by the end

  • Why a scan of 10M rows is slow (and obvious in hindsight)
  • What an "index" actually is, under the hood
  • How a B-tree turns O(n) into O(log n)
  • When to add an index — and when it hurts you
§ 01 — A small thought experiment

Find one row
in ten million.

Your users table has 10 million rows. You ask the database: "give me the user whose email is alice@example.com." If the database had no clever tricks, it would do the obvious thing — check row 1, check row 2, check row 3, all the way to row ten million. On average it would touch 5 million rows before finding hers. Watch what that looks like across different table sizes:

// SEQUENTIAL SCAN · FIND ONE ROW BY EMAIL
1,000 rows
~0.5ms
100,000 rows
~50ms
1,000,000 rows
~500ms
10,000,000 rows
~5,000ms
100,000,000 rows
~50s 💀

Five seconds to find a single user is unacceptable. Fifty seconds is an outage. And yet databases do this lookup all the time — on tables with billions of rows — and return answers in single-digit milliseconds. Something very clever has to be happening, and that something is called an index.

§ 02 — The phone book idea

A library card
catalog.

Forget databases for a second. Think about a physical library with a million books. If someone asks you "do you have a book called The Great Gatsby?", you don't walk to every shelf. You walk to the card catalog — an alphabetical drawer of cards, each pointing to a book's location. You flip to "G," find the card, walk to the shelf number it lists, and you're done. Three minutes instead of three weeks.

// THE SAME LOOKUP · TWO WAYS

Walk every shelf
// no card catalog

Check every book in the library, one by one. Each one is "is this Gatsby? no... is this Gatsby? no..." For a million books, you're checking 500,000 on average before finding it.

O(n) · LINEAR · SLOW
Card catalog
// alphabetical index

Open the catalog drawer. Cards are sorted. You can skip half the alphabet at every step. Twenty checks gets you to any card in a million. Then walk straight to the shelf.

O(log n) · LOGARITHMIC · FAST

That's it. An index in a database is the card catalog. It's a separate, sorted structure that maps a column's values to the row locations. When you say SELECT * FROM users WHERE email = 'alice@...', the database checks if there's an index on email. If yes, it walks the index — a logarithmic number of steps — and jumps straight to the row. If no, it falls back to scanning every single row.

An index is a sorted side-table that says "here's where to find what you're looking for."

The structure that databases actually use for this isn't a stack of paper cards, but the idea is exactly the same — sorted, skippable, designed for fast lookups. The structure is called a B-tree (it stands for "balanced," not "binary"), and once you see how it works, the speed makes total sense.

§ 03 — Inside the B-tree

A tree built
to be shallow.

A B-tree is a tree where each node holds many keys (not just one, like a binary tree). Each node stores sorted keys and pointers to child nodes. When you search, you compare the target against the keys in one node and descend to the child whose range covers your value. Because each node has many children, the tree stays very shallow — even a billion rows fits in 3 or 4 levels.

// B-TREE LOOKUP · FINDING USER WITH ID = 4,267,891

2.5M 5M 7.5M ROOT · level 0 → 4.2M sits in 2.5M-5M [1M, 2M] 3M 4M 4.5M [5.5M, 7M] [8M, 9.5M] INTERNAL · level 1 4.2M 4.27M 4.31M 4.4M + row pointers LEAF · level 2 · found 4.27M ✓ → jumps to actual row in table

That picture is the whole secret. To find row 4,267,891 in a table of 10 million rows, the database checked just 3 nodes. The root node ("is it before or after 2.5M / 5M / 7.5M?"), one internal node ("is it before or after 3M / 4M / 4.5M?"), and the leaf node ("here's the exact key and its pointer to the actual row"). Three reads instead of 5 million. That's the difference between 5 milliseconds and 5 seconds.

The depth grows logarithmically. A B-tree with 100 keys per node holds about a million rows in 3 levels, a billion rows in 4 levels, and a trillion rows in 5. That ceiling is why databases scale so well. Doubling your data adds zero levels most of the time.

§ 04 — The index race · interactive lab

Now race them.
Watch what 1,000,000× faster looks like.

Pick a table size and hit Race. On the left, a sequential scan starts checking rows one by one. On the right, a B-tree lookup descends three levels and stops. The numbers tell the story even better than the animation.

INDEX_RACE.SIM // m.06 lab
TABLE SIZE:
⌘ NO INDEX · SEQUENTIAL SCAN O(n)
Rows examined
0
Elapsed
0ms
press Race to begin
⌘ INDEXED · B-TREE LOOKUP O(log n)
Nodes visited
0
Elapsed
0ms
press Race to begin
// VERDICT
Press Race to see the difference

Each row size shows a different ratio. At 10K rows, scanning is annoying but survivable. At 1M, it's the difference between snappy and sluggish. At 100M, it's the difference between working and broken. This is why indexes aren't optional at scale.

§ 05 — The cost of indexes

Indexes aren't
free.

If indexes made reads a million times faster with no downside, every column would have one. They don't, because indexes cost real things. Understanding what they cost is the difference between a sluggish database and a fast one.

// COST 1

Writes get slower

Every INSERT, UPDATE, or DELETE must also update every index that includes the changed column. A table with 5 indexes does 6× the work on every write. If your app writes more than it reads, indexes can hurt.

// COST 2

Disk & memory

An index is a separate B-tree stored on disk. It can be huge — sometimes as large as the table itself. The hot parts get cached in RAM, competing with the data itself for memory. More indexes = more storage cost.

// COST 3

Wrong index = no help

An index on email doesn't speed up queries on name. Worse: with a query the optimizer can't recognize, it might ignore your index entirely and scan anyway. Indexes are precise tools, not magic dust.

// COST 4

Low cardinality = useless

An index on a column with only 2-3 distinct values (like is_active) barely helps — every value still matches millions of rows. Indexes work best on columns with many unique values: emails, IDs, timestamps.

// WHEN TO ADD AN INDEX · QUICK RULES

i
Columns used in WHERE clauses on big tables — especially equality checks (email = ?) and ranges (created_at > ?)
YES, INDEX
ii
Foreign keys — Postgres doesn't auto-index them, and most JOINs rely on these
YES, INDEX
iii
Columns used to sort (ORDER BY) frequently, especially with LIMIT
YES, INDEX
iv
Tiny tables (under ~10K rows) — the scan is already fast enough
NO, SKIP
v
Columns with very few distinct values (booleans, status enums)
NO, SKIP
vi
Heavy write-only tables where reads are rare (e.g., audit logs)
NO, SKIP

The pragmatic workflow: don't pre-emptively index everything. Build the app, run real queries, look at slow query logs, then add an index where one is genuinely missing. PostgreSQL's EXPLAIN ANALYZE will literally show you when it's scanning vs. using an index — read those plans, and your intuition will sharpen fast.

§ 06 — Eight words for query land

Vocabulary,
for performance.

You'll see these in every conversation about slow queries. Know them well.

Index
/ˈɪndɛks/
A sorted side-structure that maps column values to row locations. Lets the database find rows without scanning. Almost always a B-tree.
B-Tree
/biː triː/
"Balanced tree." A tree where each node holds many keys, sorted. Shallow even for huge data. The default index type in every relational database.
Cardinality
/ˌkɑːdɪˈnælɪti/
How many distinct values a column has. High = good for indexing (email, IDs). Low = bad (boolean, status).
Full Table Scan
/fʊl ˈteɪbəl skæn/
When the database reads every row to answer a query. The thing indexes prevent. Appearing in your query plans = warning sign.
Composite Index
/ˈkɒmpəzɪt ˈɪndɛks/
An index on multiple columns together, e.g., (user_id, created_at). Order matters — works for queries on user_id alone, or both, but not created_at alone.
Query Planner
/ˈkwɪəri ˈplænə/
The component inside a database that decides how to execute a query — which indexes to use, join order, etc. EXPLAIN shows you what it picked.
Covering Index
/ˈkʌvərɪŋ ˈɪndɛks/
An index that contains all columns your query needs — so the database can answer entirely from the index without touching the table. Maximally fast.
EXPLAIN
/ɪkˈspleɪn/
A SQL keyword that shows the query plan instead of running the query. Your single best tool for debugging slow queries. Use it constantly.
§ 07 — Knowledge check

Five questions.
Defend your plan.

Quick test of the index intuition. Click an answer; get the explanation.

QUESTION 1 OF 5
Loading question...
Score: 0 / 5
5 / 5

Tuned.

You see the planner now. Onward — caching is next, and it shaves even more milliseconds.

§ 08 — The recap

Three ideas to
carry forward.

The mental model of "scan vs. tree" is one of the most useful things you'll ever learn.

i

Indexes are sorted side-tables

Not magic. A separate B-tree mapping column values to row locations. The database checks for one; if it exists, it uses it.

ii

O(log n) is the gift

A billion rows fits in 4 tree levels. Doubling your data adds about zero work. This is why databases scale to massive sizes.

iii

Indexes cost writes

Every write must update every index. Pick the few that matter. Read query plans (EXPLAIN) to know what's actually being used.

↓ UP NEXT

M.07 — Caching,
and the art of forgetting.

Indexes make a database fast. Caches make it invisible — by skipping it entirely. Time to look at the layer between your app and your database that handles the easy questions before they ever reach the disk.

Continue to Module 07 →