B-tree vs LSM-tree — read vs write trade-offs and write amplification
The on-disk structure a database uses is the single biggest determinant of whether it's read-optimized or write-optimized. B-trees power most relational stores (Postgres, MySQL/InnoDB); LSM-trees power most write-heavy NoSQL stores (Cassandra, RocksDB, LevelDB, ScyllaDB). Knowing the mechanism — not just the names — lets you predict a store's behavior.
B-tree: update in place
A B-tree is a balanced tree of fixed-size pages. A lookup walks from root to leaf in O(log n) page reads; an update finds the page and modifies it in place, splitting pages when they fill.
- Reads: excellent and predictable — a few page reads, and a given key lives in exactly one place.
- Writes: must locate and rewrite a page (often random I/O), append to the write-ahead log for durability, and occasionally split pages. This is the source of write amplification: one logical row write can cause a WAL write plus a full-page rewrite (a page is many KB), so the bytes written to disk far exceed the bytes of data changed.
LSM-tree: buffer and flush sequentially
An LSM-tree (log-structured merge tree) never updates in place:
- Writes go to an in-memory memtable (plus a commit log for durability) — effectively O(1), all sequential.
- When the memtable fills, it's flushed to disk as an immutable, sorted SSTable — a single sequential write.
- Background compaction merges SSTables, discarding superseded/deleted keys.
- Writes: superb throughput because every disk write is sequential and batched — ideal for ingest-heavy workloads.
- Reads: a key may live in the memtable or any of several SSTables, so a read may check multiple files. Bloom filters cheaply rule out SSTables that can't contain the key; block caches help; but reads are inherently more work than a B-tree's single location.
- Write amplification still exists, but it's deferred into compaction: data is rewritten multiple times as it's merged up the levels. The win is that it's sequential rewriting, not random.
B-TREE LSM-TREE
write -> find page -> rewrite write -> memtable (RAM) -> sequential flush
(random I/O) + WAL |
read -> O(log n) page reads v background
one location per key compaction merges SSTables
read -> memtable + N SSTables
(Bloom filters prune)
Choosing between them
- B-tree when reads dominate or you need predictable low-latency point reads and rich transactional semantics — most OLTP, most relational systems.
- LSM when writes dominate: time-series, event logs, metrics, write-heavy feeds, ingestion pipelines. You trade some read latency and pay compaction (CPU + I/O, and occasional latency spikes) for far higher write throughput.