Resharding without downtime, and detecting/mitigating hot partitions.

Single-leader / multi-leader / leaderless replication, sync vs async, replication lag, sharding strategies, consistent hashing, resharding, and hot partitions.

Cracked Java

Resharding without downtime, and handling hot partitions

Two operational realities turn partitioning from a one-time decision into an ongoing concern: you eventually need more capacity (resharding), and load is rarely perfectly even (hot partitions). Senior interviews probe both because they're where partitioning meets production.

Resharding without downtime

Resharding is moving data when you add or remove nodes, or change the partition scheme. Done naively (take the system down, repartition, bring it back) it means an outage. Done right, it's online.

Key design choice: more partitions than nodes. A widely used trick (Cassandra, Elasticsearch, Kafka) is to create a fixed, large number of logical partitions up front — far more than nodes — and assign multiple partitions per node. Scaling then means reassigning whole partitions to new nodes, never re-hashing individual keys. With consistent hashing's virtual nodes you get the same effect: adding a node steals scattered arcs rather than splitting ranges.

The online migration flow:

Online resharding: dual-write, backfill, verify, cut over

Dual-write the moving keys to both old and new shards so neither falls behind during the copy.
Backfill historical data in the background, throttled to protect live traffic.
Verify with checksums / row counts before trusting the new shard.
Cut over reads, then drain and remove the old location. Keep the ability to roll back until cutover is confirmed.

Managed stores (DynamoDB, Vitess, CockroachDB) automate most of this — splitting and migrating partitions transparently. Naming that you'd prefer a store that reshards automatically is a valid pragmatic (EU/contracting) answer.

Hot partitions

A hot partition is one shard receiving disproportionate traffic — a thundering-herd key. Causes: a celebrity user, a viral item, a monotonic key on range sharding, or a low-cardinality partition key.

Detection: per-partition metrics — QPS, latency, CPU, and throttle/error rate per shard. A single shard's p99 climbing while others idle is the signature. DynamoDB surfaces this as throttled partitions; you can also log access counts per key.

Mitigations:

Technique	What it does	When
Better partition key	Pick a higher-cardinality key so load spreads	Design time; the real fix
Key salting	Append a random/bucketed suffix → one logical key becomes K physical keys	Write-hot single key
Caching	Absorb reads to the hot key in Redis/CDN before they hit the shard	Read-hot key (very common)
Split the hot partition	Give the hot range its own shard(s)	Range-shard hot spot
Request coalescing	Collapse many concurrent reads of the same key into one	Read stampede