Disaster recovery testing — why does it matter and how do you do it?

Logical vs physical backups, pg_dump/pg_dumpall/pg_basebackup, point-in-time recovery and WAL archiving, the ecosystem tools, and why DR testing matters.

Cracked Java

An untested backup is not a backup — it's a hope. Disaster recovery testing is the practice of regularly restoring your backups and verifying the result, because the only thing that proves a backup works is a successful restore. The two numbers that frame the whole discipline are RPO and RTO.

RPO and RTO

RPO (Recovery Point Objective): how much data you can afford to lose, measured in time. A nightly pg_dump means an RPO of up to 24 hours. A base backup plus continuous WAL archiving can push RPO down to seconds or minutes, depending on how quickly WAL reaches the archive (for example, as bounded by archive_timeout). RPO is determined by your backup and archiving frequency.
RTO (Recovery Time Objective): how long you can afford to be down. It is determined by restore speed, which is exactly why physical backups (a file-level copy plus WAL replay) often win over logical ones (re-executing SQL to reload every table and rebuild every index) for large databases.

You don't pick a backup strategy in the abstract; you pick the cheapest one that meets the business's RPO and RTO.
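
The selection rule above can be sketched in a few lines of Python. This is only an illustration: the Strategy class, the cheapest_meeting helper, and every number in CANDIDATES are hypothetical placeholders, not benchmarks or recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_case_rpo: timedelta  # largest data-loss window this strategy allows
    measured_rto: timedelta    # restore time observed in a drill
    monthly_cost: float        # hypothetical cost figure, any currency


def cheapest_meeting(
    strategies: Iterable[Strategy], rpo: timedelta, rto: timedelta
) -> Optional[Strategy]:
    """Return the cheapest strategy that satisfies both objectives, or None."""
    acceptable = [
        s for s in strategies
        if s.worst_case_rpo <= rpo and s.measured_rto <= rto
    ]
    return min(acceptable, key=lambda s: s.monthly_cost, default=None)


# Illustrative values only (assumptions made for this example).
CANDIDATES = [
    Strategy("nightly pg_dump", timedelta(hours=24), timedelta(hours=6), 50.0),
    Strategy("base backup + WAL archiving", timedelta(minutes=1), timedelta(hours=1), 200.0),
]
```

With these made-up figures, a business that tolerates a day of data loss and eight hours of downtime gets the nightly dump, while a five-minute RPO forces the more expensive option. If nothing qualifies, the function returns None, which is the signal that either the budget or the objectives have to change.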

Why testing is non-negotiable

Backups fail silently in ways you only discover at restore time: a corrupt WAL segment, a globals dump (pg_dumpall --globals-only) that was never taken so roles are gone, an archive that filled up and stopped weeks ago, a restore runbook nobody has actually run. The failure mode is always the same: the backup looked fine until the day you needed it.
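
One of these silent failures, an archive that stopped receiving WAL, is cheap to detect ahead of time. The sketch below assumes the archive is a locally mounted directory that archive_command copies segments into; the function names and the threshold are hypothetical. For an object-store archive the same idea applies but the listing call would differ, and PostgreSQL's pg_stat_archiver view (last_archived_time, failed_count) offers a complementary server-side signal.

```python
import time
from pathlib import Path
from typing import Optional


def newest_file_age_seconds(archive_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Age in seconds of the most recently modified file, or None if the directory has no files."""
    mtimes = [p.stat().st_mtime for p in Path(archive_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    current = time.time() if now is None else now
    return current - max(mtimes)


def archive_is_stale(archive_dir: str, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """True if nothing has been archived recently enough, or nothing at all."""
    age = newest_file_age_seconds(archive_dir, now)
    return age is None or age > max_age_seconds
```

Run from a scheduler and wired to an alert, a check like this turns "stopped weeks ago" into "stopped an hour ago". It says nothing about whether the segments are readable, so it complements a restore drill rather than replacing one.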

How to test

Restore on a regular cadence to a throwaway environment — automate it, don't wait for an incident.
Run a real PITR drill: restore a base backup, replay WAL to a chosen recovery_target_time, and confirm the data is at the expected state.
Verify integrity, not just the exit code: compare row counts and checksums, and run an application smoke test against the restored copy. Tools such as pgBackRest help between drills: its verify command checks the integrity of the backups and archived WAL held in the repository.
Measure actual RTO during the drill — time the full restore so your RTO is a measured number, not a guess.
Keep the runbook current and let different people execute it, so recovery doesn't depend on one person.
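
The steps above can be tied together in a small drill harness. This is a minimal sketch under stated assumptions: run_drill, DrillReport, and recovery_settings are hypothetical names, and the restore and count_rows callables stand in for whatever your tooling actually does (for example, pg_basebackup or pgBackRest for the restore, and a SQL client for the counts). The three settings rendered by recovery_settings are real PostgreSQL parameters; on PostgreSQL 12 and later they go in postgresql.conf and recovery is triggered by an empty recovery.signal file in the data directory.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


def recovery_settings(restore_command: str, target_time: str) -> str:
    """Render PITR settings for postgresql.conf (PostgreSQL 12+ also needs recovery.signal)."""
    def quote(value: str) -> str:
        # Single quotes inside a configuration value are escaped by doubling them.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"restore_command = {quote(restore_command)}\n"
        f"recovery_target_time = {quote(target_time)}\n"
        "recovery_target_action = 'promote'\n"
    )


@dataclass
class DrillReport:
    restore_seconds: float
    rto_met: bool
    # table name -> (expected row count, actual row count)
    mismatches: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        return self.rto_met and not self.mismatches


def run_drill(
    restore: Callable[[], None],
    count_rows: Callable[[str], int],
    expected_counts: Dict[str, int],
    rto_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> DrillReport:
    """Time a restore, then compare row counts on the restored copy."""
    start = clock()
    restore()  # assumed to block until the restored server accepts queries
    elapsed = clock() - start

    mismatches = {}
    for table, expected in expected_counts.items():
        actual = count_rows(table)
        if actual != expected:
            mismatches[table] = (expected, actual)
    return DrillReport(elapsed, elapsed <= rto_seconds, mismatches)
```

The expected counts would be captured at the chosen recovery target time, so a mismatch means the restored state is not the one you asked for. Row counts are the weakest of the checks listed above. Checksums and an application smoke test can be added as further callables in the same shape, and recording restore_seconds from every run gives you a measured RTO trend instead of a single guess.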