What does ANALYZE do? When should you run it manually? — Cracked Java
// PostgreSQL · Query Planning & EXPLAIN
MidTheory

What does ANALYZE do? When should you run it manually?

ANALYZE collects the table statistics the planner uses to estimate row counts — and stale statistics are the number-one cause of bad plans. Note the name collision: the ANALYZE command gathers stats; the ANALYZE option of EXPLAIN runs a query. This question is about the command.

What it gathers

ANALYZE samples rows from each table and stores, in pg_statistic (readable via the pg_stats view):

  • Number of distinct values per column (n_distinct) — drives equality-selectivity estimates.
  • A list of most-common values and their frequencies (the MCV list) — handles skew.
  • A histogram of value distribution — drives range-selectivity (>, <, BETWEEN).
  • Physical/logical correlation — how well row order matches column order, which affects index-scan cost.
ANALYZE orders;              -- one table
ANALYZE;                     -- whole database
ANALYZE orders (customer_id) -- specific columns

When to run it manually

Autovacuum runs an autovacuum-analyze automatically once enough rows change, so for steady workloads you rarely run it by hand. You do need it manually after:

  • A bulk load / restore / large COPY — fresh data the planner hasn't seen, and autovacuum may not have caught up before your queries run.
  • A major data shift — e.g., a migration that rewrites a column's distribution.
  • Right after creating an expression indexANALYZE then gathers stats on the expression.
  • Inside a post-deploy migration script, as a cheap insurance step before traffic hits new tables.

Tuning the resolution

The sample size (and thus MCV/histogram detail) is controlled by default_statistics_target (default 100). Raise it for a specific column when a skewed distribution causes mis-estimates:

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;

The correlated-columns gotcha

Plain ANALYZE assumes columns are independent. If city and postal_code are tightly correlated, the planner multiplies their selectivities and badly under-estimates WHERE city = 'X' AND postal_code = 'Y'. The fix is extended statistics:

CREATE STATISTICS s_city_zip (dependencies) ON city, postal_code FROM addresses;
ANALYZE addresses;

Mark your status