Postgres Internals 109: Vacuum & Autovacuum

March 29, 2026 · 11 min read · Part 16 / 19

In Part 8 we saw how the query planner picks execution strategies — and how bad statistics lead to bad plans. This final Postgres Internals installment goes deeper into the storage-level mechanism behind that staleness and bloat: MVCC. Every UPDATE and DELETE in Postgres creates dead tuples. VACUUM cleans them up. Understanding this mechanism is what separates databases that stay healthy under load from ones that silently accumulate bloat until something breaks.

Why dead tuples exist

Remember from Part 7: Postgres never overwrites a row in place. When you UPDATE a row, Postgres writes a brand new version of the row (a new tuple) and marks the old version as deleted by setting its xmax to the current transaction ID. When you DELETE a row, it also just sets xmax — the row stays in the heap until VACUUM removes it.

-- This UPDATE doesn't modify the row in place
UPDATE users SET last_login = NOW() WHERE id = 42;

-- What actually happens in the heap:
-- Old tuple: user_id=42, last_login='2026-03-01'  [xmin=100, xmax=205]  ← dead
-- New tuple: user_id=42, last_login='2026-03-29'  [xmin=205, xmax=0]    ← live

On a table receiving 1,000 updates per second, after one hour you have 3.6 million dead tuples. They take up real disk space. Scans have to skip over them. Indexes have entries pointing to them. This is table bloat — and it accumulates until VACUUM intervenes.
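
The accumulation arithmetic, and the dead_pct column computed by the monitoring query below, can be sketched in Python. The 10-million-row live count is a hypothetical figure for illustration:

```python
def dead_pct(n_live_tup: int, n_dead_tup: int) -> float:
    """Dead-tuple percentage, mirroring the NULLIF-guarded SQL expression."""
    total = n_live_tup + n_dead_tup
    return round(100.0 * n_dead_tup / total, 1) if total else 0.0

# Each UPDATE leaves behind exactly one dead tuple (the old row version),
# so 1,000 updates/second for one hour means:
updates_per_sec, seconds = 1_000, 3_600
dead = updates_per_sec * seconds
assert dead == 3_600_000

# Against a hypothetical 10M live rows, that's already over a quarter of the heap:
assert dead_pct(10_000_000, dead) == 26.5
```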

-- How many dead tuples does a table have right now?
SELECT
  relname,
  n_live_tup,
  n_dead_tup,
  round(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_pct,
  last_vacuum,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

What VACUUM does

VACUUM does three things:

  1. Marks dead tuples as reusable. It scans the heap, finds tuples that are no longer visible to any active transaction, and marks their space as available for future inserts. It does NOT return this space to the operating system (that's VACUUM FULL's job).
  2. Updates the Free Space Map (FSM). Postgres tracks which pages have free space so future inserts can find slots without scanning the whole table. VACUUM keeps the FSM accurate.
  3. Updates the Visibility Map (VM). Each page has a flag indicating whether all tuples on it are visible to all transactions (no dead tuples). Index-Only Scans rely on this flag — if a page is "all-visible," Postgres can skip the heap fetch entirely.

-- Run VACUUM manually on a specific table
VACUUM users;

-- With verbose output to see what it's doing
VACUUM VERBOSE users;

-- Sample verbose output:
-- INFO:  vacuuming "public.users"
-- INFO:  scanned index "users_pkey" to remove 12453 row versions
-- INFO:  "users": removed 12453 row versions in 234 pages
-- INFO:  "users": found 12453 removable, 450000 nonremovable row versions in 5123 pages
-- INFO:  index "users_pkey" now contains 450000 row versions in 3421 pages

VACUUM vs VACUUM FULL

They sound similar. They are very different.

VACUUM (regular): Reclaims space for reuse within the same table file. Does NOT shrink the file on disk. Does NOT block reads or writes — it runs concurrently with normal operations. The right choice for routine maintenance.

VACUUM FULL: Rewrites the entire table to a new file, compacts it, then replaces the original. This returns disk space to the OS and eliminates bloat entirely. The catch: it takes an ACCESS EXCLUSIVE lock — blocking all reads and writes for the duration. On a large table this could be minutes or hours.

VACUUM (regular):
  ✅ Non-blocking
  ✅ Runs concurrently
  ❌ Doesn't shrink file on disk
  Use: routine maintenance, daily autovacuum

VACUUM FULL:
  ✅ Returns disk space to OS
  ✅ Eliminates all bloat
  ❌ Exclusive lock — blocks everything
  ❌ Slow on large tables
  Use: one-time cleanup after massive deletes, when table is severely bloated
       and you can schedule downtime

For production tables where you can't afford downtime, the alternative to VACUUM FULL is pg_repack — an extension that rewrites the table without holding an exclusive lock for the entire operation.

-- Check if a table has severe bloat worth a VACUUM FULL
SELECT
  relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  pg_size_pretty(pg_relation_size(relid)) AS table_size,
  n_dead_tup,
  n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY pg_relation_size(relid) DESC;

Autovacuum

Running VACUUM manually doesn't scale. Autovacuum is the background daemon that automatically vacuums tables when they accumulate enough dead tuples.

Default trigger threshold for a table: autovacuum_vacuum_threshold (50 dead tuples) plus autovacuum_vacuum_scale_factor (0.2, i.e. 20% of live tuples). Autovacuum fires once the dead-tuple count exceeds that sum.

Trigger condition: n_dead_tup > autovacuum_vacuum_threshold + (autovacuum_vacuum_scale_factor × n_live_tup)
Default:           n_dead_tup > 50 + (0.2 × n_live_tup)

Table with 100,000 rows: autovacuum triggers at 100,000 × 0.2 + 50 = 20,050 dead tuples
Table with 10,000,000 rows: autovacuum triggers at 2,000,050 dead tuples
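
As a sanity check, the trigger condition can be expressed as a small Python predicate that reproduces both worked examples:

```python
def autovacuum_triggers(n_live_tup: int, n_dead_tup: int,
                        threshold: int = 50, scale_factor: float = 0.2) -> bool:
    """Default autovacuum trigger condition:
    n_dead_tup > threshold + scale_factor * n_live_tup."""
    return n_dead_tup > threshold + scale_factor * n_live_tup

# 100,000-row table: the line is at 20,050 dead tuples
assert not autovacuum_triggers(100_000, 20_050)   # exactly at the line: not yet
assert autovacuum_triggers(100_000, 20_051)       # one more dead tuple fires it

# 10,000,000-row table: the line is at 2,000,050 dead tuples
assert not autovacuum_triggers(10_000_000, 2_000_050)
assert autovacuum_triggers(10_000_000, 2_000_051)
```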

The scale factor is the problem. On a large table, autovacuum might not trigger until 2 million dead tuples accumulate. By then, queries are scanning past all those dead tuples, indexes are oversized, and the eventual vacuum takes a long time to catch up.

Fix: tune autovacuum per-table using storage parameters, not globally:

-- For a high-churn table: trigger autovacuum more aggressively
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor = 0.01,   -- trigger at 1% dead tuples instead of 20%
  autovacuum_vacuum_threshold = 100,       -- base count added on top of the 1%
  autovacuum_analyze_scale_factor = 0.005  -- also update stats more frequently
);

-- For a very large, write-heavy table: reduce I/O throttling so vacuum finishes faster
ALTER TABLE events SET (
  autovacuum_vacuum_cost_delay = 1        -- 1ms delay (default is 2ms since PG12, 20ms before)
);

Table bloat and index bloat

Detecting table bloat

pg_stat_user_tables gives you live and dead tuple counts. For a more precise bloat estimate, the pgstattuple extension scans the actual pages:

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Detailed bloat analysis (scans the table — use on non-peak hours)
SELECT * FROM pgstattuple('orders');

-- Output includes:
-- table_len:       total bytes on disk
-- tuple_len:       bytes used by live tuples
-- dead_tuple_len:  bytes used by dead tuples
-- free_space:      bytes already reclaimed (available for reuse)

A simpler heuristic from catalog tables (no extension required):

-- Estimated bloat ratio per table
SELECT
  relname AS table_name,
  n_live_tup,
  n_dead_tup,
  CASE WHEN n_live_tup + n_dead_tup > 0
    THEN round(100.0 * n_dead_tup / (n_live_tup + n_dead_tup), 1)
    ELSE 0
  END AS dead_pct,
  pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE schemaname = 'public'
ORDER BY n_dead_tup DESC;

Index bloat

Indexes accumulate bloat too — often more insidiously than tables. When a row is deleted, its index entry is marked as dead but stays in the index tree until VACUUM cleans it. B-tree indexes that receive many deletes develop large amounts of "half-empty" pages that waste memory and I/O without being reused.

-- Check index sizes and their associated tables
SELECT
  indexname,
  pg_size_pretty(pg_relation_size(indexname::regclass)) AS index_size,
  pg_size_pretty(pg_relation_size(tablename::regclass)) AS table_size
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexname::regclass) DESC
LIMIT 20;

An index that's 3× the size of its table is a signal worth investigating. REINDEX CONCURRENTLY rebuilds an index without blocking reads and writes (Postgres 12+):

REINDEX INDEX CONCURRENTLY orders_user_id_idx;

Transaction ID wraparound: the doomsday clock

This is Postgres's most dangerous failure mode. Most developers don't know about it until they hit it in production at 3am.

Every transaction in Postgres gets an incrementing 32-bit transaction ID (XID). Postgres uses XID comparison to determine visibility — a tuple with xmin=100 is visible to transaction 200 because 100 is in its past. The 32-bit counter wraps after about 4 billion transactions, and because comparison is circular, each transaction can only "see" about 2 billion XIDs into the past. If an unfrozen tuple's xmin falls off that horizon, it suddenly appears to be in the future — invisible to every transaction. Unprevented, wraparound would make every old, unfrozen row in your database vanish from view.

XID space (circular):  0 ─────────────────── ~4.3 billion (2³²) ─── wrap ─── 0
Visibility horizon:    each XID "sees" at most ~2.1 billion (2³¹) XIDs into the past
                       ↑ Unfrozen tuples older than that look like they're in the future
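
The circular comparison can be sketched in Python — a toy mirror of the logic in Postgres's internal TransactionIdPrecedes (which computes the signed 32-bit difference), not the actual C code, and it ignores the reserved special XIDs:

```python
UINT32 = 2**32

def xid_precedes(a: int, b: int) -> bool:
    """True if XID a is 'in the past' of XID b, using circular 32-bit
    comparison: a precedes b iff the signed 32-bit difference (a - b) < 0."""
    diff = (a - b) % UINT32
    return diff >= 2**31  # signed interpretation of diff is negative

# Normal case: xid 100 is in the past of xid 200
assert xid_precedes(100, 200)

# Across the wrap: an XID near the top of the range precedes a small one
assert xid_precedes(4_294_967_290, 10)

# But anything more than ~2 billion XIDs "ago" flips to the future —
# this is exactly why old unfrozen tuples would become invisible
assert not xid_precedes(10, 10 + 2**31 + 1)
```

Freezing sidesteps the problem entirely: a frozen tuple is treated as infinitely old, so it never falls off the circular horizon.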

Postgres prevents this through transaction ID freezing. VACUUM periodically walks the heap and replaces old XID values with a special "frozen" marker — tuples with frozen XIDs are always visible to everyone, so they're safe from wraparound. The parameter autovacuum_freeze_max_age (default: 200 million) controls how old an XID can get before autovacuum is forced to freeze it, regardless of other autovacuum settings.

-- Monitor transaction ID age — the wraparound risk
SELECT
  relname,
  age(relfrozenxid) AS xid_age,
  pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;

-- Database-level age (highest risk)
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

Warning thresholds to monitor:

XID age > 150 million:       Start paying attention
XID age > 200 million:       Autovacuum forces aggressive anti-wraparound freezing
                             (the autovacuum_freeze_max_age default — may impact performance)
Within ~40M of the ~2.1B
limit:                       Postgres logs WARNINGs ("must be vacuumed within N transactions")
Within ~3M of the limit:     Postgres stops accepting new write transactions until vacuumed

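
To translate an XID age into wall-clock headroom, divide the remaining distance to a threshold by your transaction rate. A rough Python sketch — the table age and transaction rate below are hypothetical numbers, and real XID consumption varies with workload:

```python
def seconds_until(limit_age: int, current_age: int, txns_per_sec: int) -> float:
    """Rough headroom estimate: seconds until a table's XID age crosses
    a given limit, assuming a steady cluster-wide transaction rate."""
    return (limit_age - current_age) / txns_per_sec

# Hypothetical: table at XID age 150M, cluster doing 2,000 txns/sec,
# 200M forced-freeze line -> 25,000 seconds (~7 hours) of headroom
assert seconds_until(200_000_000, 150_000_000, 2_000) == 25_000.0
```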
-- Emergency: a table is approaching wraparound
-- Run VACUUM FREEZE to freeze all tuples regardless of other criteria
VACUUM FREEZE VERBOSE orders;

-- Check if autovacuum is keeping up with XID age across all tables
SELECT
  s.schemaname,
  s.relname,
  age(c.relfrozenxid) AS xid_age,
  s.last_autovacuum,
  s.last_vacuum
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
WHERE age(c.relfrozenxid) > 100000000  -- tables with XID age > 100M
ORDER BY age(c.relfrozenxid) DESC;

Tuning autovacuum for high-write tables

The global autovacuum settings are conservative defaults designed not to hammer your disk on small databases. For production write-heavy workloads, you tune per-table.

-- Key global settings (in postgresql.conf)
autovacuum_max_workers = 3          -- increase if you have many tables needing vacuum
autovacuum_vacuum_cost_delay = 2ms  -- I/O throttle between work chunks (default 2ms since PG12, 20ms before)
autovacuum_vacuum_cost_limit = 200  -- work units per delay cycle (increase to vacuum faster)

-- Per-table override for a high-churn events table
ALTER TABLE events SET (
  autovacuum_vacuum_scale_factor    = 0.01,  -- vacuum at 1% dead tuples
  autovacuum_vacuum_threshold       = 1000,   -- base count added on top of the 1%
  autovacuum_vacuum_cost_delay      = 5,      -- less I/O throttling = faster vacuum
  autovacuum_vacuum_cost_limit      = 1000,   -- more work per cycle
  autovacuum_freeze_max_age         = 100000000  -- freeze earlier
);

When autovacuum can't keep up

Signs autovacuum is falling behind:

  • n_dead_tup growing despite regular autovacuum runs
  • Autovacuum workers always busy in pg_stat_activity
  • Table size growing despite no net increase in live rows

-- See currently running autovacuum workers
SELECT
  pid,
  now() - pg_stat_activity.xact_start AS duration,
  query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%'
ORDER BY duration DESC;

-- Tables that autovacuum ran on recently
SELECT
  relname,
  last_autovacuum,
  last_autoanalyze,
  autovacuum_count,
  autoanalyze_count,
  n_dead_tup
FROM pg_stat_user_tables
WHERE last_autovacuum IS NOT NULL
ORDER BY last_autovacuum DESC
LIMIT 20;

If autovacuum is consistently behind, the solution is usually: more autovacuum workers (autovacuum_max_workers), less throttling (autovacuum_vacuum_cost_delay), or table-level tuning to trigger vacuum earlier before the backlog grows.

Manual VACUUM strategies for high-write tables

For tables receiving thousands of writes per second, scheduled manual VACUUMs during off-peak hours supplement autovacuum:

-- Vacuum and analyze together — reclaim dead space + update statistics
VACUUM ANALYZE events;

-- More aggressive: vacuum + freeze old XIDs + analyze
VACUUM (ANALYZE, FREEZE) events;

-- Parallel vacuum (Postgres 13+) — parallelizes the index-cleanup phase across workers
VACUUM (PARALLEL 4) events;

-- Monitor vacuum progress in real-time
SELECT
  relid::regclass AS table,
  phase,
  heap_blks_scanned,
  heap_blks_total,
  round(100.0 * heap_blks_scanned / NULLIF(heap_blks_total, 0), 1) AS pct_done
FROM pg_stat_progress_vacuum;

Buffer hit ratio: is your cache working?

Related to VACUUM performance: the pg_statio_* views track how often Postgres finds pages in shared_buffers versus having to read them from disk, and pg_stat_bgwriter shows how the background writer is coping with dirty pages. A healthy production database should have a buffer hit ratio above 95%.

-- Overall buffer hit ratio
SELECT
  sum(heap_blks_hit) AS heap_hit,
  sum(heap_blks_read) AS heap_read,
  round(
    100.0 * sum(heap_blks_hit) / NULLIF(sum(heap_blks_hit) + sum(heap_blks_read), 0),
    2
  ) AS hit_ratio
FROM pg_statio_user_tables;

-- Per-table hit ratio — find tables causing most disk I/O
SELECT
  relname,
  heap_blks_hit,
  heap_blks_read,
  round(100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS hit_ratio
FROM pg_statio_user_tables
WHERE heap_blks_hit + heap_blks_read > 0
ORDER BY heap_blks_read DESC
LIMIT 20;

-- Background writer: how often Postgres has to flush dirty pages
-- (columns as shown exist through PG16; PG17 moves some of these
--  into pg_stat_io and pg_stat_checkpointer)
SELECT
  buffers_clean,
  maxwritten_clean,  -- if this is high, bgwriter is struggling to keep up
  buffers_backend,   -- writes done by backend processes (bad — means bgwriter too slow)
  buffers_checkpoint
FROM pg_stat_bgwriter;

A low hit ratio means your working set doesn't fit in shared_buffers. The fix is usually increasing shared_buffers (common recommendation: 25% of available RAM) or reducing table bloat so the live data footprint shrinks.
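
The ratio arithmetic is worth internalizing: a seemingly small drop in hit ratio is a large multiple in disk I/O. A quick Python sketch of the same formula as the SQL above:

```python
def hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Buffer cache hit ratio as a percentage, with the same
    divide-by-zero guard as the NULLIF in the SQL version."""
    total = blks_hit + blks_read
    return round(100.0 * blks_hit / total, 2) if total else 0.0

assert hit_ratio(95, 5) == 95.0

# Going from a 98% to a 90% hit ratio means the miss rate went
# from 2% to 10% — five times more disk reads per page access:
assert (100 - 90) / (100 - 98) == 5.0
```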

Quiz

Q1. A table receives 5,000 UPDATEs per second. After 10 minutes autovacuum hasn't triggered. Why, and how do you fix it?

Default autovacuum threshold is 20% of live rows + 50. If the table has 10 million live rows, autovacuum won't trigger until there are 2,000,050 dead tuples. At 5,000 updates/second that takes about 6.7 minutes, so after 10 minutes (≈3 million dead tuples) it should have fired. If it still hasn't, the likely causes are that all autovacuum workers are busy with other tables, or a long-running transaction is preventing cleanup.

Fix: Lower the per-table scale factor: ALTER TABLE t SET (autovacuum_vacuum_scale_factor = 0.01); This triggers autovacuum at 1% dead tuples (100,000 for a 10M row table) instead of waiting for 20%.

Q2. You run VACUUM FULL on a 200GB production table. What happens to your application during that time?

Complete outage for all queries touching that table. VACUUM FULL acquires an ACCESS EXCLUSIVE lock, which blocks every other operation: SELECTs, INSERTs, UPDATEs, DELETEs. On a 200GB table this could take 30–90 minutes depending on I/O speed.

Production-safe alternative: Use pg_repack, which rebuilds the table in the background and swaps it in with a brief lock at the end. Or use regular VACUUM if the goal is just reclaiming dead tuple space (not returning disk to OS).

Q3. A table's XID age is 190 million. What does that mean and what should you do right now?

The table's oldest unfrozen transaction is 190 million transactions old. Postgres's wraparound limit is ~2.1 billion, but autovacuum starts forcing anti-wraparound freezes at autovacuum_freeze_max_age (default 200 million). You're 10 million transactions from that line — aggressive freezing is about to kick in, with the performance impact that brings.

Immediate action: Run VACUUM FREEZE tablename; manually to freeze the old XIDs. Then investigate why autovacuum didn't freeze earlier — was the table locked by a long-running transaction? Was autovacuum disabled? Are there enough autovacuum workers? Lower autovacuum_freeze_max_age to give yourself more headroom in the future.

Q4. Your buffer hit ratio dropped from 98% to 82% after a new feature launched. What's the most likely cause?

The new feature is querying (or bloating) a large table that doesn't fit in shared_buffers. An 82% hit ratio means 18% of page reads are going to disk — significant for a database that was at 98%. Common causes:

  • New queries doing full table scans on large tables (evicting useful pages from cache)
  • A new write-heavy pattern generating lots of dead tuples, inflating table size beyond cache capacity
  • A large new table or index being created that fills shared_buffers

Check pg_statio_user_tables per-table to find which table has the most disk reads. Then determine if it's a scan pattern issue (missing index), a bloat issue (VACUUM needed), or a cache size issue (increase shared_buffers or add memory).

Part 9 done. This completes the core Postgres Internals series — from how tables and types work in Part 1 all the way through MVCC, query planning, and vacuum. Check out the full Postgres series to revisit any part.
