Staff Prep 21: Postgres MVCC — Multi-Version Concurrency Under the Hood
Back to Part 20: API Design at Scale. MVCC (Multi-Version Concurrency Control) is Postgres's most important concurrency mechanism. It is why readers and writers do not block each other. It is also why VACUUM exists and why dead tuples accumulate. Understanding MVCC at the implementation level is what separates a Senior from a Staff Postgres engineer.
What MVCC is and why it exists
Traditional locking systems block reads when a write is in progress. Postgres avoids this entirely through versioning: instead of updating a row in place, Postgres creates a new version of the row and marks the old version as dead. Readers see the version that was current when their transaction started — regardless of any concurrent writes.
Result: reads never block writes, writes never block reads. This is the foundation of Postgres's concurrency performance.
Xmin, xmax: the visibility system
-- Every row has hidden system columns:
-- xmin: transaction ID that created this row version
-- xmax: transaction ID that deleted/updated this row version (0 = still live)
-- ctid: physical location (block, offset) of this row version
SELECT xmin, xmax, ctid, id, name
FROM users WHERE id = 42;
-- Result:
-- xmin=1024, xmax=0, ctid=(5,3), id=42, name="Alice"
-- xmin=1024: created by transaction 1024
-- xmax=0: no transaction has deleted it (still live)
-- After an UPDATE:
UPDATE users SET name = 'Alicia' WHERE id = 42;
-- Old row: xmin=1024, xmax=1025, ctid=(5,3) ← marked dead by txn 1025
-- New row: xmin=1025, xmax=0, ctid=(7,1) ← new version at new physical location
-- Both rows exist in the heap simultaneously
-- A transaction with snapshot before 1025 sees the old row
-- A transaction with snapshot after 1025 commits sees the new row
Transaction snapshots
-- What does a transaction's "snapshot" contain?
-- xmin: lowest active transaction ID at snapshot time
-- xmax: highest transaction ID at snapshot time + 1
-- xip_list: list of active (in-progress) transaction IDs at snapshot time
-- Visibility rule for a row version (simplified):
-- A row is visible to my snapshot if:
-- row.xmin < my_snapshot.xmax (row was created before my snapshot)
-- AND row.xmin NOT IN my_snapshot.xip_list (creator was committed, not in-progress)
-- AND (row.xmax = 0 (row is not deleted)
-- OR row.xmax > my_snapshot.xmax (deleter started after my snapshot)
-- OR row.xmax IN my_snapshot.xip_list) (deleter not yet committed)
-- This is why READ COMMITTED gets a fresh snapshot per statement
-- REPEATABLE READ gets a snapshot once at transaction start and holds it
Dead tuples and their impact
-- Dead tuples accumulate from UPDATEs and DELETEs
-- Old row versions are kept until VACUUM reclaims them
-- Monitor dead tuple accumulation
SELECT
schemaname,
tablename,
n_live_tup,
n_dead_tup,
ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
-- A table with 30% dead tuples is reading 43% more data than necessary
-- Queries scan dead tuples before discarding them (unless heap_only_tuple optimization applies)
-- Force VACUUM when bloat is high
VACUUM ANALYZE orders;
-- VACUUM FULL: reclaims space and compacts, but takes AccessExclusiveLock
-- Use pg_repack instead (FULL without lock):
-- pg_repack --table orders mydb
Transaction ID wraparound: the catastrophic failure mode
-- Transaction IDs are 32-bit unsigned integers
-- Maximum: ~2 billion (2^32 / 2 = ~2.1 billion, with wraparound protection)
-- If VACUUM does not freeze old transactions, the DB will go read-only to prevent data loss
-- Check for wraparound risk
SELECT
datname,
age(datfrozenxid) AS age,
2000000000 - age(datfrozenxid) AS txns_until_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
-- If age > 1.5 billion: URGENT — run VACUUM FREEZE immediately
-- If age > 2 billion: Postgres will shut down with "database is not accepting commands to avoid wraparound"
-- Freeze old transactions proactively
VACUUM FREEZE users;
-- Configure autovacuum to freeze aggressively
ALTER TABLE high_volume_table SET (
autovacuum_freeze_max_age = 500000000 -- freeze at 500M (default: 200M)
);
HOT updates: MVCC optimisation
-- HOT = Heap-Only Tuple: an optimisation when:
-- 1. The updated columns are NOT indexed
-- 2. The new row version fits on the same page as the old one
-- With HOT: no index entries are updated (cheaper write, faster vacuum)
-- Without HOT: every index must be updated for every row update
-- Check HOT ratio
SELECT
relname,
n_tup_upd,
n_tup_hot_upd,
ROUND(100.0 * n_tup_hot_upd / NULLIF(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
-- Low HOT ratio (< 50%): check if you are updating indexed columns unnecessarily
-- Common fix: use separate columns for mutable and immutable data
-- Or increase fillfactor to leave room for HOT updates on the same page:
ALTER TABLE users SET (fillfactor = 70); -- leave 30% free space per page for HOT updates
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Why do readers never block writers in Postgres? What mechanism allows this?
- You run
UPDATE users SET name = 'Bob' WHERE id = 1. What physically happens in the heap? How many row versions exist after this command? - What are dead tuples? How do they accumulate? What happens if VACUUM never runs?
- Explain transaction ID wraparound in one paragraph. What is the failure mode if it is not managed?
- What is a HOT update and when does it apply? How do you increase your HOT ratio?
Next up — Part 22: Partitioning & Sharding. Range, list, hash partitioning, partition pruning, and when to actually shard.