Staff Prep 07: Zero-Downtime Postgres Migrations
Back to Part 06: Transactions & Concurrency.
Schema migrations are the riskiest operation in a production database lifecycle. An ALTER TABLE
that takes a full table lock will queue every incoming query behind it. On a 200M-row table, that lock
can hold for 5–30 minutes. That is an outage. The expand-contract pattern lets you evolve schemas without
ever taking a lock your users will notice.
Why migrations are dangerous
Postgres uses AccessExclusiveLock for most DDL operations. This lock blocks all concurrent reads and writes. While the migration runs, every query waits in a queue. If the migration takes 10 minutes, you get 10 minutes of errors.
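The pile-up is visible live from another session. A diagnostic sketch using the standard catalog views (Postgres 9.6+):

```sql
-- Who is waiting on a lock, and which backends are blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```

If a stuck ALTER TABLE sits at the head of that queue, cancelling it with pg_cancel_backend releases everything queued behind it.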
Operations that take AccessExclusiveLock or otherwise block writes (dangerous on large tables):
- ALTER TABLE ... ADD COLUMN ... DEFAULT ... — full table rewrite before Postgres 11, or with a volatile default on any version
- ALTER TABLE ... ALTER COLUMN ... TYPE — rewrites the table unless the conversion is binary-compatible
- ALTER TABLE ... ADD CONSTRAINT ... CHECK — scans every row while holding the lock
- CREATE INDEX (without CONCURRENTLY) — takes ShareLock: reads continue, but writes block for the whole build
- VACUUM FULL — rewrites the entire table under AccessExclusiveLock
Operations that are safe (no full table lock):
- ALTER TABLE ADD COLUMN — nullable, or with a constant default in Postgres 11+ (metadata-only)
- CREATE INDEX CONCURRENTLY
- ALTER TABLE ADD CONSTRAINT ... CHECK ... NOT VALID — enforces the check on new rows only
- ALTER TABLE VALIDATE CONSTRAINT — takes ShareUpdateExclusiveLock, so reads and writes continue
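Even the "safe" operations briefly take AccessExclusiveLock to update the catalog, and that brief lock request still queues behind any long-running transaction — with all other traffic queueing behind it. A common guard (timeout values illustrative) makes the DDL fail fast instead of waiting:

```sql
-- Give up on the lock after 2s rather than stalling all traffic behind us
SET lock_timeout = '2s';
SET statement_timeout = '30s';
ALTER TABLE users ADD COLUMN verified BOOLEAN;
-- On timeout the migration errors out cleanly; retry at a quieter moment.
```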
The expand-contract pattern
Every schema change can be decomposed into three phases: Expand (add new structure alongside old), Migrate (backfill data, update application code), and Contract (remove old structure). Each phase is independently deployable.
Adding a NOT NULL column safely
This is the most common dangerous migration. The naïve approach:
-- DANGEROUS on Postgres 10 and earlier (or with a volatile default on any version):
ALTER TABLE users ADD COLUMN verified BOOLEAN NOT NULL DEFAULT false;
-- Pre-11: full table rewrite while holding AccessExclusiveLock.
-- On 200M rows: minutes of blocked queries = potential outage.
-- (Postgres 11+ stores a constant default without a rewrite, but the
-- step-by-step approach below works on any version.)
The safe approach (4 steps, each safe):
-- Step 1: Add nullable column (safe — metadata-only change on any version)
ALTER TABLE users ADD COLUMN verified BOOLEAN;
-- No table rewrite; the brief catalog lock lasts milliseconds.
-- Step 2: Set server-side default (Postgres 11+: stored default, no rewrite)
ALTER TABLE users ALTER COLUMN verified SET DEFAULT false;
-- New rows get false automatically. Existing rows still NULL.
-- Step 3: Backfill in batches (avoid one massive UPDATE)
-- Run outside an explicit transaction: Postgres 11+ allows COMMIT inside DO,
-- which releases row locks between batches instead of holding one giant transaction.
DO $$
DECLARE
    batch_size INT := 10000;
    updated INT;
BEGIN
    LOOP
        UPDATE users
        SET verified = false
        WHERE id IN (
            SELECT id FROM users WHERE verified IS NULL LIMIT batch_size
        );
        GET DIAGNOSTICS updated = ROW_COUNT;
        EXIT WHEN updated = 0;
        COMMIT;                -- keep each batch its own short transaction
        PERFORM pg_sleep(0.1); -- brief pause between batches
    END LOOP;
END;
$$;
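On a 200M-row table the batched backfill can run for hours; progress is easy to watch from another session. A simple check against the example column:

```sql
-- Rows still awaiting backfill (sequential scan — run occasionally, not in a loop)
SELECT count(*) AS remaining FROM users WHERE verified IS NULL;
```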
-- Step 4: Add NOT NULL constraint (validate without full table lock)
ALTER TABLE users
ADD CONSTRAINT chk_verified_not_null CHECK (verified IS NOT NULL) NOT VALID;
-- NOT VALID: only checks new/updated rows, does not scan existing rows
-- Later (read-only period or low traffic):
ALTER TABLE users VALIDATE CONSTRAINT chk_verified_not_null;
-- Takes ShareUpdateExclusiveLock — allows concurrent reads and writes
-- Finally: add the real NOT NULL, then drop the now-redundant check.
-- Postgres 12+ skips the table scan here because the validated check
-- constraint already proves no NULLs exist.
ALTER TABLE users ALTER COLUMN verified SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT chk_verified_not_null;
Renaming a column safely
Column renames break every query that references the old name. The safe approach uses a two-version deployment cycle.
-- Goal: rename user_name to full_name
-- Phase 1 (Expand): Add new column, keep old
ALTER TABLE users ADD COLUMN full_name TEXT;
-- Sync via trigger: a write to either column propagates to the other.
-- (Plain COALESCE in both directions goes stale on UPDATEs once both
-- columns are populated, so compare against OLD for updates.)
CREATE OR REPLACE FUNCTION sync_user_name()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        NEW.full_name := COALESCE(NEW.full_name, NEW.user_name);
        NEW.user_name := COALESCE(NEW.user_name, NEW.full_name);
    ELSIF NEW.user_name IS DISTINCT FROM OLD.user_name THEN
        NEW.full_name := NEW.user_name;  -- old code path wrote the old column
    ELSIF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
        NEW.user_name := NEW.full_name;  -- new code path wrote the new column
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_user_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_user_name();
-- Backfill (batch this with the earlier DO-loop pattern on large tables)
UPDATE users SET full_name = user_name WHERE full_name IS NULL;
-- Phase 2: Deploy new application code that writes to full_name
-- Run dual-write for one release cycle
-- Phase 3 (Contract): Remove old column and trigger
DROP TRIGGER trg_sync_user_name ON users;
DROP FUNCTION sync_user_name();
ALTER TABLE users DROP COLUMN user_name;
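During the dual-write window, readers must tolerate rows where only one of the two columns is populated. A transitional read that works in either state, mirroring the COALESCE logic in the trigger:

```sql
-- Works for rows written by old code, new code, or the backfill
SELECT COALESCE(full_name, user_name) AS full_name
FROM users
WHERE id = 42;
```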
Changing a column type safely
-- Goal: change user_id from INT to BIGINT (table has 500M rows)
-- Phase 1: Add new column
ALTER TABLE events ADD COLUMN user_id_new BIGINT;
-- Phase 2: Backfill in batches (shown as one UPDATE for brevity —
-- on 500M rows, use the DO-loop batch pattern from earlier)
UPDATE events SET user_id_new = user_id::BIGINT WHERE user_id_new IS NULL;
-- Phase 3: Trigger to keep in sync during transition
CREATE OR REPLACE FUNCTION sync_user_id()
RETURNS TRIGGER AS $$
BEGIN
NEW.user_id_new := NEW.user_id::BIGINT;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_sync_user_id
BEFORE INSERT OR UPDATE ON events
FOR EACH ROW EXECUTE FUNCTION sync_user_id();
-- Phase 4: Deploy app reading from user_id_new
-- Phase 5: Contract — swap the columns atomically, then clean up
BEGIN;
DROP TRIGGER trg_sync_user_id ON events;
DROP FUNCTION sync_user_id();
ALTER TABLE events RENAME COLUMN user_id TO user_id_old;
ALTER TABLE events RENAME COLUMN user_id_new TO user_id;
COMMIT;
ALTER TABLE events DROP COLUMN user_id_old; -- metadata-only; vacuum reclaims the space later
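One detail the contract phase glosses over: indexes on the old column are dropped along with it, so an equivalent index must already exist on the new column before cutover — built concurrently like any production index (index name assumed):

```sql
-- Index the BIGINT column before the swap so queries stay indexed afterwards
CREATE INDEX CONCURRENTLY idx_events_user_id_bigint ON events (user_id_new);
```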
Index changes without downtime
-- Always use CONCURRENTLY for new indexes in production
CREATE INDEX CONCURRENTLY idx_events_user_id_new ON events (user_id);
-- Drop and recreate to change an index definition
-- Step 1: Create new index concurrently (old one still serves queries)
CREATE INDEX CONCURRENTLY idx_events_v2 ON events (user_id, created_at DESC);
-- Step 2: Drop the old index without blocking queries (CONCURRENTLY waits
-- for in-flight transactions instead of taking a blocking lock)
DROP INDEX CONCURRENTLY idx_events_v1;
-- For UNIQUE indexes: same pattern, but use CREATE UNIQUE INDEX CONCURRENTLY
-- Then add the constraint using the existing index:
ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE USING INDEX idx_users_email_unique;
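If CREATE INDEX CONCURRENTLY fails partway (deadlock, unique violation, cancelled backend), it leaves behind an INVALID index that still adds write overhead but serves no queries. Checking for and cleaning up such leftovers:

```sql
-- Find indexes left INVALID by a failed CREATE INDEX CONCURRENTLY
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- Then: DROP INDEX CONCURRENTLY <index_name>; and retry the build.
```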
Quiz: test your understanding
Before moving on, answer these in your head (or out loud):
- Why does ALTER TABLE ADD COLUMN ... NOT NULL DEFAULT 'x' cause a full table rewrite in Postgres 10 but not in Postgres 11+? What changed?
- Walk through all steps to safely add a tenant_id BIGINT NOT NULL column to a 300M-row table in production. How long does each step take?
- What is ADD CONSTRAINT ... NOT VALID and when would you use it? What does VALIDATE CONSTRAINT do differently?
- You need to rename a column that 15 microservices query. How do you do this without downtime, coordinating across multiple service deployments?
- What lock does CREATE INDEX CONCURRENTLY take vs CREATE INDEX? What can go wrong with CONCURRENTLY and what do you check if it fails?
Next up — Part 08: FastAPI Request Lifecycle. From TCP accept to response: Starlette internals, middleware ordering, lifespan events, and dependency injection.