How a NOT NULL Column Migration Locked Our Users Table for 14 Minutes
It was 10:03 AM on a Monday when every status monitor in our dashboard turned red simultaneously. Not a partial outage. Not a single service. Everything — the API, the dashboard, the mobile backend — all down at once. The Slack pager-duty channel exploded. The timing was not a coincidence. We had just deployed a "routine" schema migration two minutes earlier.
The migration looked completely harmless in code review. One line: adding a column with a safe default value. What followed was 14 minutes of a full exclusive lock on our 2.4-million-row users table — and a lesson about Postgres internals that I will carry for the rest of my career.
Production Failure
We were running a multi-tenant SaaS platform. The users table had 2.4 million rows across
all tenants — not enormous by data warehouse standards, but a core operational table hit on nearly
every authenticated API request. Our deployment pipeline ran database migrations as the first step,
before rolling out new application code. That ordering decision, combined with what seemed like a
trivial column addition, caused a complete service blackout.
The migration was part of a feature to track how users had been onboarded — an onboarding_source
column to distinguish users who signed up via organic search, a referral link, or a sales-assisted demo.
The column needed a sensible default for existing rows.
-- This migration ran fine in staging (50,000 rows).
-- In production (2.4M rows), it acquired an exclusive lock for 14 minutes.
ALTER TABLE users
ADD COLUMN onboarding_source VARCHAR(64) NOT NULL DEFAULT 'organic';
In our staging environment — 50,000 rows — this migration ran in 1.2 seconds. In production, with 2.4 million rows, it ran for 14 minutes and held an exclusive lock on the entire table for the full duration. No reads. No writes. Nothing.
False Assumptions
Our first instinct was to blame the application deployment, not the migration. The new code had shipped simultaneously. We immediately rolled back the application tier — new pods swapped for old ones within 3 minutes. The outage continued. Whatever was wrong was not in the application layer.
The second theory was a connection pool exhaustion cascade. We'd seen this before: a slow operation holds connections open, the pool fills, new connections queue, latency spikes, and the system seizes. We checked PgBouncer metrics: 400 connections waiting. That matched the symptom, but it was the symptom, not the cause. Something was holding those connections, not the other way around.
It took us seven minutes to think to check what was actually running inside Postgres:
SELECT pid, now() - query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- pid | duration | state | query
-- -------+-------------+---------+------------------------------------------
-- 48291 | 00:03:42 | active | ALTER TABLE users ADD COLUMN onboarding_source...
-- 48302 | 00:03:42 | active | SELECT id, email FROM users WHERE... (waiting)
-- 48317 | 00:03:41 | active | UPDATE users SET last_login_at = ... (waiting)
-- 48331 | 00:03:41 | active | SELECT * FROM users WHERE tenant_id = ... (waiting)
-- ... (394 more waiting queries)
There it was. PID 48291 had been running for 3 minutes and 42 seconds. The ALTER TABLE
had an exclusive lock on users. Every other query — SELECTs, UPDATEs, anything touching
the table — was queued behind it. We had 397 queries piled up and waiting.
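In hindsight, we could have identified the blocker in one query instead of eyeballing durations. Postgres ships pg_blocking_pids(), which returns the PIDs a given backend is waiting on. A sketch of the diagnostic we now keep bookmarked:

```sql
-- For every backend that is waiting on a lock, show which PID(s) block it.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

Every waiting row in our incident would have pointed straight at PID 48291.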
Root Cause: What Postgres Does With NOT NULL + DEFAULT
The root cause was a fundamental misunderstanding of how Postgres handles schema changes internally — one that is extremely easy to get wrong even with years of Postgres experience.
In Postgres versions prior to 11, adding any column with a DEFAULT value required
a full table rewrite. Postgres had to scan every existing row and physically write the new column's
value into each one. For a table with 2.4 million rows and variable-width VARCHAR data,
that meant re-serializing 2.4 million tuples on disk — all while holding an ACCESS EXCLUSIVE
lock that blocks every read and write to the table.
Postgres 11 introduced a major optimization: adding a column with a non-volatile DEFAULT
(a constant value, not NOW() or a function) no longer requires a table rewrite. Postgres
stores the default in the catalog and serves it virtually for old rows, only writing the value on
future row updates. We knew this. We were on Postgres 14. So what went wrong?
The NOT NULL constraint.
The behavior breaks down like this: in Postgres 14, adding a column with DEFAULT alone is safe
and fast. Adding NOT NULL DEFAULT additionally requires Postgres to guarantee that no existing
row can ever surface a NULL in the new column — and for our VARCHAR(64) NOT NULL DEFAULT, that
guarantee sent the ALTER down the legacy full-rewrite path. Postgres rewrote the entire table.
POSTGRES ALTER TABLE — FAST PATH vs. SLOW PATH
─────────────────────────────────────────────────────────────────────
ADD COLUMN col VARCHAR(64) DEFAULT 'organic' ← Postgres 11+: FAST ✓
───────────────────────────────────────────────
No table rewrite. Default stored in pg_attrdef.
Old rows serve the default value virtually.
Lock duration: milliseconds.
─────────────────────────────────────────────────────────────────────
ADD COLUMN col VARCHAR(64) NOT NULL DEFAULT 'organic' ← SLOW PATH ✗
──────────────────────────────────────────────────────
Postgres must validate NOT NULL across all existing rows.
Triggers full table rewrite to physically embed the default.
Acquires ACCESS EXCLUSIVE lock for entire duration.
ACCESS EXCLUSIVE lock blocks:
✗ SELECT ✗ INSERT ✗ UPDATE
✗ DELETE ✗ VACUUM ✗ any concurrent access
2.4M rows × rewrite overhead = 14 minutes of total outage.
─────────────────────────────────────────────────────────────────────
The irony was cruel: we added NOT NULL to enforce data integrity. That exact constraint
turned a 1-second migration into a 14-minute outage.
The staging environment never surfaced this because with 50,000 rows, the rewrite completed in 1.2 seconds — fast enough that we never noticed the lock was even held. The behavior was identical. The scale was not.
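Incidentally, the Postgres 11+ fast path is visible in the system catalogs: a column added with a constant default gets atthasmissing = true in pg_attribute, and attmissingval holds the value served virtually for old rows. A quick way to confirm a column took the fast path (a sketch; the catalog columns are real, the table and column names are our example's):

```sql
-- atthasmissing = true means old rows were never physically rewritten;
-- attmissingval is the value returned for them at read time.
SELECT attname, atthasmissing, attmissingval
FROM pg_attribute
WHERE attrelid = 'users'::regclass
  AND attname = 'onboarding_source';
```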
Architecture Fix: The Safe Zero-Downtime Migration Pattern
Fixing this required rethinking how we write any schema migration that touches a large, live table. The correct pattern is to decompose a single dangerous operation into three safe steps deployed across separate releases.
SAFE MIGRATION: 3-STEP ZERO-DOWNTIME PATTERN
─────────────────────────────────────────────────────────────────────
Step 1 — Deploy migration: ADD COLUMN nullable, no constraint
─────────────────────────────────────────────────────────────
ALTER TABLE users ADD COLUMN onboarding_source VARCHAR(64);
-- Fast path. No rewrite. Milliseconds.
-- Existing rows: NULL (allowed for now)
-- New rows: application code writes the value explicitly
─────────────────────────────────────────────────────────────────────
Step 2 — Backfill existing rows in batches (background job)
─────────────────────────────────────────────────────────────
UPDATE users
SET onboarding_source = 'organic'
WHERE onboarding_source IS NULL
AND id BETWEEN ${batchStart} AND ${batchEnd};
-- No table lock. Small batches. Runs offline at low priority.
-- Takes minutes, not seconds — but table stays fully live.
─────────────────────────────────────────────────────────────────────
Step 3 — Deploy migration: add NOT NULL + default after backfill
─────────────────────────────────────────────────────────────────
ALTER TABLE users
ALTER COLUMN onboarding_source SET DEFAULT 'organic',
ALTER COLUMN onboarding_source SET NOT NULL;
-- Postgres 12+: NOT NULL validated via constraint, not rewrite
-- Only safe once zero NULLs remain in the column
─────────────────────────────────────────────────────────────────────
This pattern spreads the work across three deployments and a background job. The table is never locked for more than a few milliseconds at any point. Users never see any interruption.
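For tables much larger than ours, even the Step 3 validation scan can be kept off the ACCESS EXCLUSIVE path. Since Postgres 12, SET NOT NULL skips its full-table scan when a validated CHECK constraint already proves the column is non-null, and that constraint can be validated while reads and writes continue. A sketch of the variant we would reach for today (the constraint name is illustrative):

```sql
-- 3a. Add the constraint unvalidated: only a brief lock, no scan.
ALTER TABLE users
  ADD CONSTRAINT users_onboarding_source_not_null
  CHECK (onboarding_source IS NOT NULL) NOT VALID;

-- 3b. Validate it: scans the table, but holds only a
--     SHARE UPDATE EXCLUSIVE lock, so traffic continues.
ALTER TABLE users
  VALIDATE CONSTRAINT users_onboarding_source_not_null;

-- 3c. SET NOT NULL sees the validated constraint and skips its scan.
ALTER TABLE users ALTER COLUMN onboarding_source SET NOT NULL;

-- 3d. Drop the now-redundant CHECK constraint.
ALTER TABLE users DROP CONSTRAINT users_onboarding_source_not_null;
```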
For the backfill step, we wrote a simple script that processes rows in chunks of 10,000, sleeping 50ms between batches to avoid overwhelming the primary:
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const BATCH_SIZE = 10_000;
const SLEEP_MS = 50;

async function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function backfill() {
  const { rows: [{ max }] } = await pool.query('SELECT MAX(id) FROM users');
  // bigint columns come back from pg as strings; an empty table yields NULL
  const maxId = Number(max ?? 0);
  let batchStart = 1;
  let totalUpdated = 0;

  while (batchStart <= maxId) {
    const batchEnd = batchStart + BATCH_SIZE - 1;
    const result = await pool.query(
      `UPDATE users
       SET onboarding_source = 'organic'
       WHERE onboarding_source IS NULL
         AND id BETWEEN $1 AND $2`,
      [batchStart, batchEnd]
    );
    totalUpdated += result.rowCount ?? 0;
    console.log(`Backfilled up to id=${batchEnd} | total=${totalUpdated}`); // eslint-disable-line
    batchStart = batchEnd + 1;
    await sleep(SLEEP_MS); // Yield time to avoid replica lag buildup
  }

  console.log(`Backfill complete. ${totalUpdated} rows updated.`);
  await pool.end();
}

backfill().catch(console.error);
The backfill script ran across 2.4 million rows in approximately 18 minutes — longer than the
original migration would have taken — but the users table stayed fully accessible the
entire time. We ran it at 2 AM on a low-traffic night and verified zero NULL rows before deploying
Step 3. The final NOT NULL constraint addition completed in under 200 milliseconds.
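The zero-NULL verification before Step 3 was a single query; any non-zero count means the backfill is not done and the NOT NULL migration must not run:

```sql
-- Must return 0 before Step 3 is allowed to deploy.
SELECT count(*) FROM users WHERE onboarding_source IS NULL;
```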
What We Changed After This
The incident led to three durable changes in how we manage database schemas:
- Migration review checklist, enforced in CI. Every migration touching a table with more than 100,000 rows must pass a review gate. The checklist explicitly flags: any ADD COLUMN ... NOT NULL, ADD CONSTRAINT, column type changes, and index creation without CONCURRENTLY. Pull requests containing flagged patterns require a sign-off from a second engineer who has read the checklist.
- Production row counts in migration review context. We added a GitHub Actions step that fetches live table sizes from a read replica and posts them as a PR comment alongside any migration file changes. Reviewers no longer have to remember that the users table is 10× larger in production than in staging.
- lock_timeout on all migration runs. We now set lock_timeout = '5s' and statement_timeout = '30s' as session-level settings before any production migration. If a migration cannot acquire its lock within 5 seconds (because traffic is ongoing), it fails fast instead of queueing behind live traffic and compounding the problem. A failed migration is infinitely better than a silent 14-minute lock.
-- Applied before every production migration
SET lock_timeout = '5s'; -- Fail fast if we can't get the lock
SET statement_timeout = '30s'; -- Kill runaway migrations before they spiral
SET idle_in_transaction_session_timeout = '10s';
-- The actual migration runs here
\i 20260315_add_onboarding_source_step1.sql
RESET lock_timeout;
RESET statement_timeout;
RESET idle_in_transaction_session_timeout;
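The CI checklist gate is, at its core, a pattern scan over migration files. A minimal sketch in TypeScript — the function name and pattern list are illustrative, not our actual tooling:

```typescript
// Patterns our review checklist flags in migration SQL.
const DANGEROUS_PATTERNS: Array<{ name: string; regex: RegExp }> = [
  { name: 'ADD COLUMN ... NOT NULL', regex: /ADD\s+COLUMN[\s\S]*?NOT\s+NULL/i },
  { name: 'ADD CONSTRAINT', regex: /ADD\s+CONSTRAINT/i },
  { name: 'ALTER COLUMN ... TYPE', regex: /ALTER\s+COLUMN\s+\w+\s+(SET\s+DATA\s+)?TYPE/i },
  { name: 'CREATE INDEX without CONCURRENTLY', regex: /CREATE\s+(UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY)/i },
];

// Returns the name of every flagged pattern found in a migration file's SQL.
function flagDangerousPatterns(sql: string): string[] {
  return DANGEROUS_PATTERNS.filter(p => p.regex.test(sql)).map(p => p.name);
}
```

A migration containing ADD COLUMN ... NOT NULL DEFAULT gets flagged and routed to a second reviewer; a plain nullable ADD COLUMN passes clean. A production version should parse statements rather than regex-match, since SQL comments and string literals can trigger false positives.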
Lessons Learned
- NOT NULL + DEFAULT is not the same as DEFAULT alone. Postgres 11+ optimized adding a nullable column with a default. It did not fully optimize the NOT NULL case. Always decompose: add nullable, backfill, then constrain.
- Staging row counts are fiction. A migration that takes 1 second on 50,000 rows will not simply take 48 seconds on 2.4 million rows — it may hit an entirely different code path with entirely different locking and I/O characteristics. Always benchmark migrations against a production-sized snapshot.
- Add indexes CONCURRENTLY, always. CREATE INDEX blocks every write to the table for the full duration of the build. CREATE INDEX CONCURRENTLY does not. There is no operational reason to create an index non-concurrently on a live table.
- lock_timeout is your circuit breaker. Setting lock_timeout before migrations ensures that if your migration cannot acquire its lock immediately, it aborts cleanly rather than queueing behind 400 live connections and making the outage worse.
- Run migrations separately from code deployments. We now treat schema migrations as independent deploys with their own approval step, not as a pre-flight step bundled into the application rollout. This decoupling means a migration problem never takes down application code that was working fine before.
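The CONCURRENTLY lesson carries one caveat worth knowing: a concurrent build that fails (or is cancelled) leaves behind an invalid index that must be dropped and retried. A sketch, with a hypothetical index name:

```sql
-- Builds without blocking writes. Cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_users_onboarding_source
  ON users (onboarding_source);

-- An interrupted concurrent build leaves the index marked invalid.
-- Find leftovers, then drop and retry:
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
-- DROP INDEX CONCURRENTLY idx_users_onboarding_source;
```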
The outage lasted 14 minutes. Incident review lasted two hours. The checklist and CI changes took
a day to implement. We've shipped dozens of schema migrations since — including several on tables
larger than users — and none of them have required so much as a maintenance window.
One SQL keyword cost us a Monday morning. NOT NULL. The lesson isn't that constraints
are bad — it's that adding a constraint to 2.4 million rows of existing data is not a schema change,
it's a data migration. Treat it like one.