How a NOT NULL Column Migration Locked Our Users Table for 14 Minutes
March 15, 2026 · Database · 10 min read


It was 10:03 AM on a Monday when every status monitor in our dashboard turned red simultaneously. Not a partial outage. Not a single service. Everything — the API, the dashboard, the mobile backend — all down at once. The Slack pager-duty channel exploded. The timing was not a coincidence. We had just deployed a "routine" schema migration two minutes earlier.

The migration looked completely harmless in code review. One line. Adding a column. Adding a column with a safe default value. What followed was 14 minutes of a full exclusive lock on our 2.4-million-row users table — and a lesson about Postgres internals that I will carry for the rest of my career.


Production Failure

We were running a multi-tenant SaaS platform. The users table had 2.4 million rows across all tenants — not enormous by data warehouse standards, but a core operational table hit on nearly every authenticated API request. Our deployment pipeline ran database migrations as the first step, before rolling out new application code. That ordering decision, combined with what seemed like a trivial column addition, caused a complete service blackout.

The migration was part of a feature to track how users had been onboarded — an onboarding_source column to distinguish users who signed up via organic search, a referral link, or a sales-assisted demo. The column needed a sensible default for existing rows.

migrations/20260315_add_onboarding_source.sql (the offender)
-- This migration ran fine in staging (50,000 rows).
-- In production (2.4M rows), it acquired an exclusive lock for 14 minutes.

ALTER TABLE users
  ADD COLUMN onboarding_source VARCHAR(64) NOT NULL DEFAULT 'organic';

In our staging environment — 50,000 rows — this migration ran in 1.2 seconds. In production, with 2.4 million rows, it ran for 14 minutes and held an exclusive lock on the entire table for the full duration. No reads. No writes. Nothing.

  • 14 min: full table lock duration
  • 2.4M: rows rewritten
  • 100%: API error rate
  • 1.2 s: same migration in staging

False Assumptions

Our first instinct was to blame the application deployment, not the migration. The new code had shipped simultaneously. We immediately rolled back the application tier — new pods swapped for old ones within 3 minutes. The outage continued. Whatever was wrong was not in the application layer.

The second theory was a connection pool exhaustion cascade. We'd seen this before: a slow operation holds connections open, the pool fills, new connections queue, latency spikes, and the system seizes. We checked PgBouncer metrics: 400 connections waiting. That matched the symptom, but it was the symptom, not the cause. Something was holding those connections, not the other way around.
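(The same picture is visible from PgBouncer's side, incidentally. Connecting to the virtual pgbouncer admin database with psql exposes pool statistics; the exact column set varies by PgBouncer version.)

```sql
-- psql -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;
-- cl_active  : client connections currently being served
-- cl_waiting : client connections queued for a server slot
-- sv_active  : server connections executing queries
-- A large cl_waiting with every sv_active slot pinned means clients are
-- queued behind something server-side: the pool is the victim, not the cause.
```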

It took us seven minutes to think of checking what was actually running inside Postgres:

postgres — active queries at 10:06 AM
SELECT pid, now() - pg_stat_activity.query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

--  pid   |  duration   |  state  | query
-- -------+-------------+---------+------------------------------------------
-- 48291  | 00:03:42    | active  | ALTER TABLE users ADD COLUMN onboarding_source...
-- 48302  | 00:03:42    | active  | SELECT id, email FROM users WHERE...       (waiting)
-- 48317  | 00:03:41    | active  | UPDATE users SET last_login_at = ...        (waiting)
-- 48331  | 00:03:41    | active  | SELECT * FROM users WHERE tenant_id = ...   (waiting)
-- ... (394 more waiting queries)

There it was. PID 48291 had been running for 3 minutes and 42 seconds. The ALTER TABLE had an exclusive lock on users. Every other query — SELECTs, UPDATEs, anything touching the table — was queued behind it. We had 397 queries piled up and waiting.
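In hindsight there is a faster diagnostic than eyeballing durations: pg_blocking_pids() (available since Postgres 9.6) names the blocker directly. A sketch of the query we now keep bookmarked:

```sql
-- For every backend that is blocked, list the PIDs it is waiting on.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
-- During this incident, every waiting backend would have pointed at 48291.
```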


Root Cause: What Postgres Does With NOT NULL + DEFAULT

The root cause was a fundamental misunderstanding of how Postgres handles schema changes internally — one that is extremely easy to get wrong even with years of Postgres experience.

In Postgres versions prior to 11, adding any column with a DEFAULT value required a full table rewrite. Postgres had to scan every existing row and physically write the new column's value into each one. For a table with 2.4 million rows and variable-width VARCHAR data, that meant re-serializing 2.4 million tuples on disk — all while holding an ACCESS EXCLUSIVE lock that blocks every read and write to the table.

Postgres 11 introduced a major optimization: adding a column with a non-volatile DEFAULT (a constant, or any expression that is not volatile like random()) no longer requires a table rewrite. Postgres stores the default in the catalog and serves it virtually for old rows, writing the value physically only when a row is next updated. We knew this. We were on Postgres 14. So what went wrong?

The NOT NULL constraint.

Here is the precise behavior: in Postgres 14, adding a column with DEFAULT alone is safe and fast. Adding NOT NULL DEFAULT forces Postgres to validate that no existing row has a NULL in that column — which requires scanning every row — and in many cases triggers the legacy full-rewrite code path depending on the column type and storage. Our VARCHAR(64) NOT NULL DEFAULT hit the slow path. Postgres rewrote the entire table.

  POSTGRES ALTER TABLE — FAST PATH vs. SLOW PATH

  ─────────────────────────────────────────────────────────────────────

  ADD COLUMN col VARCHAR(64) DEFAULT 'organic'   ← Postgres 11+: FAST ✓
  ───────────────────────────────────────────────
  No table rewrite. Default stored in pg_attrdef.
  Old rows serve the default value virtually.
  Lock duration: milliseconds.

  ─────────────────────────────────────────────────────────────────────

  ADD COLUMN col VARCHAR(64) NOT NULL DEFAULT 'organic'  ← SLOW PATH ✗
  ──────────────────────────────────────────────────────
  Postgres must validate NOT NULL across all existing rows.
  Triggers full table rewrite to physically embed the default.
  Acquires ACCESS EXCLUSIVE lock for entire duration.

  ACCESS EXCLUSIVE lock blocks:
    ✗ SELECT          ✗ INSERT          ✗ UPDATE
    ✗ DELETE          ✗ VACUUM          ✗ any concurrent access

  2.4M rows × rewrite overhead = 14 minutes of total outage.

  ─────────────────────────────────────────────────────────────────────

The irony was cruel: we added NOT NULL to enforce data integrity. That exact constraint turned a 1-second migration into a 14-minute outage.

The staging environment never surfaced this because with 50,000 rows the rewrite completed in 1.2 seconds, fast enough that we never noticed a lock was held at all. The behavior was identical. The scale was not.
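There is a cheap way to catch the slow path in staging regardless of row count: a table rewrite allocates a new physical file, so the table's relfilenode changes. A sketch of the check, using the same migration:

```sql
-- Note the table's on-disk file id before the migration...
SELECT relfilenode FROM pg_class WHERE relname = 'users';

ALTER TABLE users
  ADD COLUMN onboarding_source VARCHAR(64) NOT NULL DEFAULT 'organic';

-- ...and compare afterwards. A changed relfilenode means Postgres
-- rewrote the table, even if it only took a second on 50,000 rows.
SELECT relfilenode FROM pg_class WHERE relname = 'users';
```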


Architecture Fix: The Safe Zero-Downtime Migration Pattern

Fixing this required rethinking how we write any schema migration that touches a large, live table. The correct pattern is to decompose a single dangerous operation into three safe steps deployed across separate releases.

  SAFE MIGRATION: 3-STEP ZERO-DOWNTIME PATTERN

  ─────────────────────────────────────────────────────────────────────

  Step 1 — Deploy migration: ADD COLUMN nullable, no constraint
  ─────────────────────────────────────────────────────────────
  ALTER TABLE users ADD COLUMN onboarding_source VARCHAR(64);
  -- Fast path. No rewrite. Milliseconds.
  -- Existing rows: NULL (allowed for now)
  -- New rows: application code writes the value explicitly

  ─────────────────────────────────────────────────────────────────────

  Step 2 — Backfill existing rows in batches (background job)
  ─────────────────────────────────────────────────────────────
  UPDATE users
    SET onboarding_source = 'organic'
  WHERE onboarding_source IS NULL
    AND id BETWEEN ${batchStart} AND ${batchEnd};
  -- No table lock. Small batches. Runs offline at low priority.
  -- Takes minutes, not seconds — but table stays fully live.

  ─────────────────────────────────────────────────────────────────────

  Step 3 — Deploy migration: add NOT NULL + default after backfill
  ─────────────────────────────────────────────────────────────────
  ALTER TABLE users
    ALTER COLUMN onboarding_source SET DEFAULT 'organic',
    ALTER COLUMN onboarding_source SET NOT NULL;
  -- SET NOT NULL verifies existing rows with a single scan; no rewrite
  -- Only safe once the backfill has left zero NULLs in the column

  ─────────────────────────────────────────────────────────────────────

This pattern spreads the work across three deployments and a background job. The table is never locked for more than a few milliseconds at any point. Users never see any interruption.
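One refinement for very large tables: even Step 3's SET NOT NULL performs a verification scan while holding its lock. Postgres 12 and later will skip that scan if a validated CHECK constraint already proves the column has no NULLs, and VALIDATE CONSTRAINT only takes a SHARE UPDATE EXCLUSIVE lock. A sketch of the extra-cautious variant (the constraint name is illustrative):

```sql
-- 1. Add the check without validating existing rows. Instant.
ALTER TABLE users
  ADD CONSTRAINT users_onboarding_source_not_null
  CHECK (onboarding_source IS NOT NULL) NOT VALID;

-- 2. Validate it. Scans the table, but reads and writes continue.
ALTER TABLE users
  VALIDATE CONSTRAINT users_onboarding_source_not_null;

-- 3. SET NOT NULL sees the validated check and skips its own scan.
ALTER TABLE users ALTER COLUMN onboarding_source SET NOT NULL;

-- 4. The check is now redundant.
ALTER TABLE users
  DROP CONSTRAINT users_onboarding_source_not_null;
```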

For the backfill step, we wrote a simple script that processes rows in chunks of 10,000, sleeping 50ms between batches to avoid overwhelming the primary:

scripts/backfill-onboarding-source.ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const BATCH_SIZE = 10_000;
const SLEEP_MS = 50;

async function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function backfill() {
  const { rows: [{ max }] } = await pool.query('SELECT MAX(id) FROM users');
  // pg returns bigint aggregates as strings; coerce defensively, and
  // treat an empty table (MAX = NULL) as nothing to do.
  const maxId = Number(max ?? 0);

  let batchStart = 1;
  let totalUpdated = 0;

  while (batchStart <= maxId) {
    const batchEnd = batchStart + BATCH_SIZE - 1;

    const result = await pool.query(
      `UPDATE users
         SET onboarding_source = 'organic'
       WHERE onboarding_source IS NULL
         AND id BETWEEN $1 AND $2`,
      [batchStart, batchEnd]
    );

    totalUpdated += result.rowCount ?? 0;
    console.log(`Backfilled up to id=${batchEnd} | total=${totalUpdated}`);  // eslint-disable-line

    batchStart = batchEnd + 1;
    await sleep(SLEEP_MS); // Yield time to avoid replica lag buildup
  }

  console.log(`Backfill complete. ${totalUpdated} rows updated.`);
  await pool.end();
}

backfill().catch(console.error);

The backfill script ran across 2.4 million rows in approximately 18 minutes — longer than the original migration would have taken — but the users table stayed fully accessible the entire time. We ran it at 2 AM on a low-traffic night and verified zero NULL rows before deploying Step 3. The final NOT NULL constraint addition completed in under 200 milliseconds.
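The sleep between batches exists to let replicas keep up, which makes replication lag worth watching during any large backfill. A sketch of the query, assuming streaming replication:

```sql
-- Bytes of WAL each standby has yet to replay.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```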


What We Changed After This

The incident led to three durable changes in how we manage database schemas:

  • Migration review checklist, enforced in CI. Every migration touching a table with more than 100,000 rows must pass a review gate. The checklist explicitly flags: any ADD COLUMN ... NOT NULL, ADD CONSTRAINT, column type changes, and index creation without CONCURRENTLY. Pull requests containing flagged patterns require a sign-off from a second engineer who has read the checklist.
  • Production row counts in migration review context. We added a GitHub Actions step that fetches live table sizes from a read replica and posts them as a PR comment alongside any migration file changes. Reviewers no longer have to remember that the users table is 10× larger in production than in staging.
  • lock_timeout on all migration runs. We now set lock_timeout = '5s' and statement_timeout = '30s' as session-level settings before any production migration. If a migration cannot acquire its lock within 5 seconds (because traffic is ongoing), it fails fast instead of queueing behind live traffic and compounding the problem. A failed migration is infinitely better than a silent 14-minute lock.
migrations/run-migration.sql (wrapper we now use)
-- Applied before every production migration
SET lock_timeout = '5s';         -- Fail fast if we can't get the lock
SET statement_timeout = '30s';   -- Kill runaway migrations before they spiral
SET idle_in_transaction_session_timeout = '10s';

-- The actual migration runs here
\i 20260315_add_onboarding_source_step1.sql

RESET lock_timeout;
RESET statement_timeout;
RESET idle_in_transaction_session_timeout;

Lessons Learned

"The migration that looks harmless is almost always the migration that hurts you. The dangerous ones, you're already careful about."
  • NOT NULL + DEFAULT is not the same as DEFAULT alone. Postgres 11+ optimised nullable column addition with a default. It did not fully optimise the NOT NULL case. Always decompose: add nullable, backfill, then constrain.
  • Staging row counts are fiction. A migration that takes 1 second on 50,000 rows will not simply take 48 times as long on 2.4 million rows; it may hit an entirely different code path with entirely different locking behavior. Always benchmark migrations against a production-sized snapshot.
  • Add indexes CONCURRENTLY, always. A plain CREATE INDEX blocks every write to the table for the full build duration. CREATE INDEX CONCURRENTLY does not. There is no good operational reason to create an index non-concurrently on a live table.
  • lock_timeout is your circuit breaker. Setting lock_timeout before migrations ensures that if your migration cannot acquire its lock immediately, it aborts cleanly rather than queueing behind 400 live connections and making the outage worse.
  • Run migrations separately from code deployments. We now treat schema migrations as independent deploys with their own approval step, not as a pre-flight step bundled into the application rollout. This decoupling means a migration problem never takes down application code that was working fine before.
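The CONCURRENTLY point deserves a footnote, since it has sharp edges of its own: it cannot run inside a transaction block, and a failed build leaves an INVALID index that must be dropped by hand. A sketch (the index name is illustrative):

```sql
-- Builds without blocking writes. Slower than a plain CREATE INDEX,
-- and must run outside a transaction (no BEGIN/COMMIT wrapper).
CREATE INDEX CONCURRENTLY idx_users_onboarding_source
  ON users (onboarding_source);

-- If the build fails, drop the leftover INVALID index before retrying:
-- DROP INDEX CONCURRENTLY idx_users_onboarding_source;
```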

The outage lasted 14 minutes. Incident review lasted two hours. The checklist and CI changes took a day to implement. We've shipped dozens of schema migrations since — including several on tables larger than users — and none of them have required so much as a maintenance window.

One SQL keyword cost us a Monday morning. NOT NULL. The lesson isn't that constraints are bad — it's that adding a constraint to 2.4 million rows of existing data is not a schema change, it's a data migration. Treat it like one.

— Darshan Turakhia, now reads every ALTER TABLE like it's a production change order (because it is)