How a NOT NULL Column Migration Locked Our Users Table for 14 Minutes
March 15, 2026 · Database · 10 min read


It was 10:03 AM on a Monday when every status monitor in our dashboard turned red simultaneously. Not a partial outage. Not a single service. Everything — the API, the dashboard, the mobile backend — all down at once. The Slack pager-duty channel exploded. The timing was not a coincidence. We had just deployed a "routine" schema migration two minutes earlier.

The migration looked completely harmless in code review. One line. Adding a column. Adding a column with a safe default value. What followed was 14 minutes of a full exclusive lock on our 2.4-million-row users table — and a lesson about Postgres internals that I will carry for the rest of my career.


Production Failure

We were running a multi-tenant SaaS platform. The users table had 2.4 million rows across all tenants — not enormous by data warehouse standards, but a core operational table hit on nearly every authenticated API request. Our deployment pipeline ran database migrations as the first step, before rolling out new application code. That ordering decision, combined with what seemed like a trivial column addition, caused a complete service blackout.

The migration was part of a feature to track how users had been onboarded — an onboarding_source column to distinguish users who signed up via organic search, a referral link, or a sales-assisted demo. The column needed a sensible default for existing rows.

migrations/20260315_add_onboarding_source.sql (the offender)
-- This migration ran fine in staging (50,000 rows).
-- In production (2.4M rows), it acquired an exclusive lock for 14 minutes.

ALTER TABLE users
  ADD COLUMN onboarding_source VARCHAR(64) NOT NULL DEFAULT 'organic';

In our staging environment — 50,000 rows — this migration ran in 1.2 seconds. In production, with 2.4 million rows, it ran for 14 minutes and held an exclusive lock on the entire table for the full duration. No reads. No writes. Nothing.

  • 14 min: full table lock duration
  • 2.4M: rows rewritten
  • 100%: API error rate
  • 1.2 s: same migration in staging

False Assumptions

Our first instinct was to blame the application deployment, not the migration. The new code had shipped simultaneously. We immediately rolled back the application tier — new pods swapped for old ones within 3 minutes. The outage continued. Whatever was wrong was not in the application layer.

The second theory was a connection pool exhaustion cascade. We'd seen this before: a slow operation holds connections open, the pool fills, new connections queue, latency spikes, and the system seizes. We checked PgBouncer metrics: 400 connections waiting. That matched the symptom, but it was the symptom, not the cause. Something was holding those connections, not the other way around.
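(The same picture is visible from PgBouncer's side, incidentally. Connecting to the virtual pgbouncer admin database with psql exposes pool statistics; the exact column set varies by PgBouncer version.)

```sql
-- psql -p 6432 -U pgbouncer pgbouncer
SHOW POOLS;
-- cl_active  : client connections currently being served
-- cl_waiting : client connections queued for a server slot
-- sv_active  : server connections executing queries
-- A large cl_waiting with every sv_active slot pinned means clients are
-- queued behind something server-side: the pool is the victim, not the cause.
```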

It took us seven minutes to think of checking what was actually running inside Postgres:

postgres — active queries at 10:06 AM
SELECT pid, now() - pg_stat_activity.query_start AS duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

--  pid   |  duration   |  state  | query
-- -------+-------------+---------+------------------------------------------
-- 48291  | 00:03:42    | active  | ALTER TABLE users ADD COLUMN onboarding_source...
-- 48302  | 00:03:42    | active  | SELECT id, email FROM users WHERE...       (waiting)
-- 48317  | 00:03:41    | active  | UPDATE users SET last_login_at = ...        (waiting)
-- 48331  | 00:03:41    | active  | SELECT * FROM users WHERE tenant_id = ...   (waiting)
-- ... (394 more waiting queries)

There it was. PID 48291 had been running for 3 minutes and 42 seconds. The ALTER TABLE had an exclusive lock on users. Every other query — SELECTs, UPDATEs, anything touching the table — was queued behind it. We had 397 queries piled up and waiting.
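In hindsight there is a faster diagnostic than eyeballing durations: pg_blocking_pids() (available since Postgres 9.6) names the blocker directly. A sketch of the query we now keep bookmarked:

```sql
-- For every backend that is blocked, list the PIDs it is waiting on.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
-- During this incident, every waiting backend would have pointed at 48291.
```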


Root Cause: What Postgres Does With NOT NULL + DEFAULT

The root cause was a fundamental misunderstanding of how Postgres handles schema changes internally — one that is extremely easy to get wrong even with years of Postgres experience.

In Postgres versions prior to 11, adding any column with a DEFAULT value required a full table rewrite. Postgres had to scan every existing row and physically write the new column's value into each one. For a table with 2.4 million rows and variable-width VARCHAR data, that meant re-serializing 2.4 million tuples on disk — all while holding an ACCESS EXCLUSIVE lock that blocks every read and write to the table.

Postgres 11 introduced a major optimization: adding a column with a non-volatile DEFAULT (a constant, or any expression that is not volatile like random()) no longer requires a table rewrite. Postgres stores the default in the catalog and serves it virtually for old rows, writing the value physically only when a row is next updated. We knew this. We were on Postgres 14. So what went wrong?

The NOT NULL constraint.

Here is the precise behavior: in Postgres 14, adding a column with DEFAULT alone is safe and fast. Adding NOT NULL DEFAULT forces Postgres to validate that no existing row has a NULL in that column — which requires scanning every row — and in many cases triggers the legacy full-rewrite code path depending on the column type and storage. Our VARCHAR(64) NOT NULL DEFAULT hit the slow path. Postgres rewrote the entire table.

  POSTGRES ALTER TABLE — FAST PATH vs. SLOW PATH

  ─────────────────────────────────────────────────────────────────────

  ADD COLUMN col VARCHAR(64) DEFAULT 'organic'   ← Postgres 11+: FAST ✓
  ───────────────────────────────────────────────
  No table rewrite. Default stored in pg_attrdef.
  Old rows serve the default value virtually.
  Lock duration: milliseconds.

  ─────────────────────────────────────────────────────────────────────

  ADD COLUMN col VARCHAR(64) NOT NULL DEFAULT 'organic'  ← SLOW PATH ✗
  ──────────────────────────────────────────────────────
  Postgres must validate NOT NULL across all existing rows.
  Triggers full table rewrite to physically embed the default.
  Acquires ACCESS EXCLUSIVE lock for entire duration.

  ACCESS EXCLUSIVE lock blocks:
    ✗ SELECT          ✗ INSERT          ✗ UPDATE
    ✗ DELETE          ✗ VACUUM          ✗ any concurrent access

  2.4M rows × rewrite overhead = 14 minutes of total outage.

  ─────────────────────────────────────────────────────────────────────

The irony was cruel: we added NOT NULL to enforce data integrity. That exact constraint turned a 1-second migration into a 14-minute outage.

The staging environment never surfaced this because with 50,000 rows the rewrite completed in 1.2 seconds, fast enough that we never noticed a lock was held at all. The behavior was identical. The scale was not.
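There is a cheap way to catch the slow path in staging regardless of row count: a table rewrite allocates a new physical file, so the table's relfilenode changes. A sketch of the check, using the same migration:

```sql
-- Note the table's on-disk file id before the migration...
SELECT relfilenode FROM pg_class WHERE relname = 'users';

ALTER TABLE users
  ADD COLUMN onboarding_source VARCHAR(64) NOT NULL DEFAULT 'organic';

-- ...and compare afterwards. A changed relfilenode means Postgres
-- rewrote the table, even if it only took a second on 50,000 rows.
SELECT relfilenode FROM pg_class WHERE relname = 'users';
```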


Architecture Fix: The Safe Zero-Downtime Migration Pattern

Fixing this required rethinking how we write any schema migration that touches a large, live table. The correct pattern is to decompose a single dangerous operation into three safe steps deployed across separate releases.

  SAFE MIGRATION: 3-STEP ZERO-DOWNTIME PATTERN

  ─────────────────────────────────────────────────────────────────────

  Step 1 — Deploy migration: ADD COLUMN nullable, no constraint
  ─────────────────────────────────────────────────────────────
  ALTER TABLE users ADD COLUMN onboarding_source VARCHAR(64);
  -- Fast path. No rewrite. Milliseconds.
  -- Existing rows: NULL (allowed for now)
  -- New rows: application code writes the value explicitly

  ─────────────────────────────────────────────────────────────────────

  Step 2 — Backfill existing rows in batches (background job)
  ─────────────────────────────────────────────────────────────
  UPDATE users
    SET onboarding_source = 'organic'
  WHERE onboarding_source IS NULL
    AND id BETWEEN ${batchStart} AND ${batchEnd};
  -- No table lock. Small batches. Runs offline at low priority.
  -- Takes minutes, not seconds — but table stays fully live.

  ─────────────────────────────────────────────────────────────────────

  Step 3 — Deploy migration: add NOT NULL + default after backfill
  ─────────────────────────────────────────────────────────────────
  ALTER TABLE users
    ALTER COLUMN onboarding_source SET DEFAULT 'organic',
    ALTER COLUMN onboarding_source SET NOT NULL;
  -- SET NOT NULL verifies existing rows with a single scan; no rewrite
  -- Only safe once the backfill has left zero NULLs in the column

  ─────────────────────────────────────────────────────────────────────

This pattern spreads the work across three deployments and a background job. The table is never locked for more than a few milliseconds at any point. Users never see any interruption.
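One refinement for very large tables: even Step 3's SET NOT NULL performs a verification scan while holding its lock. Postgres 12 and later will skip that scan if a validated CHECK constraint already proves the column has no NULLs, and VALIDATE CONSTRAINT only takes a SHARE UPDATE EXCLUSIVE lock. A sketch of the extra-cautious variant (the constraint name is illustrative):

```sql
-- 1. Add the check without validating existing rows. Instant.
ALTER TABLE users
  ADD CONSTRAINT users_onboarding_source_not_null
  CHECK (onboarding_source IS NOT NULL) NOT VALID;

-- 2. Validate it. Scans the table, but reads and writes continue.
ALTER TABLE users
  VALIDATE CONSTRAINT users_onboarding_source_not_null;

-- 3. SET NOT NULL sees the validated check and skips its own scan.
ALTER TABLE users ALTER COLUMN onboarding_source SET NOT NULL;

-- 4. The check is now redundant.
ALTER TABLE users
  DROP CONSTRAINT users_onboarding_source_not_null;
```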

For the backfill step, we wrote a simple script that processes rows in chunks of 10,000, sleeping 50ms between batches to avoid overwhelming the primary:

scripts/backfill-onboarding-source.ts
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const BATCH_SIZE = 10_000;
const SLEEP_MS = 50;

async function sleep(ms: number) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function backfill() {
  const { rows: [{ max }] } = await pool.query('SELECT MAX(id) FROM users');
  // pg returns bigint aggregates as strings; coerce defensively, and
  // treat an empty table (MAX = NULL) as nothing to do.
  const maxId = Number(max ?? 0);

  let batchStart = 1;
  let totalUpdated = 0;

  while (batchStart <= maxId) {
    const batchEnd = batchStart + BATCH_SIZE - 1;

    const result = await pool.query(
      `UPDATE users
         SET onboarding_source = 'organic'
       WHERE onboarding_source IS NULL
         AND id BETWEEN $1 AND $2`,
      [batchStart, batchEnd]
    );

    totalUpdated += result.rowCount ?? 0;
    console.log(`Backfilled up to id=${batchEnd} | total=${totalUpdated}`);  // eslint-disable-line

    batchStart = batchEnd + 1;
    await sleep(SLEEP_MS); // Yield time to avoid replica lag buildup
  }

  console.log(`Backfill complete. ${totalUpdated} rows updated.`);
  await pool.end();
}

backfill().catch(console.error);

The backfill script ran across 2.4 million rows in approximately 18 minutes — longer than the original migration would have taken — but the users table stayed fully accessible the entire time. We ran it at 2 AM on a low-traffic night and verified zero NULL rows before deploying Step 3. The final NOT NULL constraint addition completed in under 200 milliseconds.
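The sleep between batches exists to let replicas keep up, which makes replication lag worth watching during any large backfill. A sketch of the query, assuming streaming replication:

```sql
-- Bytes of WAL each standby has yet to replay.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```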


What We Changed After This

The incident led to three durable changes in how we manage database schemas:

  • Migration review checklist, enforced in CI. Every migration touching a table with more than 100,000 rows must pass a review gate. The checklist explicitly flags: any ADD COLUMN ... NOT NULL, ADD CONSTRAINT, column type changes, and index creation without CONCURRENTLY. Pull requests containing flagged patterns require a sign-off from a second engineer who has read the checklist.
  • Production row counts in migration review context. We added a GitHub Actions step that fetches live table sizes from a read replica and posts them as a PR comment alongside any migration file changes. Reviewers no longer have to remember that the users table is 10× larger in production than in staging.
  • lock_timeout on all migration runs. We now set lock_timeout = '5s' and statement_timeout = '30s' as session-level settings before any production migration. If a migration cannot acquire its lock within 5 seconds (because traffic is ongoing), it fails fast instead of queueing behind live traffic and compounding the problem. A failed migration is infinitely better than a silent 14-minute lock.
migrations/run-migration.sql (wrapper we now use)
-- Applied before every production migration
SET lock_timeout = '5s';         -- Fail fast if we can't get the lock
SET statement_timeout = '30s';   -- Kill runaway migrations before they spiral
SET idle_in_transaction_session_timeout = '10s';

-- The actual migration runs here
\i 20260315_add_onboarding_source_step1.sql

RESET lock_timeout;
RESET statement_timeout;
RESET idle_in_transaction_session_timeout;

Lessons Learned

"The migration that looks harmless is almost always the migration that hurts you. The dangerous ones, you're already careful about."
  • NOT NULL + DEFAULT is not the same as DEFAULT alone. Postgres 11+ optimised nullable column addition with a default. It did not fully optimise the NOT NULL case. Always decompose: add nullable, backfill, then constrain.
  • Staging row counts are fiction. A migration that takes 1 second on 50,000 rows will not simply take 48 times as long on 2.4 million rows; it may hit an entirely different code path with entirely different locking behavior. Always benchmark migrations against a production-sized snapshot.
  • Add indexes CONCURRENTLY, always. A plain CREATE INDEX blocks every write to the table for the full build duration. CREATE INDEX CONCURRENTLY does not. There is no good operational reason to create an index non-concurrently on a live table.
  • lock_timeout is your circuit breaker. Setting lock_timeout before migrations ensures that if your migration cannot acquire its lock immediately, it aborts cleanly rather than queueing behind 400 live connections and making the outage worse.
  • Run migrations separately from code deployments. We now treat schema migrations as independent deploys with their own approval step, not as a pre-flight step bundled into the application rollout. This decoupling means a migration problem never takes down application code that was working fine before.
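The CONCURRENTLY point deserves a footnote, since it has sharp edges of its own: it cannot run inside a transaction block, and a failed build leaves an INVALID index that must be dropped by hand. A sketch (the index name is illustrative):

```sql
-- Builds without blocking writes. Slower than a plain CREATE INDEX,
-- and must run outside a transaction (no BEGIN/COMMIT wrapper).
CREATE INDEX CONCURRENTLY idx_users_onboarding_source
  ON users (onboarding_source);

-- If the build fails, drop the leftover INVALID index before retrying:
-- DROP INDEX CONCURRENTLY idx_users_onboarding_source;
```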

The outage lasted 14 minutes. Incident review lasted two hours. The checklist and CI changes took a day to implement. We've shipped dozens of schema migrations since — including several on tables larger than users — and none of them have required so much as a maintenance window.

One SQL keyword cost us a Monday morning. NOT NULL. The lesson isn't that constraints are bad — it's that adding a constraint to 2.4 million rows of existing data is not a schema change, it's a data migration. Treat it like one.

— Darshan Turakhia, now reads every ALTER TABLE like it's a production change order (because it is)