Our Blue-Green Deploy Corrupted 3,200 Records in 11 Minutes
March 17, 2026 · CI/CD · 9 min read


It was supposed to be our safest deploy in months. Blue-green on ECS. Health checks before any traffic shift. Automatic rollback if anything looked wrong. We had rehearsed the runbook three times. At 2:47 PM on a Thursday, I clicked deploy and watched the green environment spin up cleanly. Eleven minutes later, we had corrupted data in production and 3,200 failed subscription writes we had no idea about — because the API was still returning 200s.

The culprit: a single SQL statement — ALTER TABLE users RENAME COLUMN. It executed in 83 milliseconds. The damage took 6 hours to fully scope and 3 days of backfill jobs to repair.


Production Failure

We were refactoring our subscription model. The column subscription_plan in our users table had accumulated 3 years of technical debt — it stored plan name, plan tier, and billing interval all jammed into a single VARCHAR field, pipe-delimited: pro|monthly|29. The new architecture introduced a proper plans table with a foreign key plan_id INTEGER on users.
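To make the legacy format concrete, here is how a value like pro|monthly|29 unpacks — a sketch for illustration; the field names are mine, not from our actual codebase:

```typescript
// Sketch: unpack the legacy pipe-delimited subscription_plan value.
// Field names are illustrative, not from the real codebase.
interface LegacyPlan {
  planKey: string;   // e.g. "pro" — joins to plans.legacy_key in the migration
  interval: string;  // e.g. "monthly"
  price: number;     // e.g. 29
}

function parseLegacyPlan(raw: string): LegacyPlan {
  const [planKey, interval, price] = raw.split('|');
  if (!planKey || !interval || price === undefined) {
    throw new Error(`malformed subscription_plan value: ${raw}`);
  }
  return { planKey, interval, price: Number(price) };
}
```

Three unrelated facts in one VARCHAR, with a parser like this scattered across every consumer — that is the debt the plans table was meant to retire.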

The migration file looked like this:

sql
-- 001_subscription_refactor.sql
ALTER TABLE users ADD COLUMN plan_id INTEGER REFERENCES plans(id);

UPDATE users u
SET plan_id = p.id
FROM plans p
WHERE p.legacy_key = split_part(u.subscription_plan, '|', 1);

ALTER TABLE users RENAME COLUMN subscription_plan TO subscription_plan_deprecated;

ALTER TABLE users ALTER COLUMN plan_id SET NOT NULL;

I had added the RENAME step intentionally — to "keep the old column visible" during the deploy window in case we needed to roll back. What I hadn't thought through: our v1.14.2 containers were still running UPDATE users SET subscription_plan = $1 WHERE id = $2 on every subscription event. The rename itself took 83 milliseconds, landing partway through the migration — and the old containers had no idea the column was gone.

CloudWatch fired at 2:49 PM. The 5xx error rate jumped from 0.08% to 31% on our background job workers. The API layer — which read from a Redis cache — kept returning 200s. Users saw nothing wrong. Our SLA dashboard stayed green. Meanwhile, 8 blue ECS tasks were hammering the database with writes that Postgres rejected outright — ERROR: column "subscription_plan" of relation "users" does not exist — errors the workers swallowed, so nothing surfaced.

By 2:58 PM the green tasks passed health checks and traffic shifted. The 5xx rate dropped. Slack went quiet. The deploy looked successful. It wasn't until the next morning, when our analytics team flagged that subscription update events had a 100% failure rate between 14:49 and 15:00 UTC, that we understood what had happened.


False Assumptions

We had convinced ourselves that blue-green deployments were immune to migration problems. The mental model we were operating under:

  WHAT WE THOUGHT HAPPENED
  ─────────────────────────────────────────────────────────

  14:47  Migrations run against database
         └─ Takes ~2 seconds

  14:47  Green tasks start pulling new Docker image
         └─ Old (blue) tasks still serving 100% traffic

  14:58  Green tasks pass health checks
         └─ Traffic shifts: 0% blue → 100% green

  14:58  Blue tasks drain and stop
         └─ Zero downtime. Zero data loss. ✓

  WHAT ACTUALLY HAPPENED
  ─────────────────────────────────────────────────────────

  14:47  Migrations run (column renamed in 83ms)
         └─ subscription_plan column NO LONGER EXISTS

  14:47  Blue tasks keep running with old code
         └─ Writing to subscription_plan → ERROR (silent)

  14:58  Green tasks take traffic (11 min gap)
         └─ 3,200 writes rejected, cached reads hide it

  Next day Analytics flags 100% failure rate in window ✗

The key false assumption: blue-green protects you from migration errors. It does not. Blue-green protects you from application code errors. If your migration is not backward-compatible with the version of code that was running one deployment ago, blue-green makes the problem worse — because the old containers run longer and generate more failures before you notice.

We also assumed our monitoring would catch write failures immediately. It didn't, because:

  • The failures were in async background workers, not in the synchronous API path
  • Workers swallowed the Postgres error, logged it, and marked the job as "failed" for retry — which never came
  • Read endpoints hit Redis cache, so users got stale-but-valid data with a 200
  • Our 5xx alarm had a 5-minute evaluation window — it fired, but we dismissed it as "transient" noise during the deploy
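The watchdog we later added alerts on terminal job states directly, instead of inferring health from API error rates. A minimal sketch of the decision logic — the thresholds and the shape of the counts are hypothetical, not pg-boss API:

```typescript
// Sketch: decide whether a window of job-state counts should page someone.
// Thresholds and the JobStateCounts shape are hypothetical.
interface JobStateCounts {
  completed: number;
  failed: number; // terminal failures (retries exhausted, or retryLimit: 0)
}

function shouldPage(counts: JobStateCounts, maxFailureRate = 0.01): boolean {
  const total = counts.completed + counts.failed;
  if (total === 0) return false;
  // Check an absolute threshold too, so a busy queue can't dilute a burst
  // of failures below the rate threshold.
  return counts.failed / total > maxFailureRate || counts.failed >= 100;
}
```

Fed the incident's numbers (3,247 failed vs 112 completed in the window), this pages immediately instead of the next morning.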

Investigation

The analytics team's Slack message arrived at 9:14 AM the next day: "Subscription update events are missing for a ~11-minute window yesterday afternoon. Looks like 14:49–15:00 UTC. Did we have a deploy?"

First stop: the Postgres error log on RDS. We filtered for errors during that window:

sql
SELECT query, error_severity, message, query_start
FROM pg_log_entries
WHERE query_start BETWEEN '2026-03-16 14:48:00' AND '2026-03-16 15:01:00'
  AND error_severity = 'ERROR'
ORDER BY query_start;

The result was unambiguous: 3,247 rows, all with the same message:

  ERROR:  column "subscription_plan" of relation "users" does not exist
  LINE 1: UPDATE users SET subscription_plan = $1 WHERE id = $2
                           ^
  CONTEXT: PL/pgSQL function process_subscription_event(integer,text)

Every single one originated from our background job queue, specifically the process_subscription_event function. The errors started at 14:49:03 UTC — exactly 83 seconds after the migration file began executing — and stopped at 15:00:12 UTC, when the last blue ECS task received its SIGTERM and drained.

Cross-referencing with our job queue (Postgres-backed, via pg-boss):

sql
SELECT COUNT(*), state
FROM pgboss.job
WHERE name = 'subscription-update'
  AND created_on BETWEEN '2026-03-16 14:48:00' AND '2026-03-16 15:01:00'
GROUP BY state;

-- state     | count
-- ----------+-------
-- failed    |  3247
-- completed |   112

3,247 failed jobs. All marked as permanently failed because our retry policy had retryLimit: 0 for subscription events (to prevent double-processing). We had built idempotency protection into the system and accidentally made failures unrecoverable.


Root Cause

The root cause was deploying a non-backward-compatible migration alongside application code, in a system where the old version of the application would continue running for up to 15 minutes after the migration executed.

A column rename breaks backward compatibility in both directions:

  • Old code → new schema: Writes to subscription_plan fail (column doesn't exist)
  • New code → old schema: Reads from plan_id fail (column doesn't exist yet)

The rename is a breaking change. Full stop. Blue-green deployments require a window where both old and new code can run simultaneously against the same database. That means every migration must be compatible with the N-1 version of the application. Renames are never compatible.

The secondary cause: our worker queue had retryLimit: 0 specifically because we were afraid of duplicate subscription processing. Reasonable protection. But it meant that any transient error during a deploy window was permanently fatal — with no alerting.


The Fix

We rebuilt our migration strategy around the Expand-Contract pattern (sometimes called Parallel Change). Every schema change now happens in three separate deploys:

  EXPAND-CONTRACT MIGRATION PATTERN
  ─────────────────────────────────────────────────────────

  DEPLOY 1 — Expand (backward-compatible addition)
  ├─ Migration: ADD COLUMN plan_id INTEGER (nullable)
  ├─ Migration: Backfill plan_id from subscription_plan
  └─ App code: Writes to BOTH columns. Reads from old column.

  DEPLOY 2 — Migrate (shift reads to new column)
  ├─ No migration needed
  └─ App code: Writes to BOTH columns. Reads from new column.
       └─ Old code still writing to old column = safe

  DEPLOY 3 — Contract (remove old column)
  ├─ Migration: DROP COLUMN subscription_plan
  └─ App code: Writes to plan_id only. Reads from plan_id.

  Result: Zero downtime. Zero data loss. Fully reversible at each step.
  ─────────────────────────────────────────────────────────

The updated migration for Deploy 1:

sql
-- Deploy 1: Expand only — always backward-compatible
ALTER TABLE users ADD COLUMN plan_id INTEGER REFERENCES plans(id);

-- Backfill existing rows in id-range batches, committing each batch
-- so row locks are released as we go (COMMIT inside DO requires
-- Postgres 11+ and running outside an explicit transaction block)
DO $$
DECLARE
  batch_size INT := 1000;
  last_id    BIGINT := 0;
  max_id     BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;
  WHILE last_id < max_id LOOP
    UPDATE users u
    SET plan_id = p.id
    FROM plans p
    WHERE p.legacy_key = split_part(u.subscription_plan, '|', 1)
      AND u.plan_id IS NULL
      AND u.id > last_id
      AND u.id <= last_id + batch_size;

    -- id-range pagination: unlike OFFSET over the shrinking NULL set,
    -- rows that stay NULL (no matching plan) can't shift the window
    -- and cause other rows to be skipped
    last_id := last_id + batch_size;
    COMMIT;
    PERFORM pg_sleep(0.05); -- breathe between batches
  END LOOP;
END;
$$;
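Before cutting over to Deploy 2, a sanity check that the backfill actually converged — any surviving NULLs should be rows whose legacy value matches no plan (queries are a sketch of what we run, not the exact scripts):

```sql
-- How many rows did the backfill not reach?
SELECT COUNT(*) AS unmigrated
FROM users
WHERE plan_id IS NULL;

-- Survivors should all be legacy keys with no matching plans row
SELECT split_part(subscription_plan, '|', 1) AS legacy_key, COUNT(*)
FROM users
WHERE plan_id IS NULL
GROUP BY 1
ORDER BY 2 DESC;
```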

And the application code change for Deploy 2 — dual-write with read-from-new:

typescript
// Before: single column write
async function updateSubscription(userId: number, planKey: string): Promise<void> {
  await db.query(
    'UPDATE users SET subscription_plan = $1 WHERE id = $2',
    [planKey, userId]
  );
}

// Deploy 2: dual-write, read from new column
async function updateSubscription(userId: number, planId: number, planKey: string): Promise<void> {
  await db.query(
    `UPDATE users
     SET plan_id = $1,
         subscription_plan = $2   -- keep old column alive for Deploy 1 rollback
     WHERE id = $3`,
    [planId, planKey, userId]
  );
}

// Deploy 3: drop old column write (migration removes it)
async function updateSubscription(userId: number, planId: number): Promise<void> {
  await db.query(
    'UPDATE users SET plan_id = $1 WHERE id = $2',
    [planId, userId]
  );
}

We also fixed the worker retry policy. The issue wasn't retryLimit: 0 — the issue was why we needed it. We replaced the "no retry" protection with a proper idempotency key approach:

typescript
// Idempotency via unique constraint — safe to retry
await db.query(`
  INSERT INTO subscription_events (user_id, plan_id, event_type, idempotency_key, processed_at)
  VALUES ($1, $2, $3, $4, NOW())
  ON CONFLICT (idempotency_key) DO NOTHING
`, [userId, planId, eventType, idempotencyKey]);

// Now retries are safe — duplicates are ignored by the DB
// retryLimit: 3, retryBackoff: true (pg-boss exponential backoff)

The backfill for the 3,247 failed jobs ran over the following weekend. We replayed events from our message queue (SQS had 7-day retention), matched each to a user, and re-applied the subscription update with the new code. Total records recovered: 3,241 of 3,247. Six users had deleted their accounts in the interim and we credited them manually.
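The replay itself reduced to: collapse duplicate SQS deliveries per idempotency key, keep only the latest event per user, then re-apply with the new code. A sketch of the collapse step — the event shape here is illustrative, not our real message schema:

```typescript
// Sketch: collapse replayed queue events before re-applying them.
// Event shape is illustrative; occurredAt is epoch millis.
interface SubscriptionEvent {
  userId: number;
  planId: number;
  idempotencyKey: string;
  occurredAt: number;
}

function collapseForReplay(events: SubscriptionEvent[]): SubscriptionEvent[] {
  // 1. Drop duplicate deliveries (SQS is at-least-once).
  const byKey = new Map<string, SubscriptionEvent>();
  for (const e of events) {
    if (!byKey.has(e.idempotencyKey)) byKey.set(e.idempotencyKey, e);
  }
  // 2. Keep only the latest event per user — earlier ones are superseded.
  const latestByUser = new Map<number, SubscriptionEvent>();
  for (const e of byKey.values()) {
    const prev = latestByUser.get(e.userId);
    if (!prev || e.occurredAt > prev.occurredAt) latestByUser.set(e.userId, e);
  }
  return [...latestByUser.values()];
}
```

Re-applying the collapsed set through the idempotency-keyed insert above is then safe to run (and re-run) without double-processing anyone.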


Lessons Learned

1. Blue-green deployments do not protect you from bad migrations. They protect you from bad application code. If your migration is not backward-compatible with the previous version of your application, blue-green amplifies the blast radius — the old containers run longer and generate more failures before the shift completes.

2. Column renames are always breaking changes. There is no safe way to rename a column in a zero-downtime system without going through Expand-Contract. The rename must be decomposed into: add new column → migrate data → shift reads → shift writes → drop old column. Minimum 3 deploys.

3. Silent failures are more dangerous than loud ones. Our system returned 200s to users while writing nothing to the database. A loud failure (500 to the user) would have triggered immediate investigation. Silence let the window grow to 11 minutes. Invest heavily in async job failure alerting — not just API error rates.

4. retryLimit: 0 on background jobs is almost always wrong. If you need idempotency (you do), implement it properly with a unique constraint or idempotency key — not by disabling retries. A job that can't retry is a job that will silently corrupt your data the first time anything transient goes wrong.

5. Migration linting should be part of CI. We now run migration safety checks in our CI pipeline that flag: renames, NOT NULL additions without defaults, index creation without CONCURRENTLY, and column type changes. The pipeline blocks the PR if any migration is flagged as potentially breaking. Total CI addition: 4 lines of config, 0 runtime overhead.
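The lint itself is a pass over each migration file for the patterns we now treat as breaking. A minimal sketch of the idea — our real rule set is longer and lives in the pipeline config, and these regexes are simplified for illustration:

```typescript
// Sketch of the migration lint: flag DDL that breaks N-1 compatibility.
// Simplified patterns; the real check has more rules and fewer false
// positives (e.g. it allows SET NOT NULL when a DEFAULT exists).
const BREAKING_PATTERNS: Array<[RegExp, string]> = [
  [/\bRENAME\s+COLUMN\b/i, 'column rename — decompose via expand-contract'],
  [/\bDROP\s+COLUMN\b/i, 'column drop — only safe in a contract deploy'],
  [/\bSET\s+NOT\s+NULL\b/i, 'NOT NULL addition — N-1 code may write NULLs'],
  [/\bCREATE\s+INDEX\s+(?!CONCURRENTLY)/i, 'index without CONCURRENTLY — blocks writes'],
];

function lintMigration(sql: string): string[] {
  return BREAKING_PATTERNS
    .filter(([pattern]) => pattern.test(sql))
    .map(([, message]) => message);
}
```

Run against the original 001_subscription_refactor.sql, this flags both the rename and the NOT NULL addition — either one would have blocked the PR that caused this incident.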

The total cost of this incident: 11 minutes of data corruption, 3,247 failed job records, 3 days of engineer time for investigation and backfill, and one very uncomfortable all-hands retro. The fix cost us 3 extra deploys. That's the deal.
