Our Blue-Green Deploy Corrupted 3,200 Records in 11 Minutes
March 17, 2026 · CI/CD · 9 min read


It was supposed to be our safest deploy in months. Blue-green on ECS. Health checks before any traffic shift. Automatic rollback if anything looked wrong. We had rehearsed the runbook three times. At 2:47 PM on a Thursday, I clicked deploy and watched the green environment spin up cleanly. Eleven minutes later, we had corrupted data in production and 3,200 failed subscription writes we had no idea about — because the API was still returning 200s.

The culprit: a single SQL statement — ALTER TABLE users RENAME COLUMN. It executed in 83 milliseconds. The damage took 6 hours to fully scope and 3 days of backfill jobs to repair.


Production Failure

We were refactoring our subscription model. The column subscription_plan in our users table had accumulated 3 years of technical debt — it stored plan name, plan tier, and billing interval all jammed into a single VARCHAR field, pipe-delimited: pro|monthly|29. The new architecture introduced a proper plans table with a foreign key plan_id INTEGER on users.
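To make the legacy format concrete, here is how a value like pro|monthly|29 unpacks — a sketch for illustration; the field names are mine, not from our actual codebase:

```typescript
// Sketch: unpack the legacy pipe-delimited subscription_plan value.
// Field names are illustrative, not from the real codebase.
interface LegacyPlan {
  planKey: string;   // e.g. "pro" — joins to plans.legacy_key in the migration
  interval: string;  // e.g. "monthly"
  price: number;     // e.g. 29
}

function parseLegacyPlan(raw: string): LegacyPlan {
  const [planKey, interval, price] = raw.split('|');
  if (!planKey || !interval || price === undefined) {
    throw new Error(`malformed subscription_plan value: ${raw}`);
  }
  return { planKey, interval, price: Number(price) };
}
```

Three unrelated facts in one VARCHAR, with a parser like this scattered across every consumer — that is the debt the plans table was meant to retire.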

The migration file looked like this:

sql
-- 001_subscription_refactor.sql
ALTER TABLE users ADD COLUMN plan_id INTEGER REFERENCES plans(id);

UPDATE users u
SET plan_id = p.id
FROM plans p
WHERE p.legacy_key = split_part(u.subscription_plan, '|', 1);

ALTER TABLE users RENAME COLUMN subscription_plan TO subscription_plan_deprecated;

ALTER TABLE users ALTER COLUMN plan_id SET NOT NULL;

I had added the RENAME step intentionally — to "keep the old column visible" during the deploy window in case we needed to roll back. What I hadn't thought through: our v1.14.2 containers were still running UPDATE users SET subscription_plan = $1 WHERE id = $2 on every subscription event. The rename itself took 83 milliseconds, landing partway through the migration — and the old containers had no idea the column was gone.

CloudWatch fired at 2:49 PM. The 5xx error rate jumped from 0.08% to 31% on our background job workers. The API layer — which read from a Redis cache — kept returning 200s. Users saw nothing wrong. Our SLA dashboard stayed green. Meanwhile, 8 blue ECS tasks were hammering the database with writes that Postgres rejected outright — ERROR: column "subscription_plan" of relation "users" does not exist — errors the workers swallowed, so nothing surfaced.

By 2:58 PM the green tasks passed health checks and traffic shifted. The 5xx rate dropped. Slack went quiet. The deploy looked successful. It wasn't until the next morning, when our analytics team flagged that subscription update events had a 100% failure rate between 14:49 and 15:00 UTC, that we understood what had happened.


False Assumptions

We had convinced ourselves that blue-green deployments were immune to migration problems. The mental model we were operating under:

  WHAT WE THOUGHT HAPPENED
  ─────────────────────────────────────────────────────────

  14:47  Migrations run against database
         └─ Takes ~2 seconds

  14:47  Green tasks start pulling new Docker image
         └─ Old (blue) tasks still serving 100% traffic

  14:58  Green tasks pass health checks
         └─ Traffic shifts: 0% blue → 100% green

  14:58  Blue tasks drain and stop
         └─ Zero downtime. Zero data loss. ✓

  WHAT ACTUALLY HAPPENED
  ─────────────────────────────────────────────────────────

  14:47  Migrations run (column renamed in 83ms)
         └─ subscription_plan column NO LONGER EXISTS

  14:47  Blue tasks keep running with old code
         └─ Writing to subscription_plan → ERROR (silent)

  14:58  Green tasks take traffic (11 min gap)
         └─ 3,200 writes rejected, cached reads hide it

  Next day Analytics flags 100% failure rate in window ✗

The key false assumption: blue-green protects you from migration errors. It does not. Blue-green protects you from application code errors. If your migration is not backward-compatible with the version of code that was running one deployment ago, blue-green makes the problem worse — because the old containers run longer and generate more failures before you notice.

We also assumed our monitoring would catch write failures immediately. It didn't, because:

  • The failures were in async background workers, not in the synchronous API path
  • Workers swallowed the Postgres error, logged it, and marked the job as "failed" for retry — which never came
  • Read endpoints hit Redis cache, so users got stale-but-valid data with a 200
  • Our 5xx alarm had a 5-minute evaluation window — it fired, but we dismissed it as "transient" noise during the deploy
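The watchdog we later added alerts on terminal job states directly, instead of inferring health from API error rates. A minimal sketch of the decision logic — the thresholds and the shape of the counts are hypothetical, not pg-boss API:

```typescript
// Sketch: decide whether a window of job-state counts should page someone.
// Thresholds and the JobStateCounts shape are hypothetical.
interface JobStateCounts {
  completed: number;
  failed: number; // terminal failures (retries exhausted, or retryLimit: 0)
}

function shouldPage(counts: JobStateCounts, maxFailureRate = 0.01): boolean {
  const total = counts.completed + counts.failed;
  if (total === 0) return false;
  // Check an absolute threshold too, so a busy queue can't dilute a burst
  // of failures below the rate threshold.
  return counts.failed / total > maxFailureRate || counts.failed >= 100;
}
```

Fed the incident's numbers (3,247 failed vs 112 completed in the window), this pages immediately instead of the next morning.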

Investigation

The analytics team's Slack message arrived at 9:14 AM the next day: "Subscription update events are missing for a ~11-minute window yesterday afternoon. Looks like 14:49–15:00 UTC. Did we have a deploy?"

First stop: the Postgres error log on RDS. We filtered for errors during that window:

sql
SELECT query, error_severity, message, query_start
FROM pg_log_entries
WHERE query_start BETWEEN '2026-03-16 14:48:00' AND '2026-03-16 15:01:00'
  AND error_severity = 'ERROR'
ORDER BY query_start;

The result was unambiguous: 3,247 rows, all with the same message:

  ERROR:  column "subscription_plan" of relation "users" does not exist
  LINE 1: UPDATE users SET subscription_plan = $1 WHERE id = $2
                           ^
  CONTEXT: PL/pgSQL function process_subscription_event(integer,text)

Every single one originated from our background job queue, specifically the process_subscription_event function. The errors started at 14:49:03 UTC — exactly 83 seconds after the migration file began executing — and stopped at 15:00:12 UTC, when the last blue ECS task received its SIGTERM and drained.

Cross-referencing with our job queue (Postgres-backed, via pg-boss):

sql
SELECT COUNT(*), state
FROM pgboss.job
WHERE name = 'subscription-update'
  AND created_on BETWEEN '2026-03-16 14:48:00' AND '2026-03-16 15:01:00'
GROUP BY state;

-- state     | count
-- ----------+-------
-- failed    |  3247
-- completed |   112

3,247 failed jobs. All marked as permanently failed because our retry policy had retryLimit: 0 for subscription events (to prevent double-processing). We had built idempotency protection into the system and accidentally made failures unrecoverable.


Root Cause

The root cause was deploying a non-backward-compatible migration alongside application code, in a system where the old version of the application would continue running for up to 15 minutes after the migration executed.

A column rename breaks backward compatibility in both directions:

  • Old code → new schema: Writes to subscription_plan fail (column doesn't exist)
  • New code → old schema: Reads from plan_id fail (column doesn't exist yet)

The rename is a breaking change. Full stop. Blue-green deployments require a window where both old and new code can run simultaneously against the same database. That means every migration must be compatible with the N-1 version of the application. Renames are never compatible.

The secondary cause: our worker queue had retryLimit: 0 specifically because we were afraid of duplicate subscription processing. Reasonable protection. But it meant that any transient error during a deploy window was permanently fatal — with no alerting.


The Fix

We rebuilt our migration strategy around the Expand-Contract pattern (sometimes called Parallel Change). Every schema change now happens in three separate deploys:

  EXPAND-CONTRACT MIGRATION PATTERN
  ─────────────────────────────────────────────────────────

  DEPLOY 1 — Expand (backward-compatible addition)
  ├─ Migration: ADD COLUMN plan_id INTEGER (nullable)
  ├─ Migration: Backfill plan_id from subscription_plan
  └─ App code: Writes to BOTH columns. Reads from old column.

  DEPLOY 2 — Migrate (shift reads to new column)
  ├─ No migration needed
  └─ App code: Writes to BOTH columns. Reads from new column.
       └─ Old code still writing to old column = safe

  DEPLOY 3 — Contract (remove old column)
  ├─ Migration: DROP COLUMN subscription_plan
  └─ App code: Writes to plan_id only. Reads from plan_id.

  Result: Zero downtime. Zero data loss. Fully reversible at each step.
  ─────────────────────────────────────────────────────────

The updated migration for Deploy 1:

sql
-- Deploy 1: Expand only — always backward-compatible
ALTER TABLE users ADD COLUMN plan_id INTEGER REFERENCES plans(id);

-- Backfill existing rows in id-range batches, committing each batch
-- so row locks are released as we go (COMMIT inside DO requires
-- Postgres 11+ and running outside an explicit transaction block)
DO $$
DECLARE
  batch_size INT := 1000;
  last_id    BIGINT := 0;
  max_id     BIGINT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;
  WHILE last_id < max_id LOOP
    UPDATE users u
    SET plan_id = p.id
    FROM plans p
    WHERE p.legacy_key = split_part(u.subscription_plan, '|', 1)
      AND u.plan_id IS NULL
      AND u.id > last_id
      AND u.id <= last_id + batch_size;

    -- id-range pagination: unlike OFFSET over the shrinking NULL set,
    -- rows that stay NULL (no matching plan) can't shift the window
    -- and cause other rows to be skipped
    last_id := last_id + batch_size;
    COMMIT;
    PERFORM pg_sleep(0.05); -- breathe between batches
  END LOOP;
END;
$$;
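Before cutting over to Deploy 2, a sanity check that the backfill actually converged — any surviving NULLs should be rows whose legacy value matches no plan (queries are a sketch of what we run, not the exact scripts):

```sql
-- How many rows did the backfill not reach?
SELECT COUNT(*) AS unmigrated
FROM users
WHERE plan_id IS NULL;

-- Survivors should all be legacy keys with no matching plans row
SELECT split_part(subscription_plan, '|', 1) AS legacy_key, COUNT(*)
FROM users
WHERE plan_id IS NULL
GROUP BY 1
ORDER BY 2 DESC;
```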

And the application code change for Deploy 2 — dual-write with read-from-new:

typescript
// Before: single column write
async function updateSubscription(userId: number, planKey: string): Promise<void> {
  await db.query(
    'UPDATE users SET subscription_plan = $1 WHERE id = $2',
    [planKey, userId]
  );
}

// Deploy 2: dual-write, read from new column
async function updateSubscription(userId: number, planId: number, planKey: string): Promise<void> {
  await db.query(
    `UPDATE users
     SET plan_id = $1,
         subscription_plan = $2   -- keep old column alive for Deploy 1 rollback
     WHERE id = $3`,
    [planId, planKey, userId]
  );
}

// Deploy 3: drop old column write (migration removes it)
async function updateSubscription(userId: number, planId: number): Promise<void> {
  await db.query(
    'UPDATE users SET plan_id = $1 WHERE id = $2',
    [planId, userId]
  );
}

We also fixed the worker retry policy. The issue wasn't retryLimit: 0 — the issue was why we needed it. We replaced the "no retry" protection with a proper idempotency key approach:

typescript
// Idempotency via unique constraint — safe to retry
await db.query(`
  INSERT INTO subscription_events (user_id, plan_id, event_type, idempotency_key, processed_at)
  VALUES ($1, $2, $3, $4, NOW())
  ON CONFLICT (idempotency_key) DO NOTHING
`, [userId, planId, eventType, idempotencyKey]);

// Now retries are safe — duplicates are ignored by the DB
// retryLimit: 3, retryBackoff: true (pg-boss exponential backoff)

The backfill for the 3,247 failed jobs ran over the following weekend. We replayed events from our message queue (SQS had 7-day retention), matched each to a user, and re-applied the subscription update with the new code. Total records recovered: 3,241 of 3,247. Six users had deleted their accounts in the interim and we credited them manually.
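The replay itself reduced to: collapse duplicate SQS deliveries per idempotency key, keep only the latest event per user, then re-apply with the new code. A sketch of the collapse step — the event shape here is illustrative, not our real message schema:

```typescript
// Sketch: collapse replayed queue events before re-applying them.
// Event shape is illustrative; occurredAt is epoch millis.
interface SubscriptionEvent {
  userId: number;
  planId: number;
  idempotencyKey: string;
  occurredAt: number;
}

function collapseForReplay(events: SubscriptionEvent[]): SubscriptionEvent[] {
  // 1. Drop duplicate deliveries (SQS is at-least-once).
  const byKey = new Map<string, SubscriptionEvent>();
  for (const e of events) {
    if (!byKey.has(e.idempotencyKey)) byKey.set(e.idempotencyKey, e);
  }
  // 2. Keep only the latest event per user — earlier ones are superseded.
  const latestByUser = new Map<number, SubscriptionEvent>();
  for (const e of byKey.values()) {
    const prev = latestByUser.get(e.userId);
    if (!prev || e.occurredAt > prev.occurredAt) latestByUser.set(e.userId, e);
  }
  return [...latestByUser.values()];
}
```

Re-applying the collapsed set through the idempotency-keyed insert above is then safe to run (and re-run) without double-processing anyone.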


Lessons Learned

1. Blue-green deployments do not protect you from bad migrations. They protect you from bad application code. If your migration is not backward-compatible with the previous version of your application, blue-green amplifies the blast radius — the old containers run longer and generate more failures before the shift completes.

2. Column renames are always breaking changes. There is no safe way to rename a column in a zero-downtime system without going through Expand-Contract. The rename must be decomposed into: add new column → migrate data → shift reads → shift writes → drop old column. Minimum 3 deploys.

3. Silent failures are more dangerous than loud ones. Our system returned 200s to users while writing nothing to the database. A loud failure (500 to the user) would have triggered immediate investigation. Silence let the window grow to 11 minutes. Invest heavily in async job failure alerting — not just API error rates.

4. retryLimit: 0 on background jobs is almost always wrong. If you need idempotency (you do), implement it properly with a unique constraint or idempotency key — not by disabling retries. A job that can't retry is a job that will silently corrupt your data the first time anything transient goes wrong.

5. Migration linting should be part of CI. We now run migration safety checks in our CI pipeline that flag: renames, NOT NULL additions without defaults, index creation without CONCURRENTLY, and column type changes. The pipeline blocks the PR if any migration is flagged as potentially breaking. Total CI addition: 4 lines of config, 0 runtime overhead.
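The lint itself is a pass over each migration file for the patterns we now treat as breaking. A minimal sketch of the idea — our real rule set is longer and lives in the pipeline config, and these regexes are simplified for illustration:

```typescript
// Sketch of the migration lint: flag DDL that breaks N-1 compatibility.
// Simplified patterns; the real check has more rules and fewer false
// positives (e.g. it allows SET NOT NULL when a DEFAULT exists).
const BREAKING_PATTERNS: Array<[RegExp, string]> = [
  [/\bRENAME\s+COLUMN\b/i, 'column rename — decompose via expand-contract'],
  [/\bDROP\s+COLUMN\b/i, 'column drop — only safe in a contract deploy'],
  [/\bSET\s+NOT\s+NULL\b/i, 'NOT NULL addition — N-1 code may write NULLs'],
  [/\bCREATE\s+INDEX\s+(?!CONCURRENTLY)/i, 'index without CONCURRENTLY — blocks writes'],
];

function lintMigration(sql: string): string[] {
  return BREAKING_PATTERNS
    .filter(([pattern]) => pattern.test(sql))
    .map(([, message]) => message);
}
```

Run against the original 001_subscription_refactor.sql, this flags both the rename and the NOT NULL addition — either one would have blocked the PR that caused this incident.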

The total cost of this incident: 11 minutes of data corruption, 3,247 failed job records, 3 days of engineer time for investigation and backfill, and one very uncomfortable all-hands retro. The fix cost us 3 extra deploys. That's the deal.
