Our Blue-Green Deploy Corrupted 3,200 Records in 11 Minutes
It was supposed to be our safest deploy in months. Blue-green on ECS. Health checks before any traffic shift. Automatic rollback if anything looked wrong. We had rehearsed the runbook three times. At 2:47 PM on a Thursday, I clicked deploy and watched the green environment spin up cleanly. Eleven minutes later, we had corrupted data in production and 3,200 failed subscription writes we had no idea about — because the API was still returning 200s.
The culprit: a single SQL statement, ALTER TABLE users RENAME COLUMN. It executed in 83 milliseconds. The damage took 6 hours to fully scope and 3 days of backfill jobs to repair.
Production Failure
We were refactoring our subscription model. The column subscription_plan in our
users table had accumulated 3 years of technical debt — it stored plan name, plan tier,
and billing interval all jammed into a single VARCHAR field, pipe-delimited: pro|monthly|29.
The new architecture introduced a proper plans table with a foreign key plan_id INTEGER
on users.
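For context on what that one VARCHAR was carrying, here is a minimal sketch (interface and function names are hypothetical, not our actual code) of how the legacy value decomposed:

```typescript
// Hypothetical sketch: what the pipe-delimited legacy value encoded.
interface LegacyPlan {
  key: string;      // plan name, e.g. "pro"
  interval: string; // billing interval, e.g. "monthly"
  price: number;    // price in dollars, e.g. 29
}

function parseLegacyPlan(raw: string): LegacyPlan {
  const [key, interval, price] = raw.split("|");
  return { key, interval, price: Number(price) };
}
```

parseLegacyPlan("pro|monthly|29") yields the key "pro", the interval "monthly", and the price 29; the migration's split_part(u.subscription_plan, '|', 1) extracts that same first field in SQL.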
The migration file looked like this:
-- 001_subscription_refactor.sql
ALTER TABLE users ADD COLUMN plan_id INTEGER REFERENCES plans(id);
UPDATE users u
SET plan_id = p.id
FROM plans p
WHERE p.legacy_key = split_part(u.subscription_plan, '|', 1);
ALTER TABLE users RENAME COLUMN subscription_plan TO subscription_plan_deprecated;
ALTER TABLE users ALTER COLUMN plan_id SET NOT NULL;
I had added the RENAME step intentionally — to "keep the old column visible" during the deploy window in case we needed to roll back. What I hadn't thought through: our v1.14.2 containers
were still running UPDATE users SET subscription_plan = $1 WHERE id = $2 on every
subscription event. The rename happened in millisecond 83 of the migration. The old containers had
no idea the column was gone.
CloudWatch fired at 2:49 PM. 5xx error rate jumped from 0.08% to 31% on our
background job workers. The API layer — which read from a Redis cache — kept returning 200s.
Users saw nothing wrong. Our SLA dashboard stayed green. Meanwhile, 8 blue ECS tasks were hammering the database with writes that Postgres was rejecting and our workers were silently swallowing:
ERROR: column "subscription_plan" of relation "users" does not exist.
By 2:58 PM the green tasks passed health checks and traffic shifted. The 5xx rate dropped. Slack went quiet. The deploy looked successful. It wasn't until the next morning, when our analytics team flagged that subscription update events had a 100% failure rate between 14:49 and 15:00 UTC, that we understood what had happened.
False Assumptions
We had convinced ourselves that blue-green deployments were immune to migration problems. The mental model we were operating under:
WHAT WE THOUGHT HAPPENED
─────────────────────────────────────────────────────────
14:47 Migrations run against database
└─ Takes ~2 seconds
14:47 Green tasks start pulling new Docker image
└─ Old (blue) tasks still serving 100% traffic
14:58 Green tasks pass health checks
└─ Traffic shifts: 0% blue → 100% green
14:58 Blue tasks drain and stop
└─ Zero downtime. Zero data loss. ✓
WHAT ACTUALLY HAPPENED
─────────────────────────────────────────────────────────
14:47 Migrations run (column renamed in 83ms)
└─ subscription_plan column NO LONGER EXISTS
14:47 Blue tasks keep running with old code
└─ Writing to subscription_plan → ERROR (silent)
14:58 Green tasks take traffic (11 min gap)
└─ 3,200 writes rejected, cached reads hide it
Next day Analytics flags 100% failure rate in window ✗
The key false assumption: blue-green protects you from migration errors. It does not. Blue-green protects you from application code errors. If your migration is not backward-compatible with the version of code that was running one deployment ago, blue-green makes the problem worse — because the old containers run longer and generate more failures before you notice.
We also assumed our monitoring would catch write failures immediately. It didn't, because:
- The failures were in async background workers, not in the synchronous API path
- Workers swallowed the Postgres error, logged it, and marked the job as "failed" for retry — which never came
- Read endpoints hit Redis cache, so users got stale-but-valid data with a 200
- Our 5xx alarm had a 5-minute evaluation window — it fired, but we saw it as "transient" during deploy
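The first two bullets came down to the shape of the worker's error handling. A reconstruction (hypothetical names, not our actual code) of the pattern that kept the failures silent:

```typescript
// Reconstruction of the failure-swallowing worker handler (names hypothetical).
type Job = { id: string; data: { userId: number; planKey: string } };

// Stand-in for the real handler, which ran the UPDATE that started failing
// the moment the column was renamed.
async function processSubscriptionEvent(_userId: number, _planKey: string): Promise<void> {
  throw new Error('column "subscription_plan" of relation "users" does not exist');
}

async function handleJob(job: Job): Promise<"ok" | "failed"> {
  try {
    await processSubscriptionEvent(job.data.userId, job.data.planKey);
    return "ok";
  } catch (err) {
    // Logged, marked failed, never retried (retryLimit: 0), never alerted.
    console.error("subscription job failed", job.id, err);
    return "failed";
  }
}
```

Nothing in this path produces a 5xx on the synchronous API, so every signal ends up in a log line and a 'failed' row that no alarm was reading.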
Investigation
The analytics team's Slack message arrived at 9:14 AM the next day: "Subscription update events are missing for a ~11-minute window yesterday afternoon. Looks like 14:49–15:00 UTC. Did we have a deploy?"
First stop: the Postgres logs on RDS. We filtered for errors during that window:
SELECT query, error_severity, message, query_start
FROM pg_log_entries
WHERE query_start BETWEEN '2026-03-16 14:48:00' AND '2026-03-16 15:01:00'
AND error_severity = 'ERROR'
ORDER BY query_start;
The result was unambiguous: 3,247 rows, all with the same message:
ERROR: column "subscription_plan" of relation "users" does not exist
LINE 1: UPDATE users SET subscription_plan = $1 WHERE id = $2
^
CONTEXT: PL/pgSQL function process_subscription_event(integer,text)
Every single one originated from our background job queue, specifically the
process_subscription_event function. The errors started at 14:49:03 UTC — when the first subscription event after the rename was processed — and stopped at 15:00:12 UTC, when the last blue ECS task received its SIGTERM and drained.
Cross-referencing with our job queue (Postgres-backed with pgboss):
SELECT COUNT(*), state
FROM pgboss.job
WHERE name = 'subscription-update'
AND created_on BETWEEN '2026-03-16 14:48:00' AND '2026-03-16 15:01:00'
GROUP BY state;
-- state | count
-- ----------+-------
-- failed | 3247
-- completed | 112
3,247 failed jobs. All marked as permanently failed because our retry policy had
retryLimit: 0 for subscription events (to prevent double-processing). We had
built idempotency protection into the system and accidentally made failures unrecoverable.
Root Cause
The root cause was deploying a non-backward-compatible migration alongside application code, in a system where the old version of the application would continue running for up to 15 minutes after the migration executed.
A column rename breaks backward compatibility in both directions:
- Old code → new schema: writes to subscription_plan fail (the column no longer exists)
- New code → old schema: reads from plan_id fail (the column doesn't exist yet)
The rename is a breaking change. Full stop. Blue-green deployments require a window where both old and new code can run simultaneously against the same database. That means every migration must be compatible with the N-1 version of the application. Renames are never compatible.
The secondary cause: our worker queue had retryLimit: 0 specifically because we
were afraid of duplicate subscription processing. Reasonable protection. But it meant that
any transient error during a deploy window was permanently fatal — with no alerting.
The Fix
We rebuilt our migration strategy around the Expand-Contract pattern (sometimes called Parallel Change). Every schema change now happens in three separate deploys:
EXPAND-CONTRACT MIGRATION PATTERN
─────────────────────────────────────────────────────────
DEPLOY 1 — Expand (backward-compatible addition)
├─ Migration: ADD COLUMN plan_id INTEGER (nullable)
├─ Migration: Backfill plan_id from subscription_plan
└─ App code: Writes to BOTH columns. Reads from old column.
DEPLOY 2 — Migrate (shift reads to new column)
├─ No migration needed
└─ App code: Writes to BOTH columns. Reads from new column.
   └─ Old code still writing to old column = safe
DEPLOY 3 — Contract (remove old column)
├─ Migration: DROP COLUMN subscription_plan
└─ App code: Writes to plan_id only. Reads from plan_id.
Result: Zero downtime. Zero data loss. Fully reversible at each step.
─────────────────────────────────────────────────────────
The updated migration for Deploy 1:
-- Deploy 1: Expand only — always backward-compatible
ALTER TABLE users ADD COLUMN plan_id INTEGER REFERENCES plans(id);
-- Backfill existing rows in id-range batches to keep each UPDATE small
DO $$
DECLARE
  batch_size INT := 1000;
  last_id INT := 0;
  max_id INT;
BEGIN
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM users;
  WHILE last_id < max_id LOOP
    UPDATE users u
    SET plan_id = p.id
    FROM plans p
    WHERE p.legacy_key = split_part(u.subscription_plan, '|', 1)
      AND u.plan_id IS NULL
      AND u.id > last_id
      AND u.id <= last_id + batch_size;
    last_id := last_id + batch_size;
    PERFORM pg_sleep(0.05); -- breathe between batches
  END LOOP;
END;
$$;
-- Caveat: a DO block runs in a single transaction, so row locks are held until
-- it finishes. For very large tables, run the batches from a script (or a
-- procedure with COMMIT per batch) so locks are released between batches.
And the application code change for Deploy 2 — dual-write with read-from-new:
// Before: single column write
async function updateSubscription(userId: number, planKey: string): Promise<void> {
await db.query(
'UPDATE users SET subscription_plan = $1 WHERE id = $2',
[planKey, userId]
);
}
// Deploy 2: dual-write, read from new column
async function updateSubscription(userId: number, planId: number, planKey: string): Promise<void> {
await db.query(
`UPDATE users
SET plan_id = $1,
subscription_plan = $2 -- keep old column alive for Deploy 1 rollback
WHERE id = $3`,
[planId, planKey, userId]
);
}
// Deploy 3: drop old column write (migration removes it)
async function updateSubscription(userId: number, planId: number): Promise<void> {
await db.query(
'UPDATE users SET plan_id = $1 WHERE id = $2',
[planId, userId]
);
}
We also fixed the worker retry policy. The issue wasn't retryLimit: 0 — the issue was
why we needed it. We replaced the "no retry" protection with a proper idempotency key
approach:
// Idempotency via unique constraint — safe to retry
await db.query(`
INSERT INTO subscription_events (user_id, plan_id, event_type, idempotency_key, processed_at)
VALUES ($1, $2, $3, $4, NOW())
ON CONFLICT (idempotency_key) DO NOTHING
`, [userId, planId, eventType, idempotencyKey]);
// Now retries are safe — duplicates are ignored by the DB
// retryLimit: 3, retryDelay: exponential
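The key itself can be derived deterministically from the upstream event, so a retried job always produces the same key. A sketch (field names hypothetical):

```typescript
import { createHash } from "crypto";

// Hypothetical: derive a stable idempotency key from the upstream event ID,
// so a retried job produces the same key and hits ON CONFLICT DO NOTHING.
function idempotencyKey(userId: number, eventType: string, sourceEventId: string): string {
  return createHash("sha256")
    .update(`${userId}:${eventType}:${sourceEventId}`)
    .digest("hex");
}
```

Two retries of the same SQS message produce identical keys and collapse into one row; a genuinely new event produces a different key and inserts normally.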
The backfill for the 3,247 failed jobs ran over the following weekend. We replayed events from our message queue (SQS had 7-day retention), matched each to a user, and re-applied the subscription update with the new code. Total records recovered: 3,241 of 3,247. Six users had deleted their accounts in the interim and we credited them manually.
Lessons Learned
1. Blue-green deployments do not protect you from bad migrations. They protect you from bad application code. If your migration is not backward-compatible with the previous version of your application, blue-green amplifies the blast radius — the old containers run longer and generate more failures before the shift completes.
2. Column renames are always breaking changes. There is no safe way to rename a column in a zero-downtime system without going through Expand-Contract. The rename must be decomposed into: add new column → migrate data → shift reads → shift writes → drop old column. Minimum 3 deploys.
3. Silent failures are more dangerous than loud ones. Our system returned 200s to users while writing nothing to the database. A loud failure (500 to the user) would have triggered immediate investigation. Silence let the window grow to 11 minutes. Invest heavily in async job failure alerting — not just API error rates.
4. retryLimit: 0 on background jobs is almost always wrong.
If you need idempotency (you do), implement it properly with a unique constraint or idempotency
key — not by disabling retries. A job that can't retry is a job that will silently corrupt
your data the first time anything transient goes wrong.
5. Migration linting should be part of CI. We now run migration safety checks in our CI pipeline that flag: renames, NOT NULL additions without defaults, index creation without CONCURRENTLY, and column type changes. The pipeline blocks the PR if any migration is flagged as potentially breaking. Total CI addition: 4 lines of config, 0 runtime overhead.
The total cost of this incident: 11 minutes of data corruption, 3,247 failed job records, 3 days of engineer time for investigation and backfill, and one very uncomfortable all-hands retro. The fix cost us 3 extra deploys. That's the deal.