The AI-Generated Migration That Dropped the Wrong Index and Took Our API from 3ms to 45 Seconds
March 17, 2026 · AI · 10 min read

Friday 4:47 PM. Migration deployed. Slack goes quiet. Weekend plans confirmed. Then Monday 9:03 AM — Datadog fires every alert simultaneously. API p99 latency: 45 seconds. Error rate: 62%. We had deployed what looked like a routine index cleanup using an AI assistant. We had let it drop the one index that kept our most-used query alive.

This is the story of 3.5 hours of debugging, an 8-minute fix, and the lesson about what "redundant index" actually means when an AI is the one deciding.

Production Failure

Monday morning looked like a DDoS from the outside. Every dashboard was red. Our GET /orders endpoint — the most-called route in our entire API, hit roughly 4,200 times per minute at peak — had gone from a 3ms average response time to hanging for 30–45 seconds before timing out. 62% of requests were returning 504s. The on-call engineer was paged at 9:03 AM. I was on a call by 9:11 AM.

The first assumption: traffic spike. We checked. Monday morning traffic was actually lighter than usual — 3,100 RPM instead of the normal 4,200. The problem wasn't volume. Something was wrong with the query itself.

False Assumptions

We spent the first 45 minutes looking in entirely the wrong direction. Here's what we assumed, in order:

  • App-level regression: We reviewed every commit merged since Friday. No changes to the orders endpoint handler, no ORM changes, no connection pool config touched. Dead end.
  • Connection pool exhaustion: We'd seen this before — connections pile up when queries are slow. We checked pg_stat_activity. 147 active connections, all waiting. But that was a symptom, not the cause.
  • Autoscaling lag: Maybe ECS hadn't scaled fast enough for Monday morning traffic. We manually bumped the task count from 6 to 12. Latency stayed at 45 seconds. Traffic hadn't caused this.
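The connection pool check from the second assumption above can be reproduced with a query along these lines — a sketch, assuming Postgres 10+ where the `wait_event_type` column replaced the old `waiting` boolean:

sql — sketch: summarise what active backends are waiting on
SELECT state,
       wait_event_type,
       count(*) AS sessions
FROM   pg_stat_activity
WHERE  datname = current_database()
GROUP  BY state, wait_event_type
ORDER  BY sessions DESC;

A pile of active sessions all in the same wait state — as we saw with our 147 connections — usually means the bottleneck is downstream of the pool, not the pool itself.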

At 10:02 AM — nearly an hour in — we finally looked at the database itself.

The Investigation

pg_stat_statements told the story in under 30 seconds. One query was responsible for 94.3% of total database time:

sql
SELECT *
FROM   orders
WHERE  user_id = $1
  AND  status != 'deleted'
ORDER  BY created_at DESC
LIMIT  20;

Mean execution time: 38,412ms. Calls in the last hour: 186,000. Total time attributed: over 7 billion milliseconds. A query that used to take 3ms was now taking 38 seconds. We ran EXPLAIN ANALYZE:

sql — EXPLAIN ANALYZE output (before fix)
Seq Scan on orders  (cost=0.00..312,847.23 rows=18 width=412)
                    (actual time=0.041..38312.847 rows=18 loops=1)
  Filter: ((user_id = 10482) AND ((status)::text <> 'deleted'::text))
  Rows Removed by Filter: 2,299,982
Planning Time: 0.312 ms
Execution Time: 38,312.914 ms

Sequential scan. All 2.3 million rows. For a query that should have used an index and returned 18 rows in under 5ms. The index was gone.
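For reference, the pg_stat_statements lookup that surfaces this kind of culprit looks roughly like the following — a sketch, assuming the extension is installed and using the `total_exec_time` / `mean_exec_time` column names from Postgres 13+:

sql — sketch: top queries by share of total database time
SELECT substring(query, 1, 60)           AS query_start,
       calls,
       round(mean_exec_time::numeric, 1) AS mean_ms,
       round((100 * total_exec_time /
              sum(total_exec_time) OVER ())::numeric, 1) AS pct_total
FROM   pg_stat_statements
ORDER  BY total_exec_time DESC
LIMIT  5;

One query sitting at 90%+ of total time, as ours was, makes the diagnosis unambiguous before you even open EXPLAIN.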

  BEFORE vs AFTER THE MIGRATION
  ─────────────────────────────────────────────────────────────────────

  BEFORE (Friday 4:46 PM)
  ┌─────────────────────────────────────────────┐
  │  orders table (2.3M rows)                   │
  │                                             │
  │  Indexes:                                   │
  │  ✓ idx_orders_pkey          (id)            │
  │  ✓ idx_orders_user_id       (user_id)       │  ← AI called this "redundant"
  │  ✓ idx_orders_user_id_ca    (user_id,       │  ← AI DROPPED THIS ONE
  │                              status,        │
  │                              created_at)    │
  │  ✓ idx_orders_status        (status)        │
  └─────────────────────────────────────────────┘

  Query planner uses:  idx_orders_user_id_ca
  Execution time:      3ms

  AFTER (Friday 4:47 PM — migration applied)
  ┌─────────────────────────────────────────────┐
  │  orders table (2.3M rows)                   │
  │                                             │
  │  Indexes:                                   │
  │  ✓ idx_orders_pkey          (id)            │
  │  ✓ idx_orders_user_id       (user_id)       │
  │  ✗ idx_orders_user_id_ca    DROPPED         │
  │  ✓ idx_orders_status        (status)        │
  └─────────────────────────────────────────────┘

  Query planner uses:  sequential scan (2.3M rows)
  Execution time:      38,312ms

Root Cause

We found the Friday migration commit. The message read: "chore: remove redundant indexes on orders table (AI suggested)". We opened the file.

sql — the migration that caused the outage
-- Migration: 20260314_cleanup_redundant_indexes.sql
-- Generated with Cursor AI, reviewed by developer

-- Remove redundant single-column index on user_id
-- (superseded by composite index idx_orders_user_id_ca)
DROP INDEX CONCURRENTLY idx_orders_user_id;

-- Remove composite index — redundant now that we have
-- idx_orders_user_id covering the user_id lookups
DROP INDEX CONCURRENTLY idx_orders_user_id_ca;

Read that twice. The AI dropped both indexes — the single-column one first (reasonable), then the composite one it had just finished calling a replacement for (catastrophic). The reasoning in the comment was circular and wrong: it said the composite index was "redundant" because the single-column index covered user_id lookups. But the composite index existed precisely to serve the WHERE user_id = $1 AND status != 'deleted' ORDER BY created_at DESC query — the filter and sort were what made it valuable, not just the leading column.

The AI had seen two indexes both starting with user_id, decided one made the other redundant, then confused itself about which was the redundant one. The developer reviewed the migration, saw that dropping both seemed logical given the comment, approved it, and shipped it on a Friday afternoon.

Here's what the query planner was actually using the composite index for:

sql — why the composite index was irreplaceable
-- The query (simplified)
SELECT * FROM orders
WHERE  user_id = 10482        -- uses leading column of composite index
  AND  status != 'deleted'    -- filtered in index scan, avoids heap fetch
ORDER  BY created_at DESC     -- ORDER satisfied by index — no sort step
LIMIT  20;

-- With idx_orders_user_id_ca (user_id, status, created_at):
-- Index scan → 20 rows, 3ms
-- Postgres walks the index in reverse created_at order,
-- stops at LIMIT 20 without touching the full table.

-- With only idx_orders_user_id (user_id only):
-- Index scan → 47,000 rows for this user_id
-- Then filter status != 'deleted' in memory
-- Then sort by created_at in memory
-- In practice: planner chose seq scan as cheaper

-- Without either index:
-- Full sequential scan → 2,300,000 rows → 38 seconds

The Fix

Once we understood the cause, the fix was 8 minutes of reindexing:

sql
CREATE INDEX CONCURRENTLY idx_orders_user_id_ca
ON orders (user_id, status, created_at DESC);

CONCURRENTLY meant the table stayed online during the build. The index finished at 10:47 AM — 8 minutes and 12 seconds after we ran the command. The moment it completed, EXPLAIN ANALYZE flipped back to an index scan, mean query time dropped from 38,412ms to 2.8ms, and error rate fell from 62% to 0.1% within 90 seconds as the connection queue drained.
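While the index built, progress can be watched from another session — a sketch, assuming Postgres 12+ where the `pg_stat_progress_create_index` view exists:

sql — sketch: watch a running CREATE INDEX CONCURRENTLY
SELECT phase,
       blocks_done,
       blocks_total,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct
FROM   pg_stat_progress_create_index;

During an outage, being able to tell the incident channel "index build is 60% done" is worth a lot more than refreshing a latency graph.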

Total downtime: 1 hour 44 minutes. Revenue impact: we run a SaaS with hourly billing cycles — we issued credits to 847 customers affected during the window. The conversation about Friday deployments started that afternoon.

Why the Developer Approved It

I want to be clear: this wasn't negligence. The developer who approved the migration was experienced. The problem was that the AI's comment sounded authoritative and its reasoning was internally consistent — just wrong. When you're doing a "routine cleanup" migration, you don't open pgAdmin and run EXPLAIN ANALYZE on every query that might use the index you're dropping. You read the comment, it makes sense, you ship it.

We checked pg_stat_user_indexes after the fact. The composite index had been scanned 47 million times in the 30 days before it was dropped. The single-column index had been scanned zero times in the same period — because the query planner always preferred the composite. If we had run that check before the migration, we would have immediately seen that the "redundant" composite index was our most-used index on the table.

  WHAT pg_stat_user_indexes WOULD HAVE SHOWN
  ─────────────────────────────────────────────────────────────────────

  indexrelname               | idx_scans | idx_tup_read | last_used
  ───────────────────────────┼───────────┼──────────────┼────────────
  idx_orders_user_id_ca      | 47,213,884| 941,877,680  | 2026-03-13
  idx_orders_user_id         |         0 |            0 | never
  idx_orders_status          | 2,841,022 | 56,820,440   | 2026-03-13
  idx_orders_pkey            | 8,104,991 | 8,104,991    | 2026-03-14

  The "redundant" composite index: 47M scans in 30 days.
  The "useful" single-column index: 0 scans ever.

  The AI had it exactly backwards.

Lessons

This incident changed our process for schema changes in four concrete ways:

1. Never drop an index without checking pg_stat_user_indexes first. One query tells you everything: how many times each index has been scanned since the stats were last reset (30 days, in our case). If a number is non-zero, the index is in use. If it is zero, it might be safe to drop — after confirming it doesn't back a unique or exclusion constraint and isn't reserved for infrequent patterns like backups or month-end reporting. This check takes 10 seconds and would have prevented this entire outage.

sql — mandatory pre-drop check
SELECT
    indexrelname,
    idx_scan          AS scans_last_reset,
    idx_tup_read      AS tuples_read,
    idx_tup_fetch     AS tuples_fetched,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'orders'
ORDER  BY idx_scan DESC;

2. AI-generated schema migrations require human EXPLAIN ANALYZE review. Not a comment review. Not a logic review. Actual query plan verification against production data. We now require any migration that drops an index to include a before/after EXPLAIN ANALYZE run in the PR description, captured against a production replica. If the plan changes from an index scan to a sequential scan, the migration does not merge.
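One safe way to capture that before/after plan is to drop the index inside a transaction on a production-sized copy and roll it back — a sketch, with the caveat that this uses a plain `DROP INDEX` (the `CONCURRENTLY` form cannot run inside a transaction), so it briefly takes an exclusive lock and must never be pointed at production:

sql — sketch: test an index drop without committing it
BEGIN;
-- "After" plan: what the hot query would do without the index
DROP INDEX idx_orders_user_id_ca;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE  user_id = 10482
  AND  status != 'deleted'
ORDER  BY created_at DESC
LIMIT  20;
-- Undo the drop — the index survives untouched
ROLLBACK;

If that EXPLAIN shows a sequential scan, the plan evidence goes in the PR and the drop doesn't merge.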

3. AI tools optimise for local coherence, not system-wide correctness. The AI saw two indexes on user_id and applied a rule that seemed locally correct: "the more specific index supersedes the less specific one." It didn't know that the query planner had spent months proving the opposite. It didn't read pg_stat_user_indexes. It made a plausible inference from incomplete information and wrote a confident comment to justify it. This is exactly how AI assists create silent catastrophes — the code compiles, the migration runs, the comment makes sense, and nothing looks wrong until Monday morning.

4. Friday deployments of schema changes are banned. This one is obvious in hindsight and embarrassing to admit we needed the lesson. Schema migrations now go out Tuesday through Thursday, before 3 PM. Database changes need eyes on them for at least a few hours before the weekend.

The AI assistant didn't cause this outage. We did — by treating its output as an answer instead of a starting point. Index cleanup is exactly the kind of task where AI looks most convincing and has the least context to work with. The database knows your query patterns. The AI doesn't.
