The AI-Generated Migration That Dropped the Wrong Index and Took Our API from 3ms to 45 Seconds
Friday 4:47 PM. Migration deployed. Slack goes quiet. Weekend plans confirmed. Then Monday 9:03 AM — Datadog fires every alert simultaneously. API p99 latency: 45 seconds. Error rate: 62%. We had deployed what looked like a routine index cleanup using an AI assistant. We had let it drop the one index that kept our most-used query alive.
This is the story of 3.5 hours of debugging, an 8-minute fix, and the lesson about what "redundant index" actually means when an AI is the one deciding.
Production Failure
Monday morning looked like a DDoS from the outside. Every dashboard was red.
Our GET /orders endpoint — the most-called route in our entire API, hit
roughly 4,200 times per minute at peak — had gone from a 3ms average response time to
hanging for 30–45 seconds before timing out. 62% of requests were returning 504s.
On-call engineer got paged at 9:03 AM. I was on a call by 9:11 AM.
The first assumption: traffic spike. We checked. Monday morning traffic was actually lighter than usual — 3,100 RPM instead of the normal 4,200. The problem wasn't volume. Something was wrong with the query itself.
False Assumptions
We spent the first 45 minutes looking in entirely the wrong direction. Here's what we assumed, in order:
- App-level regression: We reviewed every commit merged since Friday. No changes to the orders endpoint handler, no ORM changes, no connection pool config touched. Dead end.
- Connection pool exhaustion: We'd seen this before — connections pile up when queries are slow. We checked pg_stat_activity. 147 active connections, all waiting. But that was a symptom, not the cause.
- Autoscaling lag: Maybe ECS hadn't scaled fast enough for Monday morning traffic. We manually bumped the task count from 6 to 12. Latency stayed at 45 seconds. Traffic hadn't caused this.
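For reference, the pg_stat_activity check was along these lines (a sketch, not the exact query we typed; the wait_event columns exist in PostgreSQL 10+):

```sql
-- Count backends by state and what they're waiting on.
SELECT state,
       wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state, wait_event_type, wait_event
ORDER BY backends DESC;
```

A wall of active backends all waiting on the same thing is how "slow query" and "pool exhaustion" look identical from the application side.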
At 10:02 AM — nearly an hour in — we finally looked at the database itself.
The Investigation
pg_stat_statements told the story in under 30 seconds. One query was
responsible for 94.3% of total database time:
SELECT *
FROM orders
WHERE user_id = $1
AND status != 'deleted'
ORDER BY created_at DESC
LIMIT 20;
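The numbers below came out of pg_stat_statements with a query along these lines (a sketch; assumes the extension is enabled and the PostgreSQL 13+ column names such as total_exec_time):

```sql
-- Which statements dominate total database time?
SELECT query,
       calls,
       mean_exec_time,
       total_exec_time,
       100.0 * total_exec_time / sum(total_exec_time) OVER () AS pct_of_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```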
Mean execution time: 38,412ms. Calls in the last hour: 186,000.
Total time attributed: over 7 billion milliseconds. A query that used
to take 3ms was now taking 38 seconds. We ran EXPLAIN ANALYZE:
Seq Scan on orders (cost=0.00..312,847.23 rows=18 width=412)
(actual time=0.041..38312.847 rows=18 loops=1)
Filter: ((user_id = 10482) AND ((status)::text <> 'deleted'::text))
Rows Removed by Filter: 2,299,982
Planning Time: 0.312 ms
Execution Time: 38,312.914 ms
Sequential scan. All 2.3 million rows. For a query that should have used an index and returned 18 rows in under 5ms. The index was gone.
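Listing what remained on the table confirmed it. psql's \d orders works, or a catalog query like this sketch:

```sql
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'orders'
ORDER BY indexname;
-- idx_orders_user_id_ca was nowhere in the output.
```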
BEFORE vs AFTER THE MIGRATION
─────────────────────────────────────────────────────────────────────
BEFORE (Friday 4:46 PM)
┌─────────────────────────────────────────────┐
│ orders table (2.3M rows)                    │
│                                             │
│ Indexes:                                    │
│ ✓ idx_orders_pkey (id)                      │
│ ✓ idx_orders_user_id (user_id)              │ ← AI called this "redundant"
│ ✓ idx_orders_user_id_ca (user_id,           │ ← AI DROPPED THIS ONE
│                          status,            │
│                          created_at)        │
│ ✓ idx_orders_status (status)                │
└─────────────────────────────────────────────┘
Query planner uses: idx_orders_user_id_ca
Execution time: 3ms

AFTER (Friday 4:47 PM — migration applied)
┌─────────────────────────────────────────────┐
│ orders table (2.3M rows)                    │
│                                             │
│ Indexes:                                    │
│ ✓ idx_orders_pkey (id)                      │
│ ✓ idx_orders_user_id (user_id)              │
│ ✗ idx_orders_user_id_ca   DROPPED           │
│ ✓ idx_orders_status (status)                │
└─────────────────────────────────────────────┘
Query planner uses: sequential scan (2.3M rows)
Execution time: 38,312ms
Root Cause
We found the Friday migration commit. The message read: "chore: remove redundant indexes on orders table (AI suggested)". We opened the file.
-- Migration: 20260314_cleanup_redundant_indexes.sql
-- Generated with Cursor AI, reviewed by developer
-- Remove redundant single-column index on user_id
-- (superseded by composite index idx_orders_user_id_ca)
DROP INDEX CONCURRENTLY idx_orders_user_id;
-- Remove composite index — redundant now that we have
-- idx_orders_user_id covering the user_id lookups
DROP INDEX CONCURRENTLY idx_orders_user_id_ca;
Read that twice. The AI dropped both indexes — the single-column one first
(reasonable), then the composite one it had just finished calling a replacement for
(catastrophic). The reasoning in the comment was circular and wrong: it said the composite
index was "redundant" because the single-column index covered user_id lookups.
But the composite index existed precisely to serve the WHERE user_id = $1 AND status != 'deleted' ORDER BY created_at DESC
query — the filter and sort were what made it valuable, not just the leading column.
The AI had seen two indexes both starting with user_id, decided one made
the other redundant, then confused itself about which was the redundant one.
The developer reviewed the migration, saw that dropping both seemed logical given the
comment, approved it, and shipped it on a Friday afternoon.
Here's what the query planner was actually using the composite index for:
-- The query (simplified)
SELECT * FROM orders
WHERE user_id = 10482 -- uses leading column of composite index
AND status != 'deleted' -- filtered in index scan, avoids heap fetch
ORDER BY created_at DESC -- ORDER satisfied by index — no sort step
LIMIT 20;
-- With idx_orders_user_id_ca (user_id, status, created_at):
-- Index scan → 20 rows, 3ms
-- Postgres walks the index in reverse created_at order,
-- stops at LIMIT 20 without touching the full table.
-- With only idx_orders_user_id (user_id only):
-- Index scan → 47,000 rows for this user_id
-- Then filter status != 'deleted' in memory
-- Then sort by created_at in memory
-- In practice: planner chose seq scan as cheaper
-- Without either index:
-- Full sequential scan → 2,300,000 rows → 38 seconds
The Fix
Once we understood the cause, the fix was 8 minutes of reindexing:
CREATE INDEX CONCURRENTLY idx_orders_user_id_ca
ON orders (user_id, status, created_at DESC);
CONCURRENTLY meant the table stayed online during the build.
The index finished at 10:47 AM — 8 minutes and 12 seconds after we ran the command.
The moment it completed, EXPLAIN ANALYZE flipped back to an index scan,
mean query time dropped from 38,412ms to 2.8ms, and error rate fell from 62% to 0.1%
within 90 seconds as the connection queue drained.
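For anyone reproducing this under pressure: on PostgreSQL 12+ a concurrent index build can be watched instead of guessed at (illustrative; we mostly just stared at the clock):

```sql
-- Progress of any in-flight CREATE INDEX [CONCURRENTLY].
SELECT phase,
       blocks_done,
       blocks_total,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct_done
FROM pg_stat_progress_create_index;
```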
Total downtime: 1 hour 44 minutes. Revenue impact: we run a SaaS with hourly billing cycles — we issued credits to 847 customers affected during the window. The conversation about Friday deployments started that afternoon.
Why the Developer Approved It
I want to be clear: this wasn't negligence. The developer who approved the migration
was experienced. The problem was that the AI's comment sounded authoritative and its
reasoning was internally consistent — just wrong. When you're doing a "routine cleanup"
migration, you don't open pgAdmin and run EXPLAIN ANALYZE
on every query that might use the index you're dropping. You read the comment, it makes
sense, you ship it.
We checked pg_stat_user_indexes after the fact. The composite index had
been scanned 47 million times in the 30 days before it was dropped.
The single-column index had been scanned zero times in the same period
— because the query planner always preferred the composite. If we had run that check
before the migration, we would have immediately seen that the "redundant" composite index
was our most-used index on the table.
WHAT pg_stat_user_indexes WOULD HAVE SHOWN
─────────────────────────────────────────────────────────────────────
indexrelname               | idx_scans  | idx_tup_read | last_used
───────────────────────────┼────────────┼──────────────┼────────────
idx_orders_user_id_ca      | 47,213,884 | 941,877,680  | 2026-03-13
idx_orders_user_id         |          0 |            0 | never
idx_orders_status          |  2,841,022 |  56,820,440  | 2026-03-13
idx_orders_pkey            |  8,104,991 |   8,104,991  | 2026-03-14

The "redundant" composite index: 47M scans in 30 days.
The "useful" single-column index: 0 scans ever.
The AI had it exactly backwards.
Lessons
This incident changed our process for schema changes in four concrete ways:
1. Never drop an index without checking pg_stat_user_indexes first. One query tells you everything: how many times each index has been scanned since the statistics were last reset. If the number is non-zero, the index is in use. If it is zero, it might be safe to drop — after checking whether it enforces a constraint (unique indexes guard writes even at zero scans) or serves rare-but-critical query patterns like month-end reports. This check takes 10 seconds and would have prevented this entire outage.
SELECT
indexrelname,
idx_scan AS scans_last_reset,
idx_tup_read AS tuples_read,
idx_tup_fetch AS tuples_fetched,
pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'orders'
ORDER BY idx_scan DESC;
2. AI-generated schema migrations require human EXPLAIN ANALYZE review.
Not a comment review. Not a logic review. Actual query plan verification against production
data. We now require any migration that drops an index to include a before/after
EXPLAIN ANALYZE run in the PR description, captured against a production
replica. If the plan changes from an index scan to a sequential scan, the migration
does not merge.
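In practice the PR requirement looks roughly like this (a hedged sketch; it must run against a disposable clone of production data — a read-only replica can't execute the DROP, and the query literals here are from this incident):

```sql
-- 1. Capture the "before" plan and paste it into the PR.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE user_id = 10482 AND status != 'deleted'
ORDER BY created_at DESC
LIMIT 20;

-- 2. Drop the index inside a transaction and capture the "after" plan.
--    (Plain DROP INDEX here — CONCURRENTLY can't run inside a transaction.)
BEGIN;
DROP INDEX idx_orders_user_id_ca;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE user_id = 10482 AND status != 'deleted'
ORDER BY created_at DESC
LIMIT 20;
ROLLBACK;  -- the clone keeps its index; only the plans leave this session
```

If the second plan says Seq Scan where the first said Index Scan, the migration stops there.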
3. AI tools optimise for local coherence, not system-wide correctness.
The AI saw two indexes on user_id and applied a rule that seemed locally
correct: "the more specific index supersedes the less specific one." It didn't know that
the query planner had spent months proving the opposite. It didn't read
pg_stat_user_indexes. It made a plausible inference from incomplete
information and wrote a confident comment to justify it. This is exactly how AI
assistants create silent catastrophes — the code compiles, the migration runs, the comment
makes sense, and nothing looks wrong until Monday morning.
4. Friday deployments of schema changes are banned. This one is obvious in hindsight and embarrassing to admit we needed the lesson. Schema migrations now go out Tuesday through Thursday, before 3 PM. Database changes need eyes on them for at least a few hours before the weekend.
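Lesson 1 eventually became a guard we paste at the top of any index-dropping migration — an illustrative sketch, not battle-tested tooling; the index name is the one from this incident:

```sql
-- Abort the migration if the index about to be dropped is actually in use.
DO $$
DECLARE
  scans bigint;
BEGIN
  SELECT idx_scan INTO scans
  FROM pg_stat_user_indexes
  WHERE indexrelname = 'idx_orders_user_id_ca';

  IF scans IS NULL THEN
    RAISE NOTICE 'index not found; nothing to guard';
  ELSIF scans > 0 THEN
    RAISE EXCEPTION 'refusing to drop: index has % scans since stats reset', scans;
  END IF;
END $$;
```

A failed migration on a Tuesday afternoon is a far better outcome than a clean one on a Friday at 4:47 PM.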
The AI assistant didn't cause this outage. We did — by treating its output as an answer instead of a starting point. Index cleanup is exactly the kind of task where AI looks most convincing and has the least context to work with. The database knows your query patterns. The AI doesn't.