The Friday Deploy That Taught Me to Respect PostgreSQL
The deploy was on a Friday. It worked perfectly — fast, clean, zero errors. For three days, roughly 200 users clicked the new Monthly Summary dashboard and saw their sales data in under a second. Then a large enterprise client onboarded. They had 600,000 deals. At 11:14 AM on a Monday, they clicked the dashboard for the first time — and within three minutes, the entire platform was down.
The culprit was four lines of SQL. A JOIN that looked harmless on a small dataset, utterly catastrophic at scale. This is the story of that incident, what I found when I dug into PostgreSQL's internals, and the principles I now enforce religiously before any query ships to production.
The Feature
We were building a CRM for small businesses. The new feature was a "Monthly Summary" page — a dashboard showing each user's deal pipeline: how many deals were created, won, lost, and their total value, grouped by month and sales rep.
The query seemed perfectly reasonable:
SELECT
u.name AS rep_name,
DATE_TRUNC('month', d.created_at) AS month,
COUNT(*) AS total_deals,
COUNT(*) FILTER (WHERE d.status = 'won') AS won,
COUNT(*) FILTER (WHERE d.status = 'lost') AS lost,
SUM(d.value) AS pipeline_value
FROM deals d
JOIN users u ON u.id = d.assigned_to
JOIN companies c ON c.id = d.company_id
WHERE d.tenant_id = $1
GROUP BY u.name, DATE_TRUNC('month', d.created_at)
ORDER BY month DESC, pipeline_value DESC;
In development, our deals table had about 4,000 rows. The query ran in 12ms.
Beautiful. Shipping it.
Production had 8.3 million rows.
The Cascade
The feature had been live for three days without issue — because all existing customers were small, with a few thousand deals each. Then the enterprise client logged in. PostgreSQL started a sequential scan of 8.3 million rows to find their tenant's deals, then joined each one against the users table, then the companies table. All while our mobile app's aggressive retry logic turned a slow response into a storm.
11:14 AM Monday — Disaster Timeline
─────────────────────────────────────────────────────────────────────
11:14:02 Enterprise tenant (600K deals) hits /dashboard/summary
PostgreSQL begins seq scan: deals (8.3M rows)
CPU climbs from 12% → 34%
11:14:09 3 more enterprise users open the dashboard (auto-loaded on login)
4 concurrent long-running queries in flight
PostgreSQL autovacuum deferred — competing for I/O
11:14:31 Mobile app starts retrying timed-out requests
Retry storm begins. CPU: 89%
11:14:47 Connection pool exhausted (max_connections: 100)
New login attempts → "sorry, too many connections"
CPU: 100%
11:14:54 INSERT queries start timing out
New deal creation: FAILING across all tenants
11:15:06 Health check endpoint fails (it queries the DB)
Load balancer marks primary instance UNHEALTHY
Routes to second instance — which shares the same DB
11:17:00 Full service outage — all 40,000 users affected
PagerDuty fires 🚨
11:23:00 Rollback decision made. Rollback begins.
11:26:00 Service restored.
Damage: 12 minutes total downtime, ~8,000 active users affected.
What made it catastrophic wasn't just the slow query — it was the retry storm. Every failed request retried three times with exponential backoff. As the DB slowed, concurrent queries multiplied, making it worse. A textbook thundering herd.
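One database-side guardrail against this failure mode — not something we had at the time, so treat this as an illustrative sketch — is a statement timeout, so a pathological query fails fast instead of holding a connection for 44 seconds. The role name `app_user` here is hypothetical:

```sql
-- Cap query runtime for the application's role so runaway queries
-- fail fast instead of stacking up behind each other.
-- Values are illustrative; 'app_user' is a hypothetical role name.
ALTER ROLE app_user SET statement_timeout = '5s';

-- Or per session, e.g. from the connection pool's setup hook:
SET statement_timeout = '5s';
```

A timeout converts a slow page into a fast, explicit error — exactly the signal client-side circuit breakers need in order to trip instead of retrying.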
Diagnosing in Production
The rollback took three minutes. Service was restored. But we had a feature users now expected — we couldn't just silently never ship it. We needed to understand exactly what went wrong.
PostgreSQL has remarkable introspection tooling if you know where to look. First stop:
pg_stat_statements, which records cumulative query statistics across executions.
-- Find worst offenders by total time
SELECT
LEFT(query, 80) AS query_preview,
calls,
ROUND(total_exec_time::numeric, 2) AS total_ms,
ROUND(mean_exec_time::numeric, 2) AS mean_ms,
ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Check what's currently running (useful during live incidents)
SELECT
pid,
now() - query_start AS duration,
state,
wait_event_type,
wait_event,
LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;
The summary dashboard query was consuming 73% of total database time — from just 47 minutes of being live. Mean execution time: 44.3 seconds. Max observed: 3 minutes 12 seconds.
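One practical note: pg_stat_statements numbers are cumulative, so when you want to measure a specific window (say, right after a deploy), reset the counters first so old history doesn't dilute the averages. A sketch — this requires superuser or an explicit grant by default:

```sql
-- Clear cumulative query statistics so the next measurement
-- window starts clean. Requires appropriate privileges.
SELECT pg_stat_statements_reset();
```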
Next, EXPLAIN ANALYZE on the actual query with a production-sized tenant:
EXPLAIN ANALYZE output (abbreviated):
Gather (cost=2847923..2891023 rows=31200 width=72)
(actual time=44291..44312 rows=847 loops=1)
Workers Planned: 2 / Launched: 2
→ Hash Join
Hash Cond: (d.assigned_to = u.id)
→ Hash Join
Hash Cond: (d.company_id = c.id)
→ Seq Scan on deals d ← 🔴 THE PROBLEM
Filter: (tenant_id = 1047)
Rows Removed by Filter: 7,698,432
Actual time: 38,412ms ← 38 seconds just scanning!
→ Hash on companies (340K rows)
Actual time: 2,341ms
→ Hash on users
Actual time: 102ms
Planning Time: 12.3 ms
Execution Time: 44,312.6 ms ← 44 seconds. Per request.
KEY FINDINGS:
✗ Seq Scan on deals — 8.3M rows scanned, 7.7M discarded (93%!)
✗ No index on deals.tenant_id
✗ No index on deals.assigned_to
✗ No index on deals.created_at
✗ companies table also doing a full seq scan (340K rows)
No index on tenant_id. For a multi-tenant SaaS database. Every query filtering
by tenant was doing a full table scan and discarding 93% of rows. We'd gotten away with
it for two years because all our other queries were simple primary-key lookups.
The summary query was the first one that needed to scan a large slice of the table.
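You don't have to wait for an incident to find this class of problem. PostgreSQL's per-table statistics expose it directly: a large table with huge sequential-scan row counts and little index usage is a prime suspect. A sketch of the audit query:

```sql
-- Tables whose reads are dominated by sequential scans.
-- High seq_tup_read with low idx_scan on a big table usually
-- means a missing index.
SELECT relname,
       seq_scan,        -- sequential scans started on this table
       seq_tup_read,    -- rows read by those scans
       idx_scan,        -- index scans started on this table
       n_live_tup       -- approximate live row count
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```

Had we run this at any point in the previous two years, deals would have been at the top of the list.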
The Fix — And Why It's More Than "Just Add an Index"
The obvious fix was to add indexes. But we had 8.3 million rows in production.
A plain CREATE INDEX holds a lock that blocks all writes to the table for the
duration of the build — potentially 20+ minutes. Another outage. We needed a
zero-downtime strategy.
Migration Strategy — Zero Downtime Indexing
─────────────────────────────────────────────────────────────────────
WRONG:
CREATE INDEX idx_deals_tenant ON deals(tenant_id);
→ Takes a SHARE lock on the table for the whole build
→ Blocks all writes for 20–30 minutes (reads keep working)
→ Hello, second outage.
RIGHT:
CREATE INDEX CONCURRENTLY idx_deals_tenant ON deals(tenant_id);
→ Builds the index without blocking reads or writes
→ PostgreSQL scans the table twice — slower build, zero impact
→ Available since PostgreSQL 8.2
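One caveat worth knowing before you rely on CONCURRENTLY: if the build fails or is cancelled, PostgreSQL leaves behind an INVALID index that consumes space and slows writes while serving no queries. You can check for leftovers like this:

```sql
-- Find invalid indexes left behind by failed CONCURRENTLY builds.
-- Drop and rebuild them (DROP INDEX CONCURRENTLY also avoids
-- blocking writes).
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```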
Our execution order (Monday evening, after traffic died down):
Step 1: CREATE INDEX CONCURRENTLY idx_deals_tenant_created
ON deals(tenant_id, created_at DESC);
Build time: 31 minutes. Zero locks. Zero downtime.
Step 2: CREATE INDEX CONCURRENTLY idx_deals_tenant_status
ON deals(tenant_id, status, assigned_to);
Build time: 28 minutes.
Step 3: CREATE INDEX CONCURRENTLY idx_companies_tenant
ON companies(tenant_id);
Build time: 4 minutes. (Smaller table.)
Step 4: Rewrote dashboard to use a materialized view
refreshed every 15 minutes via pg_cron.
Step 5: Re-deployed the dashboard. 11ms query time. Ship it.
The composite index on (tenant_id, created_at DESC) was the key decision.
PostgreSQL uses it to satisfy the WHERE tenant_id = $1 filter with an index
scan instead of walking all 8.3 million rows, and it hands back that tenant's
deals newest-first. (Strictly speaking, the ORDER BY is on the expression
DATE_TRUNC('month', created_at), so the planner still sorts the aggregated
output — but that's hundreds of rows, not millions.)
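If you wanted the sort itself served by an index, an expression index matching the ORDER BY exactly would do it. This is a hypothetical sketch, not what we shipped — the materialized view made it unnecessary — and it assumes created_at is timestamp without time zone, since DATE_TRUNC over timestamptz is only STABLE, not IMMUTABLE, and so can't appear in an index expression directly:

```sql
-- Hypothetical: an expression index matching the ORDER BY exactly.
-- Note the extra parentheses required around the expression column.
CREATE INDEX CONCURRENTLY idx_deals_tenant_month
    ON deals (tenant_id, (DATE_TRUNC('month', created_at)) DESC);
```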
But indexing alone wasn't enough for our largest tenants. Even with the index, aggregating 600,000 rows per request was unacceptably slow during traffic spikes. The real fix was moving to a pre-aggregated materialized view:
-- Pre-aggregate deal stats into a materialized view
CREATE MATERIALIZED VIEW deal_monthly_summary AS
SELECT
d.tenant_id,
u.id AS user_id,
u.name AS rep_name,
DATE_TRUNC('month', d.created_at) AS month,
COUNT(*) AS total_deals,
COUNT(*) FILTER (WHERE d.status = 'won') AS won,
COUNT(*) FILTER (WHERE d.status = 'lost') AS lost,
SUM(d.value) AS pipeline_value
FROM deals d
JOIN users u ON u.id = d.assigned_to
GROUP BY
d.tenant_id,
u.id,
u.name,
DATE_TRUNC('month', d.created_at)
WITH DATA;
-- Fast tenant lookups on the view (tiny table — ~50K rows total)
CREATE UNIQUE INDEX ON deal_monthly_summary(tenant_id, user_id, month);
-- Refresh every 15 minutes via pg_cron (stale data is acceptable)
SELECT cron.schedule(
'refresh-deal-summary',
'*/15 * * * *',
'REFRESH MATERIALIZED VIEW CONCURRENTLY deal_monthly_summary;'
);
The dashboard now queries the materialized view — ~50,000 pre-aggregated rows instead of 8.3 million raw rows. For our largest enterprise tenant, query time dropped from 44.3 seconds to 11ms. A 4,000× improvement.
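The read path after the change looks roughly like this — a sketch, assuming the endpoint needs exactly the columns the view exposes:

```sql
-- Dashboard query against the pre-aggregated view.
-- The unique index on (tenant_id, user_id, month) serves the filter.
SELECT rep_name, month, total_deals, won, lost, pipeline_value
FROM deal_monthly_summary
WHERE tenant_id = $1
ORDER BY month DESC, pipeline_value DESC;
```

As a bonus, that unique index is also what makes REFRESH MATERIALIZED VIEW CONCURRENTLY possible — without a unique index, the refresh would lock out readers.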
What We Changed Forever
This incident permanently changed how we think about database work on the team. The process changes that stuck:
- Every new query gets EXPLAIN ANALYZE against a prod-sized dataset. We maintain a staging environment seeded with anonymised production data specifically for this. If you can't read and explain the query plan, the query doesn't ship.
- Multi-tenant tables always get composite indexes starting with tenant_id. This is now item one on our database schema checklist. The tenancy column comes first in every index. No exceptions.
- Aggregation queries over large datasets use materialized views. If you're GROUP BY-ing more than ~50K rows on any user-facing endpoint, you need a pre-aggregation strategy. Real-time feels nice; a 44-second query does not.
- All production index creation uses CONCURRENTLY. We have a migration linter that rejects any SQL file containing a bare CREATE INDEX without the CONCURRENTLY keyword.
- Mobile retry logic has circuit breakers. Exponential backoff with jitter, max 3 retries, and a circuit breaker that stops after 5 consecutive failures. Retry storms are preventable — they're a design choice, not an inevitability.
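The first rule in practice: run the plan with buffer counts against the staging dataset and actually read it before merging. For example, probing just the tenant filter (tenant 1047 is the one from the incident):

```sql
-- What we look for: an Index Scan rather than a Seq Scan, few rows
-- removed by the filter, and a sane shared-buffers read count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM deals
WHERE tenant_id = 1047;
```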
The Rule I Live By Now
"If your query works fine in development, that means nothing. Development has no data. Production is where gravity lives."
The hardest part of this incident wasn't the rollback, or the Slack thread, or writing the
apologetic status page update. It was knowing it was entirely preventable. One
EXPLAIN ANALYZE run against a realistic dataset would have caught the missing
index before it touched a single production user.
That check is now mandatory — on the PR checklist, in the CI pipeline for schema changes, in the onboarding docs for new engineers. Because a four-line SQL query that takes 44 seconds isn't a four-line problem. It's an 8.3 million row problem wearing a four-line disguise.
The database doesn't care how clean your code looks. It only cares about the data.
— Darshan