The Friday Deploy That Taught Me to Respect PostgreSQL
The deploy was on a Friday. It worked perfectly — fast, clean, zero errors. For three days, roughly 200 users clicked the new Monthly Summary dashboard and saw their sales data in under a second. Then a large enterprise client onboarded. They had 600,000 deals. At 11:14 AM on a Monday, they clicked the dashboard for the first time — and within three minutes, the entire platform was down.
The culprit was four lines of SQL. A JOIN that looked harmless on a small dataset, utterly catastrophic at scale. This is the story of that incident, what I found when I dug into PostgreSQL's internals, and the principles I now enforce religiously before any query ships to production.
The Feature
We were building a CRM for small businesses. The new feature was a "Monthly Summary" page — a dashboard showing each user's deal pipeline: how many deals were created, won, lost, and their total value, grouped by month and sales rep.
The query seemed perfectly reasonable:
SELECT
u.name AS rep_name,
DATE_TRUNC('month', d.created_at) AS month,
COUNT(*) AS total_deals,
COUNT(*) FILTER (WHERE d.status = 'won') AS won,
COUNT(*) FILTER (WHERE d.status = 'lost') AS lost,
SUM(d.value) AS pipeline_value
FROM deals d
JOIN users u ON u.id = d.assigned_to
JOIN companies c ON c.id = d.company_id
WHERE d.tenant_id = $1
GROUP BY u.name, DATE_TRUNC('month', d.created_at)
ORDER BY month DESC, pipeline_value DESC;
In development, our deals table had about 4,000 rows. The query ran in 12ms.
Beautiful. Shipping it.
Production had 8.3 million rows.
The Cascade
The feature had been live for three days without issue — because all existing customers were small, with a few thousand deals each. Then the enterprise client logged in. PostgreSQL started a sequential scan of 8.3 million rows to find their tenant's deals, then joined each one against the users table, then the companies table. All while our mobile app's aggressive retry logic turned a slow response into a storm.
11:14 AM Monday — Disaster Timeline
─────────────────────────────────────────────────────────────────────
11:14:02 Enterprise tenant (600K deals) hits /dashboard/summary
PostgreSQL begins seq scan: deals (8.3M rows)
CPU climbs from 12% → 34%
11:14:09 3 more enterprise users open the dashboard (auto-loaded on login)
4 concurrent long-running queries in flight
PostgreSQL autovacuum deferred — competing for I/O
11:14:31 Mobile app starts retrying timed-out requests
Retry storm begins. CPU: 89%
11:14:47 Connection pool exhausted (max_connections: 100)
New login attempts → "sorry, too many connections"
CPU: 100%
11:14:54 INSERT queries start timing out
New deal creation: FAILING across all tenants
11:15:06 Health check endpoint fails (it queries the DB)
Load balancer marks primary instance UNHEALTHY
Routes to second instance — which shares the same DB
11:17:00 Full service outage — all 40,000 users affected
PagerDuty fires 🚨
11:23:00 Rollback decision made. Rollback begins.
11:26:00 Service restored.
Damage: 12 minutes total downtime, ~8,000 active users affected.
What made it catastrophic wasn't just the slow query — it was the retry storm. Every failed request retried three times with exponential backoff. As the DB slowed, concurrent queries multiplied, making it worse. A textbook thundering herd.
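One database-side guardrail against this failure mode — not something we had at the time, so treat this as an illustrative sketch — is a statement timeout, so a pathological query fails fast instead of holding a connection for 44 seconds. The role name `app_user` here is hypothetical:

```sql
-- Cap query runtime for the application's role so runaway queries
-- fail fast instead of stacking up behind each other.
-- Values are illustrative; 'app_user' is a hypothetical role name.
ALTER ROLE app_user SET statement_timeout = '5s';

-- Or per session, e.g. from the connection pool's setup hook:
SET statement_timeout = '5s';
```

A timeout converts a slow page into a fast, explicit error — exactly the signal client-side circuit breakers need in order to trip instead of retrying.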
Diagnosing in Production
The rollback took three minutes. Service was restored. But we had a feature users now expected — we couldn't just silently never ship it. We needed to understand exactly what went wrong.
PostgreSQL has remarkable introspection tooling if you know where to look. First stop:
pg_stat_statements, which records cumulative query statistics across executions.
-- Find worst offenders by total time
SELECT
LEFT(query, 80) AS query_preview,
calls,
ROUND(total_exec_time::numeric, 2) AS total_ms,
ROUND(mean_exec_time::numeric, 2) AS mean_ms,
ROUND(stddev_exec_time::numeric, 2) AS stddev_ms,
rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Check what's currently running (useful during live incidents)
SELECT
pid,
now() - query_start AS duration,
state,
wait_event_type,
wait_event,
LEFT(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
AND query_start < now() - interval '5 seconds'
ORDER BY duration DESC;
The summary dashboard query was consuming 73% of total database time — from just 47 minutes of being live. Mean execution time: 44.3 seconds. Max observed: 3 minutes 12 seconds.
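One practical note: pg_stat_statements numbers are cumulative, so when you want to measure a specific window (say, right after a deploy), reset the counters first so old history doesn't dilute the averages. A sketch — this requires superuser or an explicit grant by default:

```sql
-- Clear cumulative query statistics so the next measurement
-- window starts clean. Requires appropriate privileges.
SELECT pg_stat_statements_reset();
```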
Next, EXPLAIN ANALYZE on the actual query with a production-sized tenant:
EXPLAIN ANALYZE output (abbreviated):
Gather (cost=2847923..2891023 rows=31200 width=72)
(actual time=44291..44312 rows=847 loops=1)
Workers Planned: 2 / Launched: 2
→ Hash Join
Hash Cond: (d.assigned_to = u.id)
→ Hash Join
Hash Cond: (d.company_id = c.id)
→ Seq Scan on deals d ← 🔴 THE PROBLEM
Filter: (tenant_id = 1047)
Rows Removed by Filter: 7,698,432
Actual time: 38,412ms ← 38 seconds just scanning!
→ Hash on companies (340K rows)
Actual time: 2,341ms
→ Hash on users
Actual time: 102ms
Planning Time: 12.3 ms
Execution Time: 44,312.6 ms ← 44 seconds. Per request.
KEY FINDINGS:
✗ Seq Scan on deals — 8.3M rows scanned, 7.7M discarded (93%!)
✗ No index on deals.tenant_id
✗ No index on deals.assigned_to
✗ No index on deals.created_at
✗ companies table also doing a full seq scan (340K rows)
No index on tenant_id. For a multi-tenant SaaS database. Every query filtering
by tenant was doing a full table scan and discarding 93% of rows. We'd gotten away with
it for two years because all our other queries were simple primary-key lookups.
The summary query was the first one that needed to scan a large slice of the table.
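You don't have to wait for an incident to find this class of problem. PostgreSQL's per-table statistics expose it directly: a large table with huge sequential-scan row counts and little index usage is a prime suspect. A sketch of the audit query:

```sql
-- Tables whose reads are dominated by sequential scans.
-- High seq_tup_read with low idx_scan on a big table usually
-- means a missing index.
SELECT relname,
       seq_scan,        -- sequential scans started on this table
       seq_tup_read,    -- rows read by those scans
       idx_scan,        -- index scans started on this table
       n_live_tup       -- approximate live row count
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC
LIMIT 10;
```

Had we run this at any point in the previous two years, deals would have been at the top of the list.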
The Fix — And Why It's More Than "Just Add an Index"
The obvious fix was to add indexes. But we had 8.3 million rows in production.
A plain CREATE INDEX holds a lock that blocks all writes to the table for the
duration of the build — potentially 20+ minutes. Another outage. We needed a
zero-downtime strategy.
Migration Strategy — Zero Downtime Indexing
─────────────────────────────────────────────────────────────────────
WRONG:
CREATE INDEX idx_deals_tenant ON deals(tenant_id);
→ Takes a SHARE lock on the table for the whole build
→ Blocks all writes for 20–30 minutes (reads keep working)
→ Hello, second outage.
RIGHT:
CREATE INDEX CONCURRENTLY idx_deals_tenant ON deals(tenant_id);
→ Builds the index without blocking reads or writes
→ PostgreSQL scans the table twice — slower build, zero impact
→ Available since PostgreSQL 8.2
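One caveat worth knowing before you rely on CONCURRENTLY: if the build fails or is cancelled, PostgreSQL leaves behind an INVALID index that consumes space and slows writes while serving no queries. You can check for leftovers like this:

```sql
-- Find invalid indexes left behind by failed CONCURRENTLY builds.
-- Drop and rebuild them (DROP INDEX CONCURRENTLY also avoids
-- blocking writes).
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
```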
Our execution order (Monday evening, after traffic died down):
Step 1: CREATE INDEX CONCURRENTLY idx_deals_tenant_created
ON deals(tenant_id, created_at DESC);
Build time: 31 minutes. Zero locks. Zero downtime.
Step 2: CREATE INDEX CONCURRENTLY idx_deals_tenant_status
ON deals(tenant_id, status, assigned_to);
Build time: 28 minutes.
Step 3: CREATE INDEX CONCURRENTLY idx_companies_tenant
ON companies(tenant_id);
Build time: 4 minutes. (Smaller table.)
Step 4: Rewrote dashboard to use a materialized view
refreshed every 15 minutes via pg_cron.
Step 5: Re-deployed the dashboard. 11ms query time. Ship it.
The composite index on (tenant_id, created_at DESC) was the key decision.
PostgreSQL uses it to satisfy the WHERE tenant_id = $1 filter with an index
scan instead of walking all 8.3 million rows, and it hands back that tenant's
deals newest-first. (Strictly speaking, the ORDER BY is on the expression
DATE_TRUNC('month', created_at), so the planner still sorts the aggregated
output — but that's hundreds of rows, not millions.)
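If you wanted the sort itself served by an index, an expression index matching the ORDER BY exactly would do it. This is a hypothetical sketch, not what we shipped — the materialized view made it unnecessary — and it assumes created_at is timestamp without time zone, since DATE_TRUNC over timestamptz is only STABLE, not IMMUTABLE, and so can't appear in an index expression directly:

```sql
-- Hypothetical: an expression index matching the ORDER BY exactly.
-- Note the extra parentheses required around the expression column.
CREATE INDEX CONCURRENTLY idx_deals_tenant_month
    ON deals (tenant_id, (DATE_TRUNC('month', created_at)) DESC);
```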
But indexing alone wasn't enough for our largest tenants. Even with the index, aggregating 600,000 rows per request was unacceptably slow during traffic spikes. The real fix was moving to a pre-aggregated materialized view:
-- Pre-aggregate deal stats into a materialized view
CREATE MATERIALIZED VIEW deal_monthly_summary AS
SELECT
d.tenant_id,
u.id AS user_id,
u.name AS rep_name,
DATE_TRUNC('month', d.created_at) AS month,
COUNT(*) AS total_deals,
COUNT(*) FILTER (WHERE d.status = 'won') AS won,
COUNT(*) FILTER (WHERE d.status = 'lost') AS lost,
SUM(d.value) AS pipeline_value
FROM deals d
JOIN users u ON u.id = d.assigned_to
GROUP BY
d.tenant_id,
u.id,
u.name,
DATE_TRUNC('month', d.created_at)
WITH DATA;
-- Fast tenant lookups on the view (tiny table — ~50K rows total)
CREATE UNIQUE INDEX ON deal_monthly_summary(tenant_id, user_id, month);
-- Refresh every 15 minutes via pg_cron (stale data is acceptable)
SELECT cron.schedule(
'refresh-deal-summary',
'*/15 * * * *',
'REFRESH MATERIALIZED VIEW CONCURRENTLY deal_monthly_summary;'
);
The dashboard now queries the materialized view — ~50,000 pre-aggregated rows instead of 8.3 million raw rows. For our largest enterprise tenant, query time dropped from 44.3 seconds to 11ms. A 4,000× improvement.
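The read path after the change looks roughly like this — a sketch, assuming the endpoint needs exactly the columns the view exposes:

```sql
-- Dashboard query against the pre-aggregated view.
-- The unique index on (tenant_id, user_id, month) serves the filter.
SELECT rep_name, month, total_deals, won, lost, pipeline_value
FROM deal_monthly_summary
WHERE tenant_id = $1
ORDER BY month DESC, pipeline_value DESC;
```

As a bonus, that unique index is also what makes REFRESH MATERIALIZED VIEW CONCURRENTLY possible — without a unique index, the refresh would lock out readers.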
What We Changed Forever
This incident permanently changed how we think about database work on the team. The process changes that stuck:
- Every new query gets EXPLAIN ANALYZE against a prod-sized dataset. We maintain a staging environment seeded with anonymised production data specifically for this. If you can't read and explain the query plan, the query doesn't ship.
- Multi-tenant tables always get composite indexes starting with tenant_id. This is now item one on our database schema checklist. The tenancy column comes first in every index. No exceptions.
- Aggregation queries over large datasets use materialized views. If you're GROUP BY-ing more than ~50K rows on any user-facing endpoint, you need a pre-aggregation strategy. Real-time feels nice; a 44-second query does not.
- All production index creation uses CONCURRENTLY. We have a migration linter that rejects any SQL file containing a bare CREATE INDEX without the CONCURRENTLY keyword.
- Mobile retry logic has circuit breakers. Exponential backoff with jitter, max 3 retries, and a circuit breaker that stops after 5 consecutive failures. Retry storms are preventable — they're a design choice, not an inevitability.
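The first rule in practice: run the plan with buffer counts against the staging dataset and actually read it before merging. For example, probing just the tenant filter (tenant 1047 is the one from the incident):

```sql
-- What we look for: an Index Scan rather than a Seq Scan, few rows
-- removed by the filter, and a sane shared-buffers read count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM deals
WHERE tenant_id = 1047;
```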
The Rule I Live By Now
"If your query works fine in development, that means nothing. Development has no data. Production is where gravity lives."
The hardest part of this incident wasn't the rollback, or the Slack thread, or writing the
apologetic status page update. It was knowing it was entirely preventable. One
EXPLAIN ANALYZE run against a realistic dataset would have caught the missing
index before it touched a single production user.
That check is now mandatory — on the PR checklist, in the CI pipeline for schema changes, in the onboarding docs for new engineers. Because a four-line SQL query that takes 44 seconds isn't a four-line problem. It's an 8.3 million row problem wearing a four-line disguise.
The database doesn't care how clean your code looks. It only cares about the data.
— Darshan