ECS Autoscaling Fought Our Postgres max_connections at 2AM and Postgres Won
March 16, 2026 · Database · 9 min read


At 2:14 AM on a Saturday, PagerDuty woke me up. Our API error rate had gone from 0% to 78% in under four minutes. Eleven thousand users were mid-checkout during a flash sale we'd announced via email six hours earlier. The error was one I'd never seen in production before: "remaining connection slots are reserved for non-replication superuser connections". It took us 2.5 hours to fully understand what happened, and the root cause was embarrassingly simple math we'd never done.


Production Failure

The timeline was brutal in its speed. At 2:10 AM, the flash sale promo email hit 80,000 inboxes. By 2:12 AM, traffic had tripled. ECS autoscaling kicked in — exactly as designed. By 2:14 AM, the first health checks started failing. By 2:18 AM, 78% of API requests were returning 500s.

CloudWatch showed ECS scaling from 5 tasks to 38 tasks in roughly 12 minutes. The p99 latency went from 180ms to 30 seconds (our timeout limit) and then to complete connection refusals. Approximately 14,200 USD in transactions failed or were abandoned. We rolled back to 5 tasks manually and restored service at 4:42 AM — 2 hours and 28 minutes after the first alert.

  TIMELINE OF COLLAPSE

  02:10 AM  Promo email delivered → traffic 3× normal
  02:12 AM  ECS autoscaling triggers  [5 tasks → scaling up]
  02:14 AM  First health check fails  [error rate: 12%]
  02:16 AM  New tasks can't connect   [error rate: 41%]
  02:18 AM  Connection slots exhausted [error rate: 78%]
  02:19 AM  Old tasks start failing too [error rate: 94%]
  02:22 AM  On-call engineer joins bridge
  04:42 AM  Service restored (manual rollback + pool resize)

  Users affected:  ~11,400
  Failed revenue:  ~$14,200
  Time to resolve: 2h 28m

False Assumptions

Our autoscaling configuration looked fine on paper. We'd set sensible CPU and memory thresholds. We'd tested deployments under load. We'd validated that individual tasks were healthy. What we'd never done was treat max_connections as a hard ceiling that every new task competed for.

The assumption baked into our infrastructure was: more tasks handle more traffic. That's true in a stateless world. But every Node.js task we ran used knex with a connection pool, and every pool was configured with the same environment variable: PG_POOL_SIZE=10. We'd set that once, years ago, and never revisited it. It was a reasonable number for 5 tasks. It became catastrophic for 38.
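
For concreteness, each task's database setup looked roughly like this (a reconstruction: only knex and PG_POOL_SIZE are from the real config; the host and credential wiring here are illustrative):

typescript — per-task pool, as it was (reconstruction)
import { knex } from 'knex';

// One pool per ECS task, sized by the same env var everywhere.
// PG_POOL_SIZE=10 meant up to 10 direct Postgres connections per task.
const db = knex({
  client: 'pg',
  connection: {
    host: process.env.PGHOST,        // pointed straight at RDS back then
    database: 'proddb',
    user: process.env.PGUSER,
    password: process.env.PGPASSWORD,
  },
  pool: { min: 2, max: parseInt(process.env.PG_POOL_SIZE ?? '10', 10) },
});

export default db;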

We also assumed our RDS instance had been sized generously. We'd upgraded it from db.t3.small to db.t3.medium six months earlier for performance reasons. Nobody re-checked what that meant for max_connections. On AWS RDS, that value is formula-driven: LEAST(DBInstanceClassMemory / 9531392, 5000). A db.t3.medium has 4 GB of RAM — giving it a max_connections of roughly 170.

Investigation

The error message itself was the first real clue. "remaining connection slots are reserved for non-replication superuser connections" is Postgres's way of saying: we're out of connections. Not slow — completely exhausted.

I queried pg_stat_activity from an admin connection (thankfully Postgres reserves 3 slots for superusers by default):

sql
SELECT count(*), state, wait_event_type
FROM pg_stat_activity
WHERE datname = 'proddb'
GROUP BY state, wait_event_type
ORDER BY count DESC;

--  count | state  | wait_event_type
-- -------+--------+-----------------
--    167 | active | Client
-- (1 row)

SELECT setting FROM pg_settings WHERE name = 'max_connections';
-- 170

All 167 available slots were in use. Meanwhile, ECS was trying to start new tasks and each one needed at least 1 connection to pass its health check — which hit a /healthz endpoint that ran a SELECT 1 query. They couldn't get a connection, so they failed health checks, so ECS terminated them and started fresh ones. Which also couldn't connect. The autoscaler was in a death loop, consuming connection attempts without ever succeeding.
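
The health check handler itself was trivial. It looked roughly like this — Express here is an assumption; only the /healthz path and the SELECT 1 are from the incident, and the import path is hypothetical:

typescript — /healthz as it existed (sketch)
import express from 'express';
import db from './db'; // the per-task knex pool; path is hypothetical

const app = express();

// Every probe needs a live Postgres connection. Under connection
// exhaustion a brand-new task can never answer 200, so ECS kills it
// and starts another one that fails the same way.
app.get('/healthz', async (_req, res) => {
  try {
    await db.raw('SELECT 1');
    res.status(200).send('ok');
  } catch {
    res.status(503).send('db unreachable');
  }
});

app.listen(3000);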

When existing tasks then tried to acquire new pool connections — as idle ones aged out — they also started failing. Within minutes, even the healthy tasks were rejecting requests.

  THE CONNECTION MATH (why it broke)

  ECS tasks:         38
  Pool size/task:  × 10
  ─────────────────────
  Total attempted: 380 connections

  RDS db.t3.medium max_connections:     170
  Reserved for superuser:              -  3
  ─────────────────────────────────────────
  Available to app:                     167

  Overflow:   380 - 167 = 213 connections REFUSED

  New tasks → health check fails → ECS cycles → repeat
  Old tasks → pool refresh fails → requests error → cascade

Root Cause

The root cause was a missing invariant: we had never encoded the constraint MAX_TASKS × POOL_SIZE < max_connections anywhere — not in code, not in documentation, not in infrastructure-as-code, not in alerts. It wasn't in our runbooks. It wasn't in our autoscaling configuration. It simply didn't exist as a concept in our system design.

Compounding this: we'd upgraded the RDS instance class for performance, not knowing that max_connections is RAM-proportional. When we went from db.t3.small (2 GB) to db.t3.medium (4 GB), max_connections doubled — from ~85 to ~170. That felt like plenty. We never wrote down the number or checked it against our autoscaling ceiling.

The specific cascade that made it so bad: ECS's default health check grace period was 30 seconds. During those 30 seconds, a new task held open connection pool slots even while failing health checks. So each doomed task was consuming connections for 30 seconds before ECS killed it and tried again. That meant we always had a large number of "zombie" connection attempts draining the pool.

The Fix

The emergency fix was manual: reduce PG_POOL_SIZE to 2, set ECS desired count back to 5, and let the task churn settle. Service was restored 8 minutes after we made those changes.
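
We made both changes by hand. Scripted with the AWS SDK, the scale-down half would look roughly like this (cluster and service names are illustrative; the PG_POOL_SIZE change still needs a new task definition revision on top of it):

typescript — scripted rollback (sketch, AWS SDK v3)
import { ECSClient, UpdateServiceCommand } from '@aws-sdk/client-ecs';

const ecs = new ECSClient({ region: 'us-east-1' });

// Pin the service back to its pre-incident size so the task churn stops.
await ecs.send(new UpdateServiceCommand({
  cluster: 'prod-cluster',   // illustrative
  service: 'api-service',    // illustrative
  desiredCount: 5,
}));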

The permanent fix had three parts. First, we added PgBouncer as a connection pooler in transaction pooling mode between our ECS tasks and RDS:

ini — pgbouncer.ini
[databases]
proddb = host=rds-endpoint.us-east-1.rds.amazonaws.com port=5432 dbname=proddb

[pgbouncer]
listen_port = 5432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Transaction pooling: connection released after each transaction
pool_mode = transaction

; Server connections per user/database pair
default_pool_size = 80
; Hard cap on connections PgBouncer opens to this database
max_db_connections = 100

; Client-facing: up to 1000 app connections multiplexed
max_client_conn = 1000

; Keep some connections warm
min_pool_size = 5
reserve_pool_size = 10

With PgBouncer, our ECS tasks connect to the pooler (not directly to RDS). PgBouncer maintains at most 100 real Postgres connections regardless of how many ECS tasks exist. In transaction mode, a server connection is only held for the duration of a transaction — idle app connections consume zero server connections.

Second, we added a startup assertion in our Node.js app that fails fast if the math is wrong:

typescript — src/db/index.ts
import { knex } from 'knex';

const POOL_SIZE = parseInt(process.env.PG_POOL_SIZE ?? '5', 10);
const MAX_ECS_TASKS = parseInt(process.env.ECS_MAX_TASKS ?? '50', 10);
const PG_MAX_CONNECTIONS = parseInt(process.env.PG_MAX_CONNECTIONS ?? '170', 10);

// Fail loudly at startup rather than silently at 2AM
const worstCaseConnections = MAX_ECS_TASKS * POOL_SIZE;
if (worstCaseConnections >= PG_MAX_CONNECTIONS * 0.8) {
  throw new Error(
    `Connection math unsafe: ${MAX_ECS_TASKS} tasks × ${POOL_SIZE} pool = ` +
    `${worstCaseConnections} connections ≥ 80% of max (${PG_MAX_CONNECTIONS}). ` +
    `Reduce PG_POOL_SIZE or add PgBouncer.`
  );
}

const pool = knex({
  client: 'pg',
  connection: { host: process.env.PGBOUNCER_HOST, /* ... */ },
  pool: { min: 1, max: POOL_SIZE },
});

Third, we added a CloudWatch alarm on the RDS DatabaseConnections metric — alerting at 70% of max_connections so we'd know long before saturation.
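
If you define that alarm as code, it's a few lines. A rough sketch assuming the AWS CDK v2 — the instance identifier is illustrative and the alarm actions (SNS, paging) are omitted:

typescript — connection headroom alarm (sketch, AWS CDK v2)
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

const app = new App();
const stack = new Stack(app, 'DbCapacityAlarms');

// Fire when the RDS connection count sits above 70% of max_connections
// (170 on our db.t3.medium) for three consecutive minutes.
new cloudwatch.Alarm(stack, 'RdsConnectionHeadroom', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'DatabaseConnections',
    dimensionsMap: { DBInstanceIdentifier: 'proddb-instance' }, // illustrative
    statistic: 'Maximum',
    period: Duration.minutes(1),
  }),
  threshold: 119, // 70% of max_connections (170)
  evaluationPeriods: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
});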

  ARCHITECTURE: BEFORE vs AFTER

  BEFORE
  ──────────────────────────────────────
  [ECS Task 1]─┐
  [ECS Task 2]─┤  (each: 10 direct connections)
  [ECS Task 3]─┼─────────────────▶ [RDS Postgres]
  ...          │                    max_conn: 170
  [ECS Task 38]┘
  Total attempted: 380 → EXHAUSTED

  AFTER
  ──────────────────────────────────────
  [ECS Task 1] ─┐
  [ECS Task 2] ─┤  (each: 5 connections
  [ECS Task 3] ─┤   to PgBouncer)
  ...           ├──▶ [PgBouncer]──────▶ [RDS Postgres]
  [ECS Task N] ─┘    max_client: 1000   server_pool: 100
                     tx pooling mode    max_conn: 170

  Postgres sees ≤ 100 connections regardless of ECS scale

Lessons Learned

Every shared resource needs a capacity formula. Autoscaling changes the multiplier on your resource consumption. Database connections, Redis connection limits, third-party API rate limits — anything shared across tasks needs a hard invariant: MAX_TASKS × PER_TASK_USAGE < RESOURCE_LIMIT. Write it down. Encode it as a startup assertion. Alert on it at 70%.

RDS instance resizes silently change max_connections. When you resize an RDS instance for performance, max_connections changes with it, because the default is derived from instance memory rather than set as a fixed config value. After every RDS instance class change, re-run SELECT setting FROM pg_settings WHERE name = 'max_connections'; and update your capacity math.

PgBouncer is not optional for ECS/Kubernetes workloads. When your task count is dynamic, direct Postgres connections are a liability. PgBouncer in transaction pooling mode decouples app-layer concurrency from Postgres server connections. It's a 30-minute setup that prevents exactly this class of incident.

Health check design matters during connection exhaustion. Our /healthz endpoint ran a SELECT 1 — which needs a DB connection. During connection exhaustion, this meant every new task burned connections just to confirm it was unhealthy. We split our health check: a shallow /ping (no DB) for ECS health, and a deeper /ready (with DB check) used only during deployments.
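
In code the split is just two routes. A sketch assuming Express (the /ping and /ready paths are from our change; everything else is illustrative):

typescript — split health checks (sketch)
import express from 'express';
import db from './db'; // knex instance pointed at PgBouncer; path is hypothetical

const app = express();

// Shallow liveness probe for ECS: no DB dependency, so connection
// pressure can no longer convince ECS that a healthy task is dead.
app.get('/ping', (_req, res) => {
  res.status(200).send('ok');
});

// Deeper readiness probe, used only during deployments: confirms the
// task can actually reach Postgres before it takes traffic.
app.get('/ready', async (_req, res) => {
  try {
    await db.raw('SELECT 1');
    res.status(200).send('ready');
  } catch {
    res.status(503).send('db unreachable');
  }
});

app.listen(3000);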

The most dangerous outages are math problems, not code bugs. There was no bug in our application code. There was no misconfiguration in Postgres. The system worked exactly as designed — we just hadn't done the arithmetic before enabling autoscaling. After this incident, I added a capacity planning section to every service's runbook: what are the shared resources, what's the per-task usage, and what's the ceiling?
