ECS Autoscaling Fought Our Postgres max_connections at 2AM and Postgres Won
March 16, 2026 · Database · 9 min read


At 2:14 AM on a Saturday, PagerDuty woke me up. Our API error rate had gone from 0% to 78% in under four minutes. Eleven thousand users were mid-checkout during a flash sale we'd announced via email six hours earlier. The error was one I'd never seen in production before: "remaining connection slots are reserved for non-replication superuser connections". It took us 2.5 hours to fully understand what happened, and the root cause was embarrassingly simple math we'd never done.


Production Failure

The timeline was brutal in its speed. At 2:10 AM, the flash sale promo email hit 80,000 inboxes. By 2:12 AM, traffic had tripled. ECS autoscaling kicked in — exactly as designed. By 2:14 AM, the first health checks started failing. By 2:18 AM, 78% of API requests were returning 500s.

CloudWatch showed ECS scaling from 5 tasks to 38 tasks in roughly 12 minutes. The p99 latency went from 180ms to 30 seconds (our timeout limit) and then to complete connection refusals. Approximately 14,200 USD in transactions failed or were abandoned. We rolled back to 5 tasks manually and restored service at 4:42 AM — 2 hours and 28 minutes after the first alert.

  TIMELINE OF COLLAPSE

  02:10 AM  Promo email delivered → traffic 3× normal
  02:12 AM  ECS autoscaling triggers  [5 tasks → scaling up]
  02:14 AM  First health check fails  [error rate: 12%]
  02:16 AM  New tasks can't connect   [error rate: 41%]
  02:18 AM  Connection slots exhausted [error rate: 78%]
  02:19 AM  Old tasks start failing too [error rate: 94%]
  02:22 AM  On-call engineer joins bridge
  04:42 AM  Service restored (manual rollback + pool resize)

  Users affected:  ~11,400
  Failed revenue:  ~$14,200
  Time to resolve: 2h 28m

False Assumptions

Our autoscaling configuration looked fine on paper. We'd set sensible CPU and memory thresholds. We'd tested deployments under load. We'd validated that individual tasks were healthy. What we'd never done was treat max_connections as a hard ceiling that every new task competed for.

The assumption baked into our infrastructure was: more tasks handle more traffic. That's true in a stateless world. But every Node.js task we ran used knex with a connection pool, and every pool was configured with the same environment variable: PG_POOL_SIZE=10. We'd set that once, years ago, and never revisited it. It was a reasonable number for 5 tasks. It became catastrophic for 38.
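
For concreteness, each task's database setup looked roughly like this (a reconstruction: only knex and PG_POOL_SIZE are from the real config; the host and credential wiring here are illustrative):

typescript — per-task pool, as it was (reconstruction)
import { knex } from 'knex';

// One pool per ECS task, sized by the same env var everywhere.
// PG_POOL_SIZE=10 meant up to 10 direct Postgres connections per task.
const db = knex({
  client: 'pg',
  connection: {
    host: process.env.PGHOST,        // pointed straight at RDS back then
    database: 'proddb',
    user: process.env.PGUSER,
    password: process.env.PGPASSWORD,
  },
  pool: { min: 2, max: parseInt(process.env.PG_POOL_SIZE ?? '10', 10) },
});

export default db;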

We also assumed our RDS instance had been sized generously. We'd upgraded it from db.t3.small to db.t3.medium six months earlier for performance reasons. Nobody re-checked what that meant for max_connections. On AWS RDS, that value is formula-driven: LEAST(DBInstanceClassMemory / 9531392, 5000). A db.t3.medium has 4 GB of RAM — giving it a max_connections of roughly 170.

Investigation

The error message itself was the first real clue. "remaining connection slots are reserved for non-replication superuser connections" is Postgres's way of saying: we're out of connections. Not slow — completely exhausted.

I queried pg_stat_activity from an admin connection (thankfully Postgres reserves 3 slots for superusers by default):

sql
SELECT count(*), state, wait_event_type
FROM pg_stat_activity
WHERE datname = 'proddb'
GROUP BY state, wait_event_type
ORDER BY count DESC;

--  count | state  | wait_event_type
-- -------+--------+-----------------
--    167 | active | Client
-- (1 row)

SELECT setting FROM pg_settings WHERE name = 'max_connections';
-- 170

All 167 available slots were in use. Meanwhile, ECS was trying to start new tasks and each one needed at least 1 connection to pass its health check — which hit a /healthz endpoint that ran a SELECT 1 query. They couldn't get a connection, so they failed health checks, so ECS terminated them and started fresh ones. Which also couldn't connect. The autoscaler was in a death loop, consuming connection attempts without ever succeeding.
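
The health check handler itself was trivial. It looked roughly like this — Express here is an assumption; only the /healthz path and the SELECT 1 are from the incident, and the import path is hypothetical:

typescript — /healthz as it existed (sketch)
import express from 'express';
import db from './db'; // the per-task knex pool; path is hypothetical

const app = express();

// Every probe needs a live Postgres connection. Under connection
// exhaustion a brand-new task can never answer 200, so ECS kills it
// and starts another one that fails the same way.
app.get('/healthz', async (_req, res) => {
  try {
    await db.raw('SELECT 1');
    res.status(200).send('ok');
  } catch {
    res.status(503).send('db unreachable');
  }
});

app.listen(3000);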

When existing tasks then tried to acquire new pool connections — as idle ones aged out — they also started failing. Within minutes, even the healthy tasks were rejecting requests.

  THE CONNECTION MATH (why it broke)

  ECS tasks:         38
  Pool size/task:  × 10
  ─────────────────────
  Total attempted: 380 connections

  RDS db.t3.medium max_connections:     170
  Reserved for superuser:              -  3
  ─────────────────────────────────────────
  Available to app:                     167

  Overflow:   380 - 167 = 213 connections REFUSED

  New tasks → health check fails → ECS cycles → repeat
  Old tasks → pool refresh fails → requests error → cascade

Root Cause

The root cause was a missing invariant: we had never encoded the constraint MAX_TASKS × POOL_SIZE < max_connections anywhere — not in code, not in documentation, not in infrastructure-as-code, not in alerts. It wasn't in our runbooks. It wasn't in our autoscaling configuration. It simply didn't exist as a concept in our system design.

Compounding this: we'd upgraded the RDS instance class for performance, not knowing that max_connections is RAM-proportional. When we went from db.t3.small (2 GB) to db.t3.medium (4 GB), max_connections doubled — from ~85 to ~170. That felt like plenty. We never wrote down the number or checked it against our autoscaling ceiling.

The specific cascade that made it so bad: ECS's default health check grace period was 30 seconds. During those 30 seconds, a new task held open connection pool slots even while failing health checks. So each doomed task was consuming connections for 30 seconds before ECS killed it and tried again. That meant we always had a large number of "zombie" connection attempts draining the pool.

The Fix

The emergency fix was manual: reduce PG_POOL_SIZE to 2, set ECS desired count back to 5, and let the task churn settle. Service was restored 8 minutes after we made those changes.
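
We made both changes by hand. Scripted with the AWS SDK, the scale-down half would look roughly like this (cluster and service names are illustrative; the PG_POOL_SIZE change still needs a new task definition revision on top of it):

typescript — scripted rollback (sketch, AWS SDK v3)
import { ECSClient, UpdateServiceCommand } from '@aws-sdk/client-ecs';

const ecs = new ECSClient({ region: 'us-east-1' });

// Pin the service back to its pre-incident size so the task churn stops.
await ecs.send(new UpdateServiceCommand({
  cluster: 'prod-cluster',   // illustrative
  service: 'api-service',    // illustrative
  desiredCount: 5,
}));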

The permanent fix had three parts. First, we added PgBouncer as a connection pooler in transaction pooling mode between our ECS tasks and RDS:

ini — pgbouncer.ini
[databases]
proddb = host=rds-endpoint.us-east-1.rds.amazonaws.com port=5432 dbname=proddb

[pgbouncer]
listen_port = 5432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt

; Transaction pooling: connection released after each transaction
pool_mode = transaction

; Server connections per user/database pair
default_pool_size = 80
; Hard cap on connections PgBouncer opens to this database
max_db_connections = 100

; Client-facing: up to 1000 app connections multiplexed
max_client_conn = 1000

; Keep some connections warm
min_pool_size = 5
reserve_pool_size = 10

With PgBouncer, our ECS tasks connect to the pooler (not directly to RDS). PgBouncer maintains at most 100 real Postgres connections regardless of how many ECS tasks exist. In transaction mode, a server connection is only held for the duration of a transaction — idle app connections consume zero server connections.

Second, we added a startup assertion in our Node.js app that fails fast if the math is wrong:

typescript — src/db/index.ts
import { knex } from 'knex';

const POOL_SIZE = parseInt(process.env.PG_POOL_SIZE ?? '5', 10);
const MAX_ECS_TASKS = parseInt(process.env.ECS_MAX_TASKS ?? '50', 10);
const PG_MAX_CONNECTIONS = parseInt(process.env.PG_MAX_CONNECTIONS ?? '170', 10);

// Fail loudly at startup rather than silently at 2AM
const worstCaseConnections = MAX_ECS_TASKS * POOL_SIZE;
if (worstCaseConnections >= PG_MAX_CONNECTIONS * 0.8) {
  throw new Error(
    `Connection math unsafe: ${MAX_ECS_TASKS} tasks × ${POOL_SIZE} pool = ` +
    `${worstCaseConnections} connections ≥ 80% of max (${PG_MAX_CONNECTIONS}). ` +
    `Reduce PG_POOL_SIZE or add PgBouncer.`
  );
}

const pool = knex({
  client: 'pg',
  connection: { host: process.env.PGBOUNCER_HOST, /* ... */ },
  pool: { min: 1, max: POOL_SIZE },
});

Third, we added a CloudWatch alarm on the RDS DatabaseConnections metric — alerting at 70% of max_connections so we'd know long before saturation.
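
If you define that alarm as code, it's a few lines. A rough sketch assuming the AWS CDK v2 — the instance identifier is illustrative and the alarm actions (SNS, paging) are omitted:

typescript — connection headroom alarm (sketch, AWS CDK v2)
import { App, Stack, Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

const app = new App();
const stack = new Stack(app, 'DbCapacityAlarms');

// Fire when the RDS connection count sits above 70% of max_connections
// (170 on our db.t3.medium) for three consecutive minutes.
new cloudwatch.Alarm(stack, 'RdsConnectionHeadroom', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'DatabaseConnections',
    dimensionsMap: { DBInstanceIdentifier: 'proddb-instance' }, // illustrative
    statistic: 'Maximum',
    period: Duration.minutes(1),
  }),
  threshold: 119, // 70% of max_connections (170)
  evaluationPeriods: 3,
  comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_THRESHOLD,
});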

  ARCHITECTURE: BEFORE vs AFTER

  BEFORE
  ──────────────────────────────────────
  [ECS Task 1]─┐
  [ECS Task 2]─┤  (each: 10 direct connections)
  [ECS Task 3]─┼─────────────────▶ [RDS Postgres]
  ...          │                    max_conn: 170
  [ECS Task 38]┘
  Total attempted: 380 → EXHAUSTED

  AFTER
  ──────────────────────────────────────
  [ECS Task 1] ─┐
  [ECS Task 2] ─┤  (each: 5 connections
  [ECS Task 3] ─┤   to PgBouncer)
  ...           ├──▶ [PgBouncer]──────▶ [RDS Postgres]
  [ECS Task N] ─┘    max_client: 1000   server_pool: 100
                     tx pooling mode    max_conn: 170

  Postgres sees ≤ 100 connections regardless of ECS scale

Lessons Learned

Every shared resource needs a capacity formula. Autoscaling changes the multiplier on your resource consumption. Database connections, Redis connection limits, third-party API rate limits — anything shared across tasks needs a hard invariant: MAX_TASKS × PER_TASK_USAGE < RESOURCE_LIMIT. Write it down. Encode it as a startup assertion. Alert on it at 70%.

RDS instance resizes silently change max_connections. When you resize an RDS instance for performance, max_connections changes with it, because the default is derived from instance memory rather than set as a fixed config value. After every RDS instance class change, re-run SELECT setting FROM pg_settings WHERE name = 'max_connections'; and update your capacity math.

PgBouncer is not optional for ECS/Kubernetes workloads. When your task count is dynamic, direct Postgres connections are a liability. PgBouncer in transaction pooling mode decouples app-layer concurrency from Postgres server connections. It's a 30-minute setup that prevents exactly this class of incident.

Health check design matters during connection exhaustion. Our /healthz endpoint ran a SELECT 1 — which needs a DB connection. During connection exhaustion, this meant every new task burned connections just to confirm it was unhealthy. We split our health check: a shallow /ping (no DB) for ECS health, and a deeper /ready (with DB check) used only during deployments.
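
In code the split is just two routes. A sketch assuming Express (the /ping and /ready paths are from our change; everything else is illustrative):

typescript — split health checks (sketch)
import express from 'express';
import db from './db'; // knex instance pointed at PgBouncer; path is hypothetical

const app = express();

// Shallow liveness probe for ECS: no DB dependency, so connection
// pressure can no longer convince ECS that a healthy task is dead.
app.get('/ping', (_req, res) => {
  res.status(200).send('ok');
});

// Deeper readiness probe, used only during deployments: confirms the
// task can actually reach Postgres before it takes traffic.
app.get('/ready', async (_req, res) => {
  try {
    await db.raw('SELECT 1');
    res.status(200).send('ready');
  } catch {
    res.status(503).send('db unreachable');
  }
});

app.listen(3000);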

The most dangerous outages are math problems, not code bugs. There was no bug in our application code. There was no misconfiguration in Postgres. The system worked exactly as designed — we just hadn't done the arithmetic before enabling autoscaling. After this incident, I added a capacity planning section to every service's runbook: what are the shared resources, what's the per-task usage, and what's the ceiling?
