ECS Autoscaling Fought Our Postgres max_connections at 2AM and Postgres Won
At 2:14 AM on a Saturday, PagerDuty woke me up. Our API error rate had gone from 0% to 78% in under four minutes.
Eleven thousand users were mid-checkout during a flash sale we'd announced via email six hours earlier.
The error was one I'd never seen in production before: remaining connection slots are reserved for non-replication superuser connections.
It took us 2.5 hours to fully understand what happened, and the root cause was embarrassingly simple math we'd never done.
Production Failure
The timeline was brutal in its speed. At 2:10 AM, the flash sale promo email hit 80,000 inboxes. By 2:12 AM, traffic had tripled. ECS autoscaling kicked in — exactly as designed. By 2:14 AM, the first health checks started failing. By 2:18 AM, 78% of API requests were returning 500s.
CloudWatch showed ECS scaling from 5 tasks to 38 tasks in roughly 12 minutes. The p99 latency went from 180ms to 30 seconds (our timeout limit) and then to complete connection refusals. Approximately 14,200 USD in transactions failed or were abandoned. We rolled back to 5 tasks manually and restored service at 4:42 AM — 2 hours and 28 minutes after the first alert.
TIMELINE OF COLLAPSE
02:10 AM  Promo email delivered → traffic 3× normal
02:12 AM  ECS autoscaling triggers [5 tasks → scaling up]
02:14 AM  First health check fails [error rate: 12%]
02:16 AM  New tasks can't connect [error rate: 41%]
02:18 AM  Connection slots exhausted [error rate: 78%]
02:19 AM  Old tasks start failing too [error rate: 94%]
02:22 AM  On-call engineer joins bridge
04:42 AM  Service restored (manual rollback + pool resize)

Users affected: ~11,400
Failed revenue: ~$14,200
Time to resolve: 2h 28m
False Assumptions
Our autoscaling configuration looked fine on paper. We'd set sensible CPU and memory thresholds.
We'd tested deployments under load. We'd validated that individual tasks were healthy.
What we'd never done was treat max_connections as a hard ceiling that every new task competed for.
The assumption baked into our infrastructure was: more tasks handle more traffic.
That's true in a stateless world. But every Node.js task we ran used knex
with a connection pool, and every pool was configured with the same environment variable:
PG_POOL_SIZE=10. We'd set that once, years ago, and never revisited it.
It was a reasonable number for 5 tasks. It became catastrophic for 38.
We also assumed our RDS instance had been sized generously. We'd upgraded it from
db.t3.small to db.t3.medium six months earlier for performance reasons.
Nobody re-checked what that meant for max_connections.
On AWS RDS, that value is formula-driven: LEAST(DBInstanceClassMemory / 9531392, 5000).
A db.t3.medium has 4 GB of RAM — giving it a max_connections of roughly 170.
Investigation
The error message itself was the first real clue.
remaining connection slots are reserved for non-replication superuser connections
is Postgres's way of saying: we're out of connections. Not slow — completely exhausted.
I queried pg_stat_activity from an admin connection (thankfully Postgres reserves
3 slots for superusers by default):
SELECT count(*), state, wait_event_type
FROM pg_stat_activity
WHERE datname = 'proddb'
GROUP BY state, wait_event_type
ORDER BY count DESC;
-- count | state  | wait_event_type
-- ------+--------+-----------------
--   167 | active | Client
-- (1 row)
SELECT setting FROM pg_settings WHERE name = 'max_connections';
-- 170
All 167 available slots were in use. Meanwhile, ECS was trying to start new tasks and each one
needed at least 1 connection to pass its health check — which hit a /healthz endpoint
that ran a SELECT 1 query. They couldn't get a connection, so they failed health checks,
so ECS terminated them and started fresh ones. Which also couldn't connect.
The autoscaler was in a death loop, consuming connection attempts without ever succeeding.
When existing tasks then tried to acquire new pool connections — as idle ones aged out — they also started failing. Within minutes, even the healthy tasks were rejecting requests.
THE CONNECTION MATH (why it broke)

ECS tasks:                  38
Pool size/task:           × 10
──────────────────────────────
Total attempted:           380 connections

RDS db.t3.medium
max_connections:           170
Reserved for superuser:   -  3
──────────────────────────────
Available to app:          167

Overflow: 380 - 167 = 213 connections REFUSED

New tasks → health check fails → ECS cycles → repeat
Old tasks → pool refresh fails → requests error → cascade
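The same arithmetic, written as a pure helper so it can live in tests and capacity checks (the function and its name are my sketch; the numbers are the incident's):

```javascript
// Connection headroom math: how many connections a fleet will attempt
// versus what Postgres can actually hand out. Names are illustrative.
function connectionHeadroom(maxTasks, poolSizePerTask, maxConnections, reservedSuperuser = 3) {
  const attempted = maxTasks * poolSizePerTask;          // worst-case demand
  const available = maxConnections - reservedSuperuser;  // slots the app can use
  return {
    attempted,
    available,
    overflow: Math.max(0, attempted - available),        // connections refused
  };
}

// The incident configuration: 38 tasks × 10-connection pools vs. db.t3.medium
console.log(connectionHeadroom(38, 10, 170));
// → { attempted: 380, available: 167, overflow: 213 }
```

At the pre-scale steady state (5 tasks × 10), the same helper reports an overflow of 0 — which is why the configuration looked fine for years.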
Root Cause
The root cause was a missing invariant: we had never encoded the constraint
MAX_TASKS × POOL_SIZE < max_connections anywhere — not in code, not in documentation,
not in infrastructure-as-code, not in alerts. It wasn't in our runbooks.
It wasn't in our autoscaling configuration. It simply didn't exist as a concept in our system design.
Compounding this: we'd upgraded the RDS instance class for performance, not knowing that
max_connections is RAM-proportional. When we went from db.t3.small (2 GB)
to db.t3.medium (4 GB), max_connections doubled — from ~85 to ~170.
That felt like plenty. We never wrote down the number or checked it against our autoscaling ceiling.
The specific cascade that made it so bad: our ECS health check grace period was set to 30 seconds. During those 30 seconds, a new task held open connection pool slots even while failing health checks. So each doomed task consumed connections for 30 seconds before ECS killed it and tried again — which meant a large number of "zombie" connection attempts were always draining the pool.
The Fix
The emergency fix was manual: reduce PG_POOL_SIZE to 2, set ECS desired count back to 5,
and let the task churn settle. Service was restored 8 minutes after we made those changes.
The permanent fix had three parts. First, we added PgBouncer as a connection pooler in transaction pooling mode between our ECS tasks and RDS:
[databases]
proddb = host=rds-endpoint.us-east-1.rds.amazonaws.com port=5432 dbname=proddb
[pgbouncer]
listen_port = 5432
listen_addr = 0.0.0.0
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; Transaction pooling: connection released after each transaction
pool_mode = transaction
; Max connections PgBouncer holds to Postgres per user/db pair
default_pool_size = 80
max_db_connections = 100
; Client-facing: up to 1000 app connections multiplexed
max_client_conn = 1000
; Keep some connections warm
min_pool_size = 5
reserve_pool_size = 10
With PgBouncer, our ECS tasks connect to the pooler (not directly to RDS). PgBouncer maintains at most 100 real Postgres connections regardless of how many ECS tasks exist. In transaction mode, a server connection is only held for the duration of a transaction — idle app connections consume zero server connections.
Second, we added a startup assertion in our Node.js app that fails fast if the math is wrong:
const knex = require('knex');

const POOL_SIZE = parseInt(process.env.PG_POOL_SIZE ?? '5', 10);
const MAX_ECS_TASKS = parseInt(process.env.ECS_MAX_TASKS ?? '50', 10);
const PG_MAX_CONNECTIONS = parseInt(process.env.PG_MAX_CONNECTIONS ?? '170', 10);
// Fail loudly at startup rather than silently at 2AM
const worstCaseConnections = MAX_ECS_TASKS * POOL_SIZE;
if (worstCaseConnections >= PG_MAX_CONNECTIONS * 0.8) {
throw new Error(
`Connection math unsafe: ${MAX_ECS_TASKS} tasks × ${POOL_SIZE} pool = ` +
`${worstCaseConnections} connections ≥ 80% of max (${PG_MAX_CONNECTIONS}). ` +
`Reduce PG_POOL_SIZE or add PgBouncer.`
);
}
const pool = knex({
client: 'pg',
connection: { host: process.env.PGBOUNCER_HOST, /* ... */ },
pool: { min: 1, max: POOL_SIZE },
});
Third, we added a CloudWatch alarm on the RDS DatabaseConnections metric —
alerting at 70% of max_connections so we'd know long before saturation.
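The threshold itself is just the 70% figure applied to whatever max_connections currently is — worth computing from the live setting rather than hardcoding, so an instance resize updates the alarm too (helper name is mine; the alarm lives in CloudWatch on the RDS DatabaseConnections metric):

```javascript
// Derive the DatabaseConnections alarm threshold from max_connections.
// 0.7 is the post's "alert at 70%" margin; the rest is arithmetic.
function alarmThreshold(maxConnections, fraction = 0.7) {
  return Math.floor(maxConnections * fraction);
}

console.log(alarmThreshold(170)); // → 119
```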
ARCHITECTURE: BEFORE vs AFTER
BEFORE
──────────────────────────────────────
[ECS Task 1]─┐
[ECS Task 2]─┤ (each: 10 direct connections)
[ECS Task 3]─┼─────────────────▶ [RDS Postgres]
... │ max_conn: 170
[ECS Task 38]┘
Total attempted: 380 → EXHAUSTED
AFTER
──────────────────────────────────────
[ECS Task 1] ─┐
[ECS Task 2] ─┤ (each: 5 connections
[ECS Task 3] ─┤ to PgBouncer)
... ├──▶ [PgBouncer]──────▶ [RDS Postgres]
[ECS Task N] ─┘       max_client: 1000      max_db_conn: 100
                      tx pooling mode       max_conn: 170
Postgres sees ≤ 100 connections regardless of ECS scale
Lessons Learned
Every shared resource needs a capacity formula.
Autoscaling changes the multiplier on your resource consumption.
Database connections, Redis connection limits, third-party API rate limits —
anything shared across tasks needs a hard invariant:
MAX_TASKS × PER_TASK_USAGE < RESOURCE_LIMIT.
Write it down. Encode it as a startup assertion. Alert on it at 70%.
RDS instance resizes silently change max_connections.
When you resize an RDS instance for performance, max_connections changes too —
it's derived from instance memory, not a fixed config value you set once.
After every RDS instance class change, re-run:
SELECT setting FROM pg_settings WHERE name = 'max_connections';
and update your capacity math.
PgBouncer is not optional for ECS/Kubernetes workloads. When your task count is dynamic, direct Postgres connections are a liability. PgBouncer in transaction pooling mode decouples app-layer concurrency from Postgres server connections. It's a 30-minute setup that prevents exactly this class of incident.
Health check design matters during connection exhaustion.
Our /healthz endpoint ran a SELECT 1 — which needs a DB connection.
During connection exhaustion, this meant every new task burned connections just to confirm it was unhealthy.
We split our health check: a shallow /ping (no DB) for ECS health,
and a deeper /ready (with DB check) used only during deployments.
The most dangerous outages are math problems, not code bugs. There was no bug in our application code. There was no misconfiguration in Postgres. The system worked exactly as designed — we just hadn't done the arithmetic before enabling autoscaling. After this incident, I added a capacity planning section to every service's runbook: what are the shared resources, what's the per-task usage, and what's the ceiling?