How a Redis Connection Leak Crashed Our AWS ECS Cluster at 3AM
At 3:12 AM on a Tuesday, PagerDuty fired for our primary AWS ECS cluster. Load balancer health checks were failing across all three production tasks. Within four minutes, the React SSR service was returning 502s to 100% of traffic — roughly 2,100 active users mid-session. We scaled from 3 to 6 ECS tasks. Every new task died within 90 seconds. The outage ran for 47 minutes before we understood what was happening.
The 3AM Alert: Tasks Dying Faster Than We Could Replace Them
The first CloudWatch alarm showed api-ssr-prod with 0 healthy tasks. The ALB target group listed every target as unhealthy. The health endpoint — /api/health, HTTP 200 — wasn't responding in time.
Instinct said traffic spike, so we triggered a horizontal scale-out:
```shell
aws ecs update-service \
  --cluster prod-cluster \
  --service api-ssr-prod \
  --desired-count 6
```
Each new task launched, climbed to 2,048 MB of memory, and was OOM-killed before completing a single health check cycle. The scale-out made things actively worse — six dying tasks instead of three.
False Assumptions: We Blamed Everything Except the Code
Hypothesis one: AWS infrastructure fault. us-east-1 had a partial EBS degradation event two months prior — muscle memory kicked in. The AWS Service Health Dashboard showed green across all services.
Hypothesis two: a memory regression from the deploy six hours earlier. We rolled back to the previous task definition. Tasks still hit 2,048 MB and died.
Hypothesis three: a traffic anomaly or DDoS. Request rate at 3 AM was 340 req/s — normal for that hour. CloudFront and Route 53 logs showed nothing unusual.
Twenty-eight minutes elapsed chasing these. The real problem had been running silently since 9:08 PM — six hours before the outage began.
CloudWatch Container Insights: Six Hours of Steady Climb
Pulling the MemoryUtilization metric for the prior 12 hours revealed a perfectly linear slope — starting at 240 MB after the 9:08 PM deploy and climbing at a constant 5.3 MB/min until hitting the 2,048 MB hard limit at 3:12 AM. No spike. No anomaly. A leak.
```
ECS Task Memory (MB) — 12-Hour Window
─────────────────────────────────────────────────────────────
2048 |                                            ████ <- OOM KILL
     |                                        ████
1536 |                                ████
     |                        ████
1024 |                ████
     |            ████
 512 |        ████
     |    ████
 240 |████████████ <- deploy at 9:08 PM
     └────────────────────────────────────────────────────────
       9PM    10PM    11PM    12AM    1AM    2AM    3AM
                                                     ^-- outage
─────────────────────────────────────────────────────────────
Slope: +5.3 MB/min   Duration: 344 min   Tasks: 3 (all same)
```
Three tasks showed the exact same curve with no divergence — ruling out a per-task anomaly. Node.js heap metrics (from process.memoryUsage() exposed on /metrics) stayed flat at ~180 MB. The growing memory was not the V8 heap. It was native memory outside V8: OS socket buffers and TLS contexts.
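In hindsight, the slope alone predicted the kill time. A back-of-the-envelope check using only the chart's numbers:

```typescript
// When does a 5.3 MB/min climb starting at 240 MB hit the 2,048 MB hard limit?
const startMb = 240;        // memory right after the 9:08 PM deploy
const limitMb = 2048;       // ECS hard limit
const slopeMbPerMin = 5.3;  // observed CloudWatch slope

const minutesToOom = (limitMb - startMb) / slopeMbPerMin;
console.log(minutesToOom.toFixed(0)); // ≈ 341 minutes, ~5.7 hours after deploy
```

That lands within a few minutes of the observed 344-minute window between the deploy and the OOM kill.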
We ran INFO clients against the production Redis instance:
```shell
$ redis-cli -u ${REDIS_URL} INFO clients
# Clients
connected_clients:8847
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0
```
A service with 3 ECS tasks should have had 3 Redis connections. It had 8,847.
Root Cause: createClient() Called on Every SSR Request
Six hours earlier, a developer had added server-side caching to a product listing page. The Redis client was instantiated inside the async function rather than at module level:
```typescript
// BAD: runs on EVERY request — new client, new TCP+TLS socket, never closed
export async function getServerSideProps() {
  const { createClient } = await import('redis');
  const client = createClient({
    url: process.env.REDIS_URL,
    socket: { tls: true },
  });
  await client.connect();

  const cached = await client.get('products:all');

  // client.disconnect() never called — the local variable goes out of scope,
  // but the connected socket keeps the client alive and the FD stays open
  return {
    props: { products: cached ? JSON.parse(cached) : [] },
  };
}
```
createClient() was called on every SSR request. With TLS, each call opened a TCP connection and negotiated a TLS session. The local client variable went out of scope when the function returned, but that freed nothing: the live socket holds active handles in the event loop, keeping the client object reachable, so it was never garbage collected and its file descriptor was never closed. Each socket would persist until Redis closed it from the other end, and production Redis ran with timeout 0 (idle timeout disabled), a deliberate setting to prevent unexpected drops on long-running background jobs. So Redis never did.
At 340 req/s with ~80 ms SSR latency, roughly 27 requests were in flight at any moment, and the cluster served roughly 7.0 million requests over the 344-minute window. Only the product listing route ran the leaking getServerSideProps, though: its 8,847 leaked connections work out to about 26 new connections per minute across the three tasks. Dividing each task's 1,808 MB of growth by its ~2,949 leaked connections gives roughly 0.6 MB of OS socket buffer and TLS session memory per connection — which reproduces the 5.3 MB/min CloudWatch slope almost exactly.
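These numbers have to agree with each other, so they are worth cross-checking. A quick script using only figures already quoted above:

```typescript
// Cross-check the outage numbers (all inputs quoted earlier in the post)
const windowMin = 344;           // 9:08 PM deploy -> 3:12 AM OOM kill
const leakedConnections = 8847;  // from INFO clients
const tasks = 3;
const growthPerTaskMb = 2048 - 240; // 1,808 MB climb per task

const leaksPerMin = leakedConnections / windowMin;                     // cluster-wide
const mbPerConnection = growthPerTaskMb / (leakedConnections / tasks); // per TLS socket
const slopePerTask = (leaksPerMin / tasks) * mbPerConnection;          // MB/min per task

console.log(leaksPerMin.toFixed(0));     // ~26 leaked connections per minute
console.log(mbPerConnection.toFixed(2)); // ~0.61 MB per connection
console.log(slopePerTask.toFixed(1));    // 5.3 — the CloudWatch slope
```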
```
BROKEN: New Redis Client Per SSR Request
═══════════════════════════════════════════════════════
  Browser
    |
    v
  ECS Task (Node.js process)
    |
    v
  getServerSideProps()
    |
    +--> createClient()           <-- NEW client every request
    |    client.connect()         <-- NEW TCP+TLS socket opened
    |    client.get(key)
    |    return props
    |    [client goes out of scope]
    |    [OS socket NEVER closed] <-- LEAK
    v
  Redis: 8,847 open connections
  ECS:   2,048 MB hard limit --> OOM KILL
═══════════════════════════════════════════════════════

FIXED: Module-Level Singleton (1 connection per task)
═══════════════════════════════════════════════════════
  [Module load]
    |
    +--> createClient() ONCE
    |    client.connect() ONCE
    v
  client singleton (shared across all requests)
    |
    |    Browser
    |      |
    |      v
    |    ECS Task --> getServerSideProps()
    |                       |
    +<---------------------+  reuse existing client
    |
    v
  Redis: 3 open connections (1 per task)
  ECS:   ~180 MB stable
═══════════════════════════════════════════════════════
```
Architecture Fix: Singleton Client with Cold-Start Guard
The fix was a module-level singleton with a concurrent-initialization guard — one client per Node.js process, initialized once, reused across all requests regardless of how many arrive during the initial cold start:
```typescript
import { createClient, RedisClientType } from 'redis';

let client: RedisClientType | null = null;
let connectPromise: Promise<void> | null = null;

export async function getRedisClient(): Promise<RedisClientType> {
  if (client?.isReady) return client;

  // If a connect is already in flight, wait for it (thundering-herd guard)
  if (connectPromise) {
    await connectPromise;
    return client!;
  }

  client = createClient({
    url: process.env.REDIS_URL,
    socket: {
      tls: process.env.NODE_ENV === 'production',
      reconnectStrategy: (retries) => Math.min(retries * 50, 2000),
    },
  });

  client.on('error', (err) => console.error('[Redis] error:', err));

  connectPromise = client
    .connect()
    .then(() => undefined) // normalize to Promise<void>
    .catch((err) => {
      client = null; // drop the half-initialized client so the next call retries cleanly
      throw err;
    })
    .finally(() => { connectPromise = null; });

  await connectPromise;
  return client!;
}

// Graceful drain on ECS SIGTERM: quit() flushes pending commands before closing,
// unlike disconnect(), which drops the socket immediately
process.on('SIGTERM', async () => {
  if (client?.isReady) await client.quit();
});
```
Why a singleton over a connection pool? Redis is single-threaded — one connection handles concurrent pipelined commands efficiently. A pool adds overhead and additional open handles with no throughput benefit for SSR workloads. The connectPromise guard is critical: during ECS cold starts, multiple SSR requests can arrive before the first connection completes. Without it, each request races to call createClient() — recreating the exact pattern we just fixed.
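The guard is easy to verify in isolation. Below is a minimal, Redis-free sketch — fakeConnect and its 10 ms delay are stand-ins for the real TLS handshake — showing that fifty concurrent cold-start calls initialize exactly once:

```typescript
// Stand-in for the real client: a shared resource with an async initializer
let resource: { isReady: boolean } | null = null;
let initPromise: Promise<void> | null = null;
let factoryCalls = 0;

async function fakeConnect(): Promise<void> {
  factoryCalls++;
  await new Promise((r) => setTimeout(r, 10)); // simulate handshake latency
  resource = { isReady: true };
}

async function getResource() {
  if (resource?.isReady) return resource;
  // Same pattern as connectPromise above: join the in-flight init, never race it
  if (!initPromise) {
    initPromise = fakeConnect().finally(() => { initPromise = null; });
  }
  await initPromise;
  return resource!;
}

// 50 concurrent cold-start requests -> the factory runs exactly once
Promise.all(Array.from({ length: 50 }, () => getResource())).then(() => {
  console.log(factoryCalls); // 1
});
```

Remove the `if (!initPromise)` branch and the same test reports 50 factory calls: the exact per-request pattern the outage started with.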
We also tightened the ECS task definition memory envelope:
```json
{
  "memory": 1024,
  "memoryReservation": 512,
  "environment": [
    {
      "name": "NODE_OPTIONS",
      "value": "--max-old-space-size=768"
    }
  ]
}
```
Setting --max-old-space-size=768 explicitly caps the V8 heap and forces earlier GC cycles. Previously, Node.js defaulted to ~1.4 GB heap on a 2 GB container, leaving almost no headroom for native handles or the Next.js route cache before hitting the ECS hard limit. The 1,024 MB hard limit now sits 256 MB above the explicit V8 ceiling — enough headroom to trigger a CloudWatch alarm before an OOM kill.
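A cheap guard worth pairing with this (a sketch, not something from the original incident): log the effective V8 heap ceiling at boot, so the NODE_OPTIONS cap and the task definition cannot drift apart silently.

```typescript
import * as v8 from 'node:v8';

// Log the effective V8 heap ceiling at process start. With
// NODE_OPTIONS="--max-old-space-size=768" this lands near 768 MB;
// without the flag, Node picks a default based on available memory.
const heapLimitMb = v8.getHeapStatistics().heap_size_limit / (1024 * 1024);
console.log(`[boot] V8 heap ceiling: ${heapLimitMb.toFixed(0)} MB`);

// Fail fast if the cap is missing in production (1024 MB is the ECS hard limit)
if (process.env.NODE_ENV === 'production' && heapLimitMb > 1024) {
  throw new Error('V8 heap ceiling exceeds the ECS hard limit; set --max-old-space-size');
}
```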
Why Staging Didn't Catch This
Staging Redis had timeout 300 (the Redis default). Leaked connections were evicted every 5 minutes — memory never climbed enough to alarm. Production Redis had timeout 0 (never evict idle connections) — a deliberate setting to avoid dropping long-running background job connections. One config delta made the leak completely invisible in staging.
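One guard that would have surfaced the delta (an illustrative sketch, not something we ran at the time): fetch the timeout setting from each environment in CI and fail on drift. CONFIG GET timeout returns a flat [key, value] array; the hard-coded replies below mirror our staging and production settings.

```typescript
// Parse the flat reply of `CONFIG GET timeout` and compare environments
function parseTimeout(reply: string[]): number {
  const i = reply.indexOf('timeout');
  if (i === -1 || i + 1 >= reply.length) throw new Error('timeout not in reply');
  return Number(reply[i + 1]);
}

const prodTimeout = parseTimeout(['timeout', '0']);      // never evict idle clients
const stagingTimeout = parseTimeout(['timeout', '300']); // evict after 5 minutes

console.log(prodTimeout === stagingTimeout); // false: the drift that hid the leak
```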
Our CI load test ran for 90 seconds. A 5.3 MB/min leak produces 8 MB over 90 seconds — undetectable against normal variance. The same test run for 15 minutes would have shown 80 MB of growth and caught it immediately.
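The soak assertion itself is small. A sketch of the growth check — the once-per-minute sampling and the 5-sample window are assumptions, not the exact CI code:

```typescript
// Percent memory growth over the final `windowCount` samples of a soak run
function memoryGrowthPct(samplesMb: number[], windowCount: number): number {
  const window = samplesMb.slice(-windowCount);
  return ((window[window.length - 1] - window[0]) / window[0]) * 100;
}

// Synthetic 15-minute run, sampled once per minute, with a 5.3 MB/min leak
const leaky = Array.from({ length: 15 }, (_, i) => 240 + 5.3 * i);
console.log(memoryGrowthPct(leaky, 5).toFixed(1)); // 7.2 -> fails a 5% gate
```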
Lessons Learned
- Module-level singletons for all I/O clients — Redis, database connections, HTTP agents: initialize once at module load, never inside request handlers. Dynamic `import()` inside an async function is a red flag: the module itself is cached, but any client you construct in the handler body is created fresh on every request.
- Staging must mirror production Redis config — `timeout 0` in production vs `timeout 300` in staging made this connection leak completely invisible pre-deploy. Treat connection timeout config as a correctness concern, not just an ops preference.
- Add a memory soak test to CI — a 15-minute constant-load test with a memory growth assertion (<5% increase over the final 5 minutes) would have caught this before merge. We added it the following sprint.
- Monitor Redis `connected_clients` as a canary metric — client count should be flat relative to ECS task count, not proportional to request rate. A rising ratio is a connection leak, detectable hours before memory becomes critical.
- Set ECS `memoryReservation` plus a CloudWatch alarm at 80% — hard memory limits are silent killers. A soft reservation plus an alarm at 80% of the hard limit gives you a window to diagnose before the OOM kill fires.
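The canary from the fourth lesson reduces to a one-line invariant. A sketch — connectionsPerTask and slack are tuning assumptions, not values from the incident:

```typescript
// Alert when Redis connected_clients exceeds what the running tasks can explain
function connectionLeakSuspected(
  connectedClients: number,
  runningTasks: number,
  connectionsPerTask = 1, // singleton client -> 1 per task
  slack = 5,              // headroom for redis-cli sessions, monitors, etc.
): boolean {
  return connectedClients > runningTasks * connectionsPerTask + slack;
}

console.log(connectionLeakSuspected(8847, 3)); // true  — the outage state
console.log(connectionLeakSuspected(4, 3));    // false — healthy
```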
The 47-minute outage cost us a postmortem, a Redis monitoring dashboard, a soak test in CI, and a team convention: no I/O client initialization inside request handlers, ever. We added an ESLint rule to flag createClient calls inside async functions. It's caught two similar patterns in the three months since.