The Invisible Bottleneck: How One Sync Call Froze Our Node.js API
The alert came in on a Tuesday morning. Not a crash — we'd have preferred a crash. Instead, it was the worst kind of production problem: everything was still running, but it felt like the server was wading through concrete. P99 latency had climbed from 120ms to over 900ms overnight. No errors. No memory spikes. No CPU thrashing. Just slow.
This is the story of how a four-character function call — Sync — held our Node.js API hostage for three weeks, and why understanding the event loop changed the way I write backend code permanently.
The Setup
We were running a multi-tenant SaaS API — Express.js on Node 18, deployed on a DigitalOcean Kubernetes cluster, handling around 400–600 requests per second during peak hours. The stack was mature. We had APM, distributed tracing, and a solid alerting setup. We were not rookies.
Three weeks before the incident, we'd shipped a new feature: per-tenant feature flags. Instead of hardcoding feature availability, each API request would now check a JSON config to determine what the requesting tenant had access to. Simple, clean, elegant. Or so we thought.
The implementation lived in a utility module:
import fs from 'fs';
import path from 'path';
// Called on every authenticated request
export function getFeatureFlags(tenantId: string): FeatureFlags {
const configPath = path.join(__dirname, '../config/features.json');
// 🚨 This line. Right here. This is the problem.
const raw = fs.readFileSync(configPath, 'utf-8');
const config = JSON.parse(raw);
return config[tenantId] ?? config['default'];
}
In local development, the JSON file was 12KB. The function took about 0.8ms. With our test traffic of 5–10 req/sec, that was invisible. In production, the file had grown to 340KB as we onboarded more tenants. The function now took 4–8ms per call — and we were calling it 500 times per second.
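It's easy to see, after the fact, how that 0.8ms became 4–8ms. Here's a throwaway benchmark sketch that generates configs of two sizes and times a single synchronous read-and-parse. The file contents, tenant counts, and the `timeSyncRead` helper are all illustrative, not our actual tooling; the point is only that block time grows with file size.

```typescript
import fs from 'fs';
import os from 'os';
import path from 'path';

// One-off benchmark script (sync I/O is fine here; this isn't a server).
// Generates a throwaway JSON config with `tenants` entries, then times
// a single synchronous read + parse. Bigger file, longer block.
function timeSyncRead(tenants: number): number {
  const config: Record<string, { beta: boolean }> = {};
  for (let i = 0; i < tenants; i++) {
    config[`tenant-${i}`] = { beta: i % 2 === 0 };
  }
  const file = path.join(os.tmpdir(), `features-${tenants}.json`);
  fs.writeFileSync(file, JSON.stringify(config));

  const start = process.hrtime.bigint();
  JSON.parse(fs.readFileSync(file, 'utf-8')); // the blocking pair
  return Number(process.hrtime.bigint() - start) / 1e6; // elapsed ms
}

console.log(`1k tenants: ${timeSyncRead(1_000).toFixed(2)}ms`);
console.log(`100k tenants: ${timeSyncRead(100_000).toFixed(2)}ms`);
```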
Why This Breaks Everything (The Event Loop Explained)
To understand why this was catastrophic, you need to understand what Node.js actually is: a single-threaded runtime built around a non-blocking event loop. The event loop is the engine that makes Node fast — it allows one thread to handle thousands of concurrent connections by never waiting. It processes a request, delegates I/O to the OS, handles the next request, and comes back when the I/O is done.
The critical rule: never block the event loop. When you call a synchronous function, the entire loop stops. Nothing else runs. Every other in-flight request freezes in place until your synchronous operation completes.
NODE.JS EVENT LOOP — HEALTHY (async I/O)
─────────────────────────────────────────────────────────
Request A ──► [Handler starts] ──► [Awaits DB] ──────────────► [Responds]
│ ▲
▼ │
Request B ──────────────────────► [Handler runs while A waits] ─────┘
Request C ──────────────────────────────────────────────────────► [Handler runs]
Event loop keeps spinning. No waiting. High throughput. ✓
─────────────────────────────────────────────────────────
NODE.JS EVENT LOOP — BLOCKED (sync I/O)
─────────────────────────────────────────────────────────
Request A ──► [readFileSync — blocks loop for 7ms] ──► [Responds]
│
▼
Request B ──► [WAITING ·················] ──► [Responds, 7ms late]
Request C ──► [WAITING ·······················] ──► [Responds, 7ms late]
Request D ──► [WAITING ·····························] ──► ...
Request E ──► [WAITING ···································] ──► ...
500 req/sec × 7ms block = event loop starved. ✗
Do the math: 500 requests per second, each blocking the loop for 7ms. That's 3,500ms of blocking per second. There are only 1,000ms in a second. The event loop could never catch up. Every request was queueing behind every other request. P99 latency wasn't 900ms because the operation was slow — it was 900ms because requests were spending 780ms just waiting for their turn.
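The failure mode is easy to reproduce in miniature: schedule a short timer, then hold the loop with pure CPU. The timer is due in 10ms but cannot fire until the synchronous work finishes. A minimal sketch:

```typescript
// A synchronous busy-wait delays an unrelated 10ms timer.
// This is the production incident in miniature.
const scheduled = Date.now();

setTimeout(() => {
  const lateBy = Date.now() - scheduled - 10;
  console.log(`10ms timer fired roughly ${lateBy}ms late`);
}, 10);

// Hold the event loop for ~200ms of pure CPU.
const start = Date.now();
while (Date.now() - start < 200) {
  // spin: nothing else runs, including the timer above
}
```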
Three Weeks of Wrong Theories
Here's the part that still stings: we spent three weeks blaming the wrong things.
- Week 1: We blamed PostgreSQL. Added read replicas. Latency didn't move.
- Week 1: We blamed Redis connection pool exhaustion. Tuned it obsessively. Nothing.
- Week 2: We blamed our ORM. Rewrote three hot queries to raw SQL. Slight improvement — placebo.
- Week 2: We blamed DigitalOcean's networking layer. Filed a support ticket. They found nothing wrong.
- Week 3: We upgraded Node from 18.12 to 18.19. Same story.
The APM traces showed a consistent 700ms gap between when a request arrived and when the first line of handler code executed. We kept staring at that gap, convinced it was the network or a middleware issue. It was the queue. Requests were waiting 700ms to even start because the event loop was perpetually occupied.
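In hindsight, one metric would have pointed straight at the problem: event loop delay. Node ships a histogram for exactly this in perf_hooks. A minimal sketch of what we now run in every service (the 5-second reporting interval is just a choice, not a recommendation):

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event-loop delay at 20ms resolution.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Report p50/p99 loop delay periodically. A healthy loop sits in
// single-digit milliseconds; a loop starved by sync work shows
// hundreds. Values from the histogram are in nanoseconds.
setInterval(() => {
  const p50 = (histogram.percentile(50) / 1e6).toFixed(1);
  const p99 = (histogram.percentile(99) / 1e6).toFixed(1);
  console.log(`loop delay p50=${p50}ms p99=${p99}ms`);
  histogram.reset();
}, 5_000).unref(); // unref: don't keep the process alive for this
```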
The breakthrough came when a junior engineer on the team — fresh out of a Node.js fundamentals course — read through the feature flags utility and asked, quietly: "Why are we using readFileSync here?"
I'll be honest. My first reaction was mild condescension. It's a small file, it's fine. Then I pulled up the production config. 340KB. Called 500 times per second. I felt my stomach drop.
The Fix (and the Right Mental Model)
The fix wasn't just swapping readFileSync for readFile. That would be addressing the symptom. The real solution was to never read the file on every request — load it once, cache it in memory, and invalidate the cache on a sensible schedule.
import fs from 'fs/promises';
import path from 'path';
interface FeatureConfig {
[tenantId: string]: FeatureFlags;
}
let cachedConfig: FeatureConfig | null = null;
let cacheLoadedAt: number = 0;
const CACHE_TTL_MS = 60_000; // Refresh every 60 seconds
async function loadConfig(): Promise<FeatureConfig> {
const now = Date.now();
if (cachedConfig && now - cacheLoadedAt < CACHE_TTL_MS) {
return cachedConfig;
}
const configPath = path.join(__dirname, '../config/features.json');
const raw = await fs.readFile(configPath, 'utf-8'); // Non-blocking
cachedConfig = JSON.parse(raw);
cacheLoadedAt = now;
return cachedConfig!;
}
// Now async — caller must await it
export async function getFeatureFlags(tenantId: string): Promise<FeatureFlags> {
const config = await loadConfig();
return config[tenantId] ?? config['default'];
}
We deployed this at 2:17 AM on a Thursday. Within 90 seconds, P99 latency dropped from 900ms back to 118ms. The event loop was free. Requests stopped queueing. The server — which had been straining under what we thought were inadequate resources — was suddenly idling at 12% CPU.
What the Event Loop Actually Looks Like
After the incident, I spent a day digging into how the event loop actually processes work. Node.js uses libuv under the hood — the event loop has distinct phases that execute in order on every "tick". Understanding this makes it obvious why blocking anywhere is so destructive.
NODE.JS EVENT LOOP PHASES (per tick)
──────────────────────────────────────────────────────────
┌─────────────────────────────────────┐
│ timers phase │ ← setTimeout / setInterval callbacks
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ pending callbacks │ ← I/O errors from last tick
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ idle / prepare │ ← internal use
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ poll phase │ ← retrieve new I/O events ← YOU WANT TO BE HERE
│ (waits here if queue is empty) │ callbacks fire; new requests arrive
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ check phase │ ← setImmediate callbacks
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ close callbacks │ ← socket.on('close', ...)
└───────────────────┬─────────────────┘
│
└──────────────► next tick (process.nextTick / Promises)
A synchronous block anywhere in this loop halts ALL phases.
Every in-flight request waits. Every timer drifts. Everything freezes.
The Rules I Now Live By
This incident rewired how I review Node.js code — mine and everyone else's. Here's what I look for now:
- Grep for Sync. readFileSync, writeFileSync, execSync, spawnSync — any of these in a request path is a red flag. Fine for CLI tools, startup scripts, or build steps. Never in a hot path.
- JSON.parse on large payloads is also blocking. Parsing a 5MB JSON response synchronously blocks the event loop too. Stream large payloads, or offload parsing to a worker thread.
- CPU-intensive work belongs in worker threads. Image resizing, cryptographic operations, report generation — anything that takes more than ~1ms of pure CPU belongs in worker_threads or a dedicated process.
- Cache aggressively at the application layer. Reading config on every request is never the right answer. Load once, cache with TTL, refresh in background.
- Measure, don't assume. We assumed the bottleneck was the database because databases are usually the bottleneck. Assumptions are expensive. Profile first.
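To make the worker-threads rule concrete, here's a minimal sketch of pushing a CPU-bound sum off the main thread. The worker body is inlined as a string via the `eval: true` option purely to keep the snippet in one file; real code would point the Worker at a separate module, and the workload here is a placeholder:

```typescript
import { Worker } from 'worker_threads';

// Worker body as source text (eval: true), so this sketch is one file.
// The loop is pure CPU, but it burns on the worker thread, not the
// main event loop.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  let sum = 0;
  for (let i = 0; i < workerData.n; i++) sum += i;
  parentPort.postMessage(sum);
`;

function sumOffThread(n: number): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { n } });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// The main event loop keeps ticking while the worker grinds.
sumOffThread(50_000_000).then((sum) => console.log('sum:', sum));
```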
The Human Part
The junior engineer who spotted it got a public shoutout in our team retrospective. The lesson wasn't lost on anyone — sometimes the most dangerous assumptions are made by the most experienced people in the room. Expertise can breed blind spots. Fresh eyes matter.
We also introduced a lint rule after this:
// Disallow synchronous fs methods in non-script contexts
"no-restricted-imports": ["error", {
"paths": [{
"name": "fs",
"importNames": ["readFileSync", "writeFileSync", "existsSync", "mkdirSync"],
"message": "Use fs/promises async methods instead. Sync I/O blocks the event loop."
}]
}]
It's not a perfect rule — there are legitimate uses for sync I/O — but it forces a conscious decision every time someone reaches for it.
The Takeaway
The event loop is Node's superpower. It's the reason a single-threaded runtime can handle thousands of concurrent connections. But it's a contract — you agree not to block it, and in return you get extraordinary concurrency. Break that contract, even briefly, even innocently, and you pay for it in latency felt by every user.
The four characters that cost us three weeks of debugging and several nights of lost sleep? Sync. That's it. A suffix. A habit. A small shortcut that felt like nothing in development and felt like everything in production.
The next time you're about to type readFileSync in a web server, pause. Ask yourself: am I in a hot path? Could this call happen more than once per second? If yes — reach for the async version, add a cache, and spare your future self three weeks of chasing ghosts.