The Invisible Bottleneck: How One Sync Call Froze Our Node.js API
The alert came in on a Tuesday morning. Not a crash — we'd have preferred a crash. Instead, it was the worst kind of production problem: everything was still running, but it felt like the server was wading through concrete. P99 latency had climbed from 120ms to over 900ms overnight. No errors. No memory spikes. No CPU thrashing. Just slow.
This is the story of how a four-character function call — Sync — held our Node.js API hostage for three weeks, and why understanding the event loop changed the way I write backend code permanently.
The Setup
We were running a multi-tenant SaaS API — Express.js on Node 18, deployed on a DigitalOcean Kubernetes cluster, handling around 400–600 requests per second during peak hours. The stack was mature. We had APM, distributed tracing, and a solid alerting setup. We were not rookies.
Three weeks before the incident, we'd shipped a new feature: per-tenant feature flags. Instead of hardcoding feature availability, each API request would now check a JSON config to determine what the requesting tenant had access to. Simple, clean, elegant. Or so we thought.
The implementation lived in a utility module:
import fs from 'fs';
import path from 'path';
// Called on every authenticated request
export function getFeatureFlags(tenantId: string): FeatureFlags {
const configPath = path.join(__dirname, '../config/features.json');
// 🚨 This line. Right here. This is the problem.
const raw = fs.readFileSync(configPath, 'utf-8');
const config = JSON.parse(raw);
return config[tenantId] ?? config['default'];
}
In local development, the JSON file was 12KB. The function took about 0.8ms. With our test traffic of 5–10 req/sec, that was invisible. In production, the file had grown to 340KB as we onboarded more tenants. The function now took 4–8ms per call — and we were calling it 500 times per second.
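It's easy to see, after the fact, how that 0.8ms became 4–8ms. Here's a throwaway benchmark sketch that generates configs of two sizes and times a single synchronous read-and-parse. The file contents, tenant counts, and the `timeSyncRead` helper are all illustrative, not our actual tooling; the point is only that block time grows with file size.

```typescript
import fs from 'fs';
import os from 'os';
import path from 'path';

// One-off benchmark script (sync I/O is fine here; this isn't a server).
// Generates a throwaway JSON config with `tenants` entries, then times
// a single synchronous read + parse. Bigger file, longer block.
function timeSyncRead(tenants: number): number {
  const config: Record<string, { beta: boolean }> = {};
  for (let i = 0; i < tenants; i++) {
    config[`tenant-${i}`] = { beta: i % 2 === 0 };
  }
  const file = path.join(os.tmpdir(), `features-${tenants}.json`);
  fs.writeFileSync(file, JSON.stringify(config));

  const start = process.hrtime.bigint();
  JSON.parse(fs.readFileSync(file, 'utf-8')); // the blocking pair
  return Number(process.hrtime.bigint() - start) / 1e6; // elapsed ms
}

console.log(`1k tenants: ${timeSyncRead(1_000).toFixed(2)}ms`);
console.log(`100k tenants: ${timeSyncRead(100_000).toFixed(2)}ms`);
```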
Why This Breaks Everything (The Event Loop Explained)
To understand why this was catastrophic, you need to understand what Node.js actually is: a single-threaded runtime built around a non-blocking event loop. The event loop is the engine that makes Node fast — it allows one thread to handle thousands of concurrent connections by never waiting. It processes a request, delegates I/O to the OS, handles the next request, and comes back when the I/O is done.
The critical rule: never block the event loop. When you call a synchronous function, the entire loop stops. Nothing else runs. Every other in-flight request freezes in place until your synchronous operation completes.
NODE.JS EVENT LOOP — HEALTHY (async I/O)
─────────────────────────────────────────────────────────
Request A ──► [Handler starts] ──► [Awaits DB] ──────────────► [Responds]
│ ▲
▼ │
Request B ──────────────────────► [Handler runs while A waits] ─────┘
Request C ──────────────────────────────────────────────────────► [Handler runs]
Event loop keeps spinning. No waiting. High throughput. ✓
─────────────────────────────────────────────────────────
NODE.JS EVENT LOOP — BLOCKED (sync I/O)
─────────────────────────────────────────────────────────
Request A ──► [readFileSync — blocks loop for 7ms] ──► [Responds]
│
▼
Request B ──► [WAITING ·················] ──► [Responds, 7ms late]
Request C ──► [WAITING ·······················] ──► [Responds, 7ms late]
Request D ──► [WAITING ·····························] ──► ...
Request E ──► [WAITING ···································] ──► ...
500 req/sec × 7ms block = event loop starved. ✗
Do the math: 500 requests per second, each blocking the loop for 7ms. That's 3,500ms of blocking per second. There are only 1,000ms in a second. The event loop could never catch up. Every request was queueing behind every other request. P99 latency wasn't 900ms because the operation was slow — it was 900ms because requests were spending 780ms just waiting for their turn.
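The failure mode is easy to reproduce in miniature: schedule a short timer, then hold the loop with pure CPU. The timer is due in 10ms but cannot fire until the synchronous work finishes. A minimal sketch:

```typescript
// A synchronous busy-wait delays an unrelated 10ms timer.
// This is the production incident in miniature.
const scheduled = Date.now();

setTimeout(() => {
  const lateBy = Date.now() - scheduled - 10;
  console.log(`10ms timer fired roughly ${lateBy}ms late`);
}, 10);

// Hold the event loop for ~200ms of pure CPU.
const start = Date.now();
while (Date.now() - start < 200) {
  // spin: nothing else runs, including the timer above
}
```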
Three Weeks of Wrong Theories
Here's the part that still stings: we spent three weeks blaming the wrong things.
- Week 1: We blamed PostgreSQL. Added read replicas. Latency didn't move.
- Week 1: We blamed Redis connection pool exhaustion. Tuned it obsessively. Nothing.
- Week 2: We blamed our ORM. Rewrote three hot queries to raw SQL. Slight improvement — placebo.
- Week 2: We blamed DigitalOcean's networking layer. Filed a support ticket. They found nothing wrong.
- Week 3: We upgraded Node from 18.12 to 18.19. Same story.
The APM traces showed a consistent 700ms gap between when a request arrived and when the first line of handler code executed. We kept staring at that gap, convinced it was the network or a middleware issue. It was the queue. Requests were waiting 700ms to even start because the event loop was perpetually occupied.
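In hindsight, one metric would have pointed straight at the problem: event loop delay. Node ships a histogram for exactly this in perf_hooks. A minimal sketch of what we now run in every service (the 5-second reporting interval is just a choice, not a recommendation):

```typescript
import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event-loop delay at 20ms resolution.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Report p50/p99 loop delay periodically. A healthy loop sits in
// single-digit milliseconds; a loop starved by sync work shows
// hundreds. Values from the histogram are in nanoseconds.
setInterval(() => {
  const p50 = (histogram.percentile(50) / 1e6).toFixed(1);
  const p99 = (histogram.percentile(99) / 1e6).toFixed(1);
  console.log(`loop delay p50=${p50}ms p99=${p99}ms`);
  histogram.reset();
}, 5_000).unref(); // unref: don't keep the process alive for this
```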
The breakthrough came when a junior engineer on the team — fresh out of a Node.js fundamentals course — read through the feature flags utility and asked, quietly: "Why are we using readFileSync here?"
I'll be honest. My first reaction was mild condescension. It's a small file, it's fine. Then I pulled up the production config. 340KB. Called 500 times per second. I felt my stomach drop.
The Fix (and the Right Mental Model)
The fix wasn't just swapping readFileSync for readFile. That would be addressing the symptom. The real solution was to never read the file on every request — load it once, cache it in memory, and invalidate the cache on a sensible schedule.
import fs from 'fs/promises';
import path from 'path';
interface FeatureConfig {
[tenantId: string]: FeatureFlags;
}
let cachedConfig: FeatureConfig | null = null;
let cacheLoadedAt: number = 0;
const CACHE_TTL_MS = 60_000; // Refresh every 60 seconds
async function loadConfig(): Promise<FeatureConfig> {
const now = Date.now();
if (cachedConfig && now - cacheLoadedAt < CACHE_TTL_MS) {
return cachedConfig;
}
const configPath = path.join(__dirname, '../config/features.json');
const raw = await fs.readFile(configPath, 'utf-8'); // Non-blocking
cachedConfig = JSON.parse(raw);
cacheLoadedAt = now;
return cachedConfig!;
}
// Now async — caller must await it
export async function getFeatureFlags(tenantId: string): Promise<FeatureFlags> {
const config = await loadConfig();
return config[tenantId] ?? config['default'];
}
We deployed this at 2:17 AM on a Thursday. Within 90 seconds, P99 latency dropped from 900ms back to 118ms. The event loop was free. Requests stopped queueing. The server — which had been straining under what we thought were inadequate resources — was suddenly idling at 12% CPU.
What the Event Loop Actually Looks Like
After the incident, I spent a day digging into how the event loop actually processes work. Node.js uses libuv under the hood — the event loop has distinct phases that execute in order on every "tick". Understanding this makes it obvious why blocking anywhere is so destructive.
NODE.JS EVENT LOOP PHASES (per tick)
──────────────────────────────────────────────────────────
┌─────────────────────────────────────┐
│ timers phase │ ← setTimeout / setInterval callbacks
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ pending callbacks │ ← I/O errors from last tick
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ idle / prepare │ ← internal use
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ poll phase │ ← retrieve new I/O events ← YOU WANT TO BE HERE
│ (waits here if queue is empty) │ callbacks fire; new requests arrive
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ check phase │ ← setImmediate callbacks
└───────────────────┬─────────────────┘
▼
┌─────────────────────────────────────┐
│ close callbacks │ ← socket.on('close', ...)
└───────────────────┬─────────────────┘
│
└──────────────► next tick (process.nextTick / Promises)
A synchronous block anywhere in this loop halts ALL phases.
Every in-flight request waits. Every timer drifts. Everything freezes.
The Rules I Now Live By
This incident rewired how I review Node.js code — mine and everyone else's. Here's what I look for now:
- Grep for Sync. readFileSync, writeFileSync, execSync, spawnSync — any of these in a request path is a red flag. Fine for CLI tools, startup scripts, or build steps. Never in a hot path.
- JSON.parse on large payloads is also blocking. Parsing a 5MB JSON response synchronously blocks the event loop too. Stream large payloads, or offload parsing to a worker thread.
- CPU-intensive work belongs in worker threads. Image resizing, cryptographic operations, report generation — anything that takes more than ~1ms of pure CPU belongs in worker_threads or a dedicated process.
- Cache aggressively at the application layer. Reading config on every request is never the right answer. Load once, cache with TTL, refresh in background.
- Measure, don't assume. We assumed the bottleneck was the database because databases are usually the bottleneck. Assumptions are expensive. Profile first.
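To make the worker-threads rule concrete, here's a minimal sketch of pushing a CPU-bound sum off the main thread. The worker body is inlined as a string via the `eval: true` option purely to keep the snippet in one file; real code would point the Worker at a separate module, and the workload here is a placeholder:

```typescript
import { Worker } from 'worker_threads';

// Worker body as source text (eval: true), so this sketch is one file.
// The loop is pure CPU, but it burns on the worker thread, not the
// main event loop.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  let sum = 0;
  for (let i = 0; i < workerData.n; i++) sum += i;
  parentPort.postMessage(sum);
`;

function sumOffThread(n: number): Promise<number> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { n } });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// The main event loop keeps ticking while the worker grinds.
sumOffThread(50_000_000).then((sum) => console.log('sum:', sum));
```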
The Human Part
The junior engineer who spotted it got a public shoutout in our team retrospective. The lesson wasn't lost on anyone — sometimes the most dangerous assumptions are made by the most experienced people in the room. Expertise can breed blind spots. Fresh eyes matter.
We also introduced a lint rule after this:
// Disallow synchronous fs methods in non-script contexts
"no-restricted-imports": ["error", {
"paths": [{
"name": "fs",
"importNames": ["readFileSync", "writeFileSync", "existsSync", "mkdirSync"],
"message": "Use fs/promises async methods instead. Sync I/O blocks the event loop."
}]
}]
It's not a perfect rule — there are legitimate uses for sync I/O — but it forces a conscious decision every time someone reaches for it.
The Takeaway
The event loop is Node's superpower. It's the reason a single-threaded runtime can handle thousands of concurrent connections. But it's a contract — you agree not to block it, and in return you get extraordinary concurrency. Break that contract, even briefly, even innocently, and you pay for it in latency felt by every user.
The four characters that cost us three weeks of debugging and several nights of lost sleep? Sync. That's it. A suffix. A habit. A small shortcut that felt like nothing in development and felt like everything in production.
The next time you're about to type readFileSync in a web server, pause. Ask yourself: am I in a hot path? Could this call happen more than once per second? If yes — reach for the async version, add a cache, and spare your future self three weeks of chasing ghosts.