The Night the Cluster Went Silent
March 3, 2026 · ElasticSearch · 14 min read


It was 11:47 PM on a Tuesday. My phone lit up with three PagerDuty alerts in rapid succession. Search is down. Response time: 32 seconds. Then 60. Then nothing.

40,000 active users. A product where search isn't a feature — it's the feature. And somewhere deep inside our ElasticSearch cluster, something had gone very, very quiet.

This is the story of that night. The debugging, the panic, the three-hour war room that probably gave our CTO a new grey hair. And more importantly — the architectural decisions we made the morning after that meant it never happened again.


The Setup

We were running a 6-node ElasticSearch cluster on AWS — 3 master-eligible nodes, 3 data nodes. About 200 million documents across 12 indices. Nothing exotic. We'd been running this config for two years without incident.

┌─────────────────────────────────────────────────────────────────┐
│                    Our Cluster (Before)                         │
│                                                                 │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐                  │
│   │ Master 1 │   │ Master 2 │   │ Master 3 │  ← Eligible       │
│   │ (Active) │   │(Standby) │   │(Standby) │    masters        │
│   └────┬─────┘   └──────────┘   └──────────┘                  │
│        │                                                        │
│        ▼         Shard Distribution                             │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐                  │
│   │  Data 1  │   │  Data 2  │   │  Data 3  │                  │
│   │ 34 shards│   │ 33 shards│   │ 33 shards│                  │
│   └──────────┘   └──────────┘   └──────────┘                  │
│                                                                 │
│   Total: 200M docs · 12 indices · ~180GB on disk               │
└─────────────────────────────────────────────────────────────────┘
  

The query pattern was straightforward: full-text search with filters, aggregations for faceted results, and a light geo-distance query for location-based ranking. P99 latency was hovering around 180ms. Fine.
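In Elasticsearch query DSL terms, that pattern looks roughly like this (the index name matches the story; field names and values are illustrative, not our real mapping):

```
GET /products_v3/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title": "wireless headphones" } },
      "filter": { "term":  { "category": "electronics" } },
      "should": {
        "distance_feature": {
          "field": "store_location",
          "origin": [-73.98, 40.75],
          "pivot": "5km"
        }
      }
    }
  },
  "aggs": {
    "by_brand": { "terms": { "field": "brand" } }
  }
}
```

The `distance_feature` clause boosts nearby results without filtering anything out, which is what "light geo-distance ranking" amounts to in practice.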

The Trigger Nobody Noticed

Six days before the outage, someone on the team — in good faith — ran a reindex to update our mapping. They added a new nested field for product variants. The reindex completed successfully. Monitoring looked clean. Everyone went home.

What nobody noticed: the new index had been created with 1 shard and 0 replicas. The index template had been updated, but the override settings on the reindex command bypassed it.

BEFORE reindex:          AFTER reindex (oops):
┌─────────────────┐      ┌─────────────────┐
│ products_v3     │      │ products_v4     │
│ Shards:  5      │  →   │ Shards:  1      │ ← 🔴 Single shard!
│ Replicas: 1     │      │ Replicas: 0     │ ← 🔴 No replicas!
│ Nodes:  3       │      │ Node: Data 1    │ ← all on one node
└─────────────────┘      └─────────────────┘

Nobody noticed because search still worked fine.
The problem was coming — just slowly.
  

For six days, all search traffic for products routed through a single shard on a single node. The node's heap started creeping up. Slowly. Imperceptibly. Until it wasn't.
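A ten-second check right after the reindex would have surfaced this. `filter_path` is standard Elasticsearch; given the story above, the response would have shown shards at "1" and replicas at "0":

```
GET /products_v4/_settings?filter_path=**.number_of_shards,**.number_of_replicas
```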

11:47 PM — The Chain Reaction

Here's what happened in the 11 minutes before full blackout:

11:36 PM  Data Node 1 heap reaches 85%
          GC starts running more frequently
          Query latency creeps from 180ms → 450ms
          
11:39 PM  A batch job triggers 500 concurrent searches
          (normal — runs every night at this time)
          
11:41 PM  Data Node 1 heap hits 95%
          Full GC kicks in — node pauses for 4.2 seconds
          ElasticSearch marks node as unresponsive
          
11:43 PM  Master tries to relocate the shard
          But: no replicas exist, no other node has the data
          Shard goes RED — unassigned, unavailable
          
11:45 PM  products_v4 index goes RED
          All product searches return 503
          
11:47 PM  PagerDuty fires
          My phone explodes 📱
  

The War Room

Three of us on a Zoom call. Me, the backend lead, and our CTO who happened to still be online. The first instinct was wrong — and it made things worse.

"Let's restart the node."

Don't restart the node.

Restarting Data Node 1 while it held the only copy of the products index would have triggered a full shard recovery — reading everything back from disk. On a heap-starved node. It would have taken 45 minutes minimum, and might never have completed.

What we actually did — the right call:

Step 1: Stop the bleeding
────────────────────────────────────────────────────────────────
PUT /_cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": "none" } }

→ Prevent ElasticSearch from trying to move the shard around
  and stressing the node further.


Step 2: Buy the node some air
────────────────────────────────────────────────────────────────
GET /_nodes/data-node-1/hot_threads

→ Identify what's consuming heap. 
  Turned out: a script_score query doing float[] operations.
  Disabled it temporarily via feature flag.


Step 3: Relieve heap pressure, carefully
────────────────────────────────────────────────────────────────
POST /products_v4/_flush?wait_if_ongoing=true

→ Flush pending operations to reduce in-memory pressure.
  Heap drops from 95% → 71%.
  Node stabilises.


Step 4: Add replicas — NOW
────────────────────────────────────────────────────────────────
PUT /products_v4/_settings
{ "number_of_replicas": 1 }

→ ElasticSearch starts copying shards to Data Node 2.
  This takes 18 minutes.
  During this time: still degraded but not dead.


Step 5: Search is back
────────────────────────────────────────────────────────────────
2:03 AM — Replica assignment completes.
          Green. All shards assigned.
          Latency: 220ms (slightly elevated, acceptable).
  

Total downtime: 2 hours 16 minutes. The root cause: a reindex command with hardcoded settings that overrode the index template. Six days of slow poison.


What We Built the Morning After

At 9 AM the next day, the three of us sat down with coffee and made a pact: this class of configuration failure would never be possible again. Here's the architecture we put in place.

1. Index Template Enforcement

The root cause was a reindex command that bypassed the template. We fixed this at the code level — all reindex operations now go through a single Python utility that:

  • Reads settings from the index template before reindexing
  • Validates number_of_shards ≥ 3 and number_of_replicas ≥ 1 before proceeding
  • Refuses to run if validation fails — with an error that names the exact misconfiguration
reindex_safe.py
def validate_index_config(index_name: str, settings: dict) -> None:
    shards = settings.get("number_of_shards", 0)
    replicas = settings.get("number_of_replicas", -1)
    
    if int(shards) < 3:
        raise ValueError(
            f"[SAFETY] {index_name}: number_of_shards={shards}. "
            f"Minimum is 3. Reindex aborted."
        )
    if int(replicas) < 1:
        raise ValueError(
            f"[SAFETY] {index_name}: number_of_replicas={replicas}. "
            f"Minimum is 1. Reindex aborted."
        )

2. Heap Monitoring with Proactive Alerts

Our previous monitoring alerted at 90% heap. By then, you're already in trouble. We moved to a tiered alerting system:

Heap %    Action
──────────────────────────────────────────────────
< 70%     ✅ All good. No action.

70–79%    📊 Log to Datadog dashboard.
          Weekly review in team standup.

80–84%    ⚠️  Slack alert to #infra-alerts.
          On-call engineer reviews within 1 hour.

85–89%    🔔 PagerDuty LOW priority.
          On-call must acknowledge within 30 min.
          Begin heap reduction procedures.

90%+      🚨 PagerDuty HIGH priority.
          Immediate response required.
          Auto-scales data nodes if possible.
  
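The tier boundaries above reduce to a few comparisons. A minimal sketch (thresholds are from the table; the tier names are ours, not PagerDuty or Datadog API values):

```python
# Map a node's heap usage percentage to an alert tier.
# Thresholds match the table above; tier labels are illustrative.

def heap_alert_tier(heap_pct: float) -> str:
    if heap_pct >= 90:
        return "pagerduty-high"   # immediate response, maybe auto-scale
    if heap_pct >= 85:
        return "pagerduty-low"    # acknowledge within 30 min
    if heap_pct >= 80:
        return "slack"            # on-call reviews within 1 hour
    if heap_pct >= 70:
        return "dashboard"        # logged, reviewed weekly
    return "ok"                   # no action
```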

3. Automated Cluster Health Check

Every 5 minutes, a Lambda function runs a health check against /_cluster/health and /_cat/shards?h=index,shard,prirep,state,node. If any shard is in UNASSIGNED state for more than 2 minutes, it fires an alert with the exact shard and index name — before the cascade begins.

┌─────────────────────────────────────────────────────────────────┐
│                   Health Check Flow                             │
│                                                                 │
│  ┌──────────┐    every    ┌─────────────────────────────────┐  │
│  │  Lambda  │────5 min───▶│  GET /_cluster/health           │  │
│  └──────────┘             │  GET /_cat/shards?state=UNASSIGN│  │
│                           └──────────────┬──────────────────┘  │
│                                          │                      │
│                              ┌───────────┴───────────┐         │
│                              │                       │         │
│                         GREEN/YELLOW              RED / any    │
│                         No unassigned           UNASSIGNED     │
│                              │                   shard > 2min  │
│                              ▼                       │         │
│                          ✅ OK                        ▼         │
│                                             ┌────────────────┐ │
│                                             │ PagerDuty +    │ │
│                                             │ Slack with     │ │
│                                             │ shard details  │ │
│                                             └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
  
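The Lambda's core logic is two small steps: parse the `_cat/shards` output, then check how long each unassigned shard has been unassigned. A sketch under assumptions — a plain dict stands in for the Lambda's persistent state, and the function names are ours:

```python
# Parse `GET /_cat/shards?h=index,shard,prirep,state,node` output.
# UNASSIGNED rows have no node column, so they are 4 fields wide.

def find_unassigned(cat_shards_text: str) -> list[tuple[str, str]]:
    """Return (index, shard) pairs currently in UNASSIGNED state."""
    hits = []
    for line in cat_shards_text.strip().splitlines():
        cols = line.split()
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            hits.append((cols[0], cols[1]))
    return hits


def should_alert(unassigned, first_seen, now, grace_sec=120):
    """Alert on shards that stay UNASSIGNED past the 2-minute grace.

    first_seen maps (index, shard) -> timestamp of first sighting;
    in the real Lambda this lives in a persistent store between runs.
    """
    alerts = []
    for key in unassigned:
        first = first_seen.setdefault(key, now)
        if now - first > grace_sec:
            alerts.append(key)
    return alerts
```

A transient yellow blip clears itself inside the grace window; anything that survives two consecutive checks fires with the exact index and shard in the message.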

4. Circuit Breaker at the Application Layer

The batch job that triggered the cascade — 500 concurrent searches at 11:39 PM — had no throttle. We added a Redis-backed circuit breaker in front of all ElasticSearch calls:

Request comes in
      │
      ▼
┌─────────────────────┐
│  Check circuit      │    State stored in Redis
│  breaker state      │    TTL: 30 seconds
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    │             │
  CLOSED        OPEN
 (normal)    (tripped)
    │             │
    ▼             ▼
 Run query    Return cached
              results or
              503 with
              Retry-After
    │
    ▼
Success? → Reset error count
Failed?  → Increment counter
           If counter > threshold:
           TRIP the breaker
           Notify Slack
  
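The state machine above is small enough to sketch in full. Here an in-memory attribute stands in for the Redis key (in production the 30-second TTL is Redis's, not ours), and the error threshold is illustrative:

```python
import time


class SearchBreaker:
    """Minimal circuit breaker sketch; Redis replaced by local state."""

    def __init__(self, threshold: int = 5, ttl: float = 30.0):
        self.threshold = threshold      # consecutive failures before tripping
        self.ttl = ttl                  # how long the breaker stays OPEN
        self.errors = 0
        self.opened_at: float | None = None   # in Redis: a key with a TTL

    def allow(self) -> bool:
        """CLOSED -> run the query. OPEN -> serve cache or 503 + Retry-After."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.ttl:
                return False            # still OPEN: fail fast
            self.opened_at = None       # TTL expired: let traffic retry
            self.errors = 0
        return True

    def record(self, success: bool) -> None:
        if success:
            self.errors = 0             # any success resets the count
            return
        self.errors += 1
        if self.errors > self.threshold:
            self.opened_at = time.monotonic()   # TRIP (and notify Slack)
```

The batch job now calls `allow()` before every search, so 500 concurrent queries against a struggling node degrade into fast 503s instead of piling onto the heap.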

The Numbers, Six Months Later

Here's what changed after implementing all four of the above:

  • 0 unplanned downtime incidents
  • Earlier warning on heap pressure (alerts start at 80%, not 90%)
  • 7 min average detection-to-alert time
  • 140ms P99 search latency (down from 180ms)

That last one was a surprise. The forced cleanup of bad configurations and the heap monitoring changes led us to also fix a slow query that had been silently degrading performance for months. Sometimes disasters are how you find the things you didn't know to look for.


What I'd Tell Past Me

Four things, in order of importance:

  1. Never run a reindex without verifying shard and replica settings output. Print them. Confirm them. Make a checklist if you have to.
  2. Alert at 80% heap, not 90%. At 90% you're already reacting. At 80% you're preventing.
  3. Every background job that hits ElasticSearch needs a concurrency limit. "This never runs at the same time as traffic" is something that's true until it isn't.
  4. When a node is in distress, don't restart it. Reduce load first. Restart is often the instinct. It's often wrong.

The cluster has been quiet since. Which is exactly how infrastructure should be.

— Darshan
