The Night the Cluster Went Silent
March 3, 2026 · ElasticSearch · 14 min read


It was 11:47 PM on a Tuesday. My phone lit up with three PagerDuty alerts in rapid succession. Search is down. Response time: 32 seconds. Then 60. Then nothing.

40,000 active users. A product where search isn't a feature — it's the feature. And somewhere deep inside our ElasticSearch cluster, something had gone very, very quiet.

This is the story of that night. The debugging, the panic, the three-hour war room that probably gave our CTO a new grey hair. And more importantly — the architectural decisions we made the morning after that meant it never happened again.


The Setup

We were running a 6-node ElasticSearch cluster on AWS — 3 master-eligible nodes, 3 data nodes. About 200 million documents across 12 indices. Nothing exotic. We'd been running this config for two years without incident.

┌─────────────────────────────────────────────────────────────────┐
│                    Our Cluster (Before)                         │
│                                                                 │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐                  │
│   │ Master 1 │   │ Master 2 │   │ Master 3 │  ← Eligible       │
│   │ (Active) │   │(Standby) │   │(Standby) │    masters        │
│   └────┬─────┘   └──────────┘   └──────────┘                  │
│        │                                                        │
│        ▼         Shard Distribution                             │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐                  │
│   │  Data 1  │   │  Data 2  │   │  Data 3  │                  │
│   │ 34 shards│   │ 33 shards│   │ 33 shards│                  │
│   └──────────┘   └──────────┘   └──────────┘                  │
│                                                                 │
│   Total: 200M docs · 12 indices · ~180GB on disk               │
└─────────────────────────────────────────────────────────────────┘
  

The query pattern was straightforward: full-text search with filters, aggregations for faceted results, and a light geo-distance query for location-based ranking. P99 latency was hovering around 180ms. Fine.
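In Elasticsearch query DSL terms, that pattern looks roughly like this (the index name matches the story; field names and values are illustrative, not our real mapping):

```
GET /products_v3/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title": "wireless headphones" } },
      "filter": { "term":  { "category": "electronics" } },
      "should": {
        "distance_feature": {
          "field": "store_location",
          "origin": [-73.98, 40.75],
          "pivot": "5km"
        }
      }
    }
  },
  "aggs": {
    "by_brand": { "terms": { "field": "brand" } }
  }
}
```

The `distance_feature` clause boosts nearby results without filtering anything out, which is what "light geo-distance ranking" amounts to in practice.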

The Trigger Nobody Noticed

Six days before the outage, someone on the team — in good faith — ran a reindex to update our mapping. They added a new nested field for product variants. The reindex completed successfully. Monitoring looked clean. Everyone went home.

What nobody noticed: the new index had been created with 1 shard and 0 replicas. The index template had been updated, but the override settings on the reindex command bypassed it.

BEFORE reindex:          AFTER reindex (oops):
┌─────────────────┐      ┌─────────────────┐
│ products_v3     │      │ products_v4     │
│ Shards:  5      │  →   │ Shards:  1      │ ← 🔴 Single shard!
│ Replicas: 1     │      │ Replicas: 0     │ ← 🔴 No replicas!
│ Nodes:  3       │      │ Node: Data 1    │ ← all on one node
└─────────────────┘      └─────────────────┘

Nobody noticed because search still worked fine.
The problem was coming — just slowly.
  

For six days, all search traffic for products routed through a single shard on a single node. The node's heap started creeping up. Slowly. Imperceptibly. Until it wasn't.
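A ten-second check right after the reindex would have surfaced this. `filter_path` is standard Elasticsearch; given the story above, the response would have shown shards at "1" and replicas at "0":

```
GET /products_v4/_settings?filter_path=**.number_of_shards,**.number_of_replicas
```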

11:47 PM — The Chain Reaction

Here's what happened in the 11 minutes before full blackout:

11:36 PM  Data Node 1 heap reaches 85%
          GC starts running more frequently
          Query latency creeps from 180ms → 450ms
          
11:39 PM  A batch job triggers 500 concurrent searches
          (normal — runs every night at this time)
          
11:41 PM  Data Node 1 heap hits 95%
          Full GC kicks in — node pauses for 4.2 seconds
          ElasticSearch marks node as unresponsive
          
11:43 PM  Master tries to relocate the shard
          But: no replicas exist, no other node has the data
          Shard goes RED — unassigned, unavailable
          
11:45 PM  products_v4 index goes RED
          All product searches return 503
          
11:47 PM  PagerDuty fires
          My phone explodes 📱
  

The War Room

Three of us on a Zoom call. Me, the backend lead, and our CTO who happened to still be online. The first instinct was wrong — and it made things worse.

"Let's restart the node."

Don't restart the node.

Restarting Data Node 1 while it held the only copy of the products index would have triggered a full shard recovery — reading everything back from disk. On a heap-starved node. It would have taken 45 minutes minimum, and might never have completed.

What we actually did — the right call:

Step 1: Stop the bleeding
────────────────────────────────────────────────────────────────
PUT /_cluster/settings
{ "transient": { "cluster.routing.rebalance.enable": "none" } }

→ Prevent ElasticSearch from trying to move the shard around
  and stressing the node further.


Step 2: Buy the node some air
────────────────────────────────────────────────────────────────
GET /_nodes/data-node-1/hot_threads

→ Identify what's consuming heap. 
  Turned out: a script_score query doing float[] operations.
  Disabled it temporarily via feature flag.


Step 3: Relieve heap pressure, carefully
────────────────────────────────────────────────────────────────
POST /products_v4/_flush?wait_if_ongoing=true

→ Flush pending operations to reduce in-memory pressure.
  Heap drops from 95% → 71%.
  Node stabilises.


Step 4: Add replicas — NOW
────────────────────────────────────────────────────────────────
PUT /products_v4/_settings
{ "number_of_replicas": 1 }

→ ElasticSearch starts copying shards to Data Node 2.
  This takes 18 minutes.
  During this time: still degraded but not dead.


Step 5: Search is back
────────────────────────────────────────────────────────────────
2:03 AM — Replica assignment completes.
          Green. All shards assigned.
          Latency: 220ms (slightly elevated, acceptable).
  

Total downtime: 2 hours 16 minutes. The root cause: a reindex command with hardcoded settings that overrode the index template. Six days of slow poison.


What We Built the Morning After

At 9 AM the next day, the three of us sat down with coffee and made a pact: this class of configuration failure would never be possible again. Here's the architecture we put in place.

1. Index Template Enforcement

The root cause was a reindex command that bypassed the template. We fixed this at the code level — all reindex operations now go through a single Python utility that:

  • Reads settings from the index template before reindexing
  • Validates number_of_shards ≥ 3 and number_of_replicas ≥ 1 before proceeding
  • Refuses to run if validation fails — with an error that names the exact misconfiguration
reindex_safe.py
def validate_index_config(index_name: str, settings: dict) -> None:
    shards = settings.get("number_of_shards", 0)
    replicas = settings.get("number_of_replicas", -1)
    
    if int(shards) < 3:
        raise ValueError(
            f"[SAFETY] {index_name}: number_of_shards={shards}. "
            f"Minimum is 3. Reindex aborted."
        )
    if int(replicas) < 1:
        raise ValueError(
            f"[SAFETY] {index_name}: number_of_replicas={replicas}. "
            f"Minimum is 1. Reindex aborted."
        )

2. Heap Monitoring with Proactive Alerts

Our previous monitoring alerted at 90% heap. By then, you're already in trouble. We moved to a tiered alerting system:

Heap %    Action
──────────────────────────────────────────────────
< 70%     ✅ All good. No action.

70–79%    📊 Log to Datadog dashboard.
          Weekly review in team standup.

80–84%    ⚠️  Slack alert to #infra-alerts.
          On-call engineer reviews within 1 hour.

85–89%    🔔 PagerDuty LOW priority.
          On-call must acknowledge within 30 min.
          Begin heap reduction procedures.

90%+      🚨 PagerDuty HIGH priority.
          Immediate response required.
          Auto-scales data nodes if possible.
  
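The tier boundaries above reduce to a few comparisons. A minimal sketch (thresholds are from the table; the tier names are ours, not PagerDuty or Datadog API values):

```python
# Map a node's heap usage percentage to an alert tier.
# Thresholds match the table above; tier labels are illustrative.

def heap_alert_tier(heap_pct: float) -> str:
    if heap_pct >= 90:
        return "pagerduty-high"   # immediate response, maybe auto-scale
    if heap_pct >= 85:
        return "pagerduty-low"    # acknowledge within 30 min
    if heap_pct >= 80:
        return "slack"            # on-call reviews within 1 hour
    if heap_pct >= 70:
        return "dashboard"        # logged, reviewed weekly
    return "ok"                   # no action
```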

3. Automated Cluster Health Check

Every 5 minutes, a Lambda function runs a health check against /_cluster/health and /_cat/shards?h=index,shard,prirep,state,node. If any shard is in UNASSIGNED state for more than 2 minutes, it fires an alert with the exact shard and index name — before the cascade begins.

┌─────────────────────────────────────────────────────────────────┐
│                   Health Check Flow                             │
│                                                                 │
│  ┌──────────┐    every    ┌─────────────────────────────────┐  │
│  │  Lambda  │────5 min───▶│  GET /_cluster/health           │  │
│  └──────────┘             │  GET /_cat/shards?state=UNASSIGN│  │
│                           └──────────────┬──────────────────┘  │
│                                          │                      │
│                              ┌───────────┴───────────┐         │
│                              │                       │         │
│                         GREEN/YELLOW              RED / any    │
│                         No unassigned           UNASSIGNED     │
│                              │                   shard > 2min  │
│                              ▼                       │         │
│                          ✅ OK                        ▼         │
│                                             ┌────────────────┐ │
│                                             │ PagerDuty +    │ │
│                                             │ Slack with     │ │
│                                             │ shard details  │ │
│                                             └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
  
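The Lambda's core logic is two small steps: parse the `_cat/shards` output, then check how long each unassigned shard has been unassigned. A sketch under assumptions — a plain dict stands in for the Lambda's persistent state, and the function names are ours:

```python
# Parse `GET /_cat/shards?h=index,shard,prirep,state,node` output.
# UNASSIGNED rows have no node column, so they are 4 fields wide.

def find_unassigned(cat_shards_text: str) -> list[tuple[str, str]]:
    """Return (index, shard) pairs currently in UNASSIGNED state."""
    hits = []
    for line in cat_shards_text.strip().splitlines():
        cols = line.split()
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            hits.append((cols[0], cols[1]))
    return hits


def should_alert(unassigned, first_seen, now, grace_sec=120):
    """Alert on shards that stay UNASSIGNED past the 2-minute grace.

    first_seen maps (index, shard) -> timestamp of first sighting;
    in the real Lambda this lives in a persistent store between runs.
    """
    alerts = []
    for key in unassigned:
        first = first_seen.setdefault(key, now)
        if now - first > grace_sec:
            alerts.append(key)
    return alerts
```

A transient yellow blip clears itself inside the grace window; anything that survives two consecutive checks fires with the exact index and shard in the message.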

4. Circuit Breaker at the Application Layer

The batch job that triggered the cascade — 500 concurrent searches at 11:39 PM — had no throttle. We added a Redis-backed circuit breaker in front of all ElasticSearch calls:

Request comes in
      │
      ▼
┌─────────────────────┐
│  Check circuit      │    State stored in Redis
│  breaker state      │    TTL: 30 seconds
└──────────┬──────────┘
           │
    ┌──────┴──────┐
    │             │
  CLOSED        OPEN
 (normal)    (tripped)
    │             │
    ▼             ▼
 Run query    Return cached
              results or
              503 with
              Retry-After
    │
    ▼
Success? → Reset error count
Failed?  → Increment counter
           If counter > threshold:
           TRIP the breaker
           Notify Slack
  
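The state machine above is small enough to sketch in full. Here an in-memory attribute stands in for the Redis key (in production the 30-second TTL is Redis's, not ours), and the error threshold is illustrative:

```python
import time


class SearchBreaker:
    """Minimal circuit breaker sketch; Redis replaced by local state."""

    def __init__(self, threshold: int = 5, ttl: float = 30.0):
        self.threshold = threshold      # consecutive failures before tripping
        self.ttl = ttl                  # how long the breaker stays OPEN
        self.errors = 0
        self.opened_at: float | None = None   # in Redis: a key with a TTL

    def allow(self) -> bool:
        """CLOSED -> run the query. OPEN -> serve cache or 503 + Retry-After."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.ttl:
                return False            # still OPEN: fail fast
            self.opened_at = None       # TTL expired: let traffic retry
            self.errors = 0
        return True

    def record(self, success: bool) -> None:
        if success:
            self.errors = 0             # any success resets the count
            return
        self.errors += 1
        if self.errors > self.threshold:
            self.opened_at = time.monotonic()   # TRIP (and notify Slack)
```

The batch job now calls `allow()` before every search, so 500 concurrent queries against a struggling node degrade into fast 503s instead of piling onto the heap.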

The Numbers, Six Months Later

Here's what changed after implementing all four of the above:

  • 0 unplanned downtime incidents
  • Earlier warning on heap pressure (alerts start at 80%, not 90%)
  • 7 min average detection-to-alert time
  • 140ms P99 search latency (down from 180ms)

That last one was a surprise. The forced cleanup of bad configurations and the heap monitoring changes led us to also fix a slow query that had been silently degrading performance for months. Sometimes disasters are how you find the things you didn't know to look for.


What I'd Tell Past Me

Four things, in order of importance:

  1. Never run a reindex without verifying shard and replica settings output. Print them. Confirm them. Make a checklist if you have to.
  2. Alert at 80% heap, not 90%. At 90% you're already reacting. At 80% you're preventing.
  3. Every background job that hits ElasticSearch needs a concurrency limit. "This never runs at the same time as traffic" is something that's true until it isn't.
  4. When a node is in distress, don't restart it. Reduce load first. Restart is often the instinct. It's often wrong.

The cluster has been quiet since. Which is exactly how infrastructure should be.

— Darshan
