One kafka-consumer-groups.sh Command Sent $180k in Duplicate Payments
March 21, 2026 · Architecture · 10 min read

It was 2:47 AM when the first Slack alert fired. Then another. Then eleven more in under a minute. Our payments dashboard was showing duplicate charges — real money, real customers, real chaos. Within four minutes I was staring at a terminal trying to figure out how a system that had been running flawlessly for nine months had suddenly decided to reprocess every payment event from the past three hours.


Production Failure

The alert read: "Duplicate payment webhook detected — 847 events in 90 seconds." Our payment processing service consumed events from a Kafka topic called payment.completed, enriched them, and fired webhooks to merchant integrations. Under normal load we saw roughly 40–60 events per minute. Now we were seeing 847 in 90 seconds: about 565 per minute, nearly ten times our peak throughput.

First instinct: the payment gateway was retrying. I checked their status page — all green. I checked our idempotency table — keys were present, so the DB layer was rejecting duplicates at the application level. But the webhooks had already fired. The merchant callbacks had already gone out. Stripe charges had already been attempted.

By 3:05 AM we had counted 2,340 duplicate webhook deliveries, representing $183,200 in double-charged transactions. We killed the consumer service at 3:09 AM — 22 minutes after the first alert.


False Assumptions

Our first hour was wasted chasing the wrong culprits:

  • The payment gateway — clean status page, no retry storms on their end.
  • Our idempotency layer — keys existed in payment_events_processed but the check happened after the webhook was dispatched, not before. A race condition we'd never triggered under normal load.
  • A bad deploy — no deploy had happened in 11 hours. Git blame was clean.
  • The consumer crashing and re-subscribing — the pod logs showed no restarts, no exceptions, no panics. The consumer was healthy and happy, merrily processing events.

We were looking at the wrong layer entirely. The issue wasn't in the consumer code. It wasn't in the application logic. It was in the Kafka offset metadata — a layer most of the team never thought about at 3 AM.


Investigation

At 3:15 AM I pulled the consumer group lag metrics from our Kafka monitoring dashboard. What I saw made my stomach drop.

CONSUMER GROUP: payments-webhook-processor
TOPIC: payment.completed  PARTITIONS: 12

                     BEFORE INCIDENT        AT 3:00 AM
                     ─────────────────────────────────
Partition  0         offset 1,847,204       offset 1,845,901  ← WENT BACK
Partition  1         offset 1,849,012       offset 1,845,644  ← WENT BACK
Partition  2         offset 1,848,731       offset 1,845,902  ← WENT BACK
...
Partition 11         offset 1,847,890       offset 1,846,003  ← WENT BACK

LAG (total):         0 messages             2,341 messages
EARLIEST OFFSET:     1,845,600 (approx)    ← all partitions reset to ~here

Every partition had its offset reset backwards, by roughly 1,300 to 3,400 messages depending on the partition. The consumer group didn't crash. It didn't reconnect. Someone had explicitly reset the offsets.

I searched the audit log for the Kafka admin UI we use internally. Nothing. I checked our runbook Confluence pages for any scheduled maintenance. Nothing. Then I opened the #infra-ops Slack channel and searched for "kafka". There it was — a message from our platform engineer at 2:44 AM:

"Running offset reset on the payments-webhook-processor group to fix the lag on staging. Forgot I was in the prod context 😬"

He'd run the following command intending to target staging:

```bash
kafka-consumer-groups.sh \
  --bootstrap-server kafka-prod.internal:9092 \
  --group payments-webhook-processor \
  --topic payment.completed \
  --reset-offsets \
  --to-earliest \
  --execute
```

--to-earliest reset every partition to the earliest retained offset, rewinding over the last three hours of production traffic. Without --execute the tool only performs a dry run and prints the offsets it would move to; with --execute it applied the change immediately. The consumer picked up exactly where it was told to start and faithfully reprocessed everything.


Root Cause

Two failures compounded each other:

1. Wrong Kafka bootstrap server in the terminal context. Our engineer had been debugging a lag issue on staging earlier that night. He closed his staging terminal session, opened a new one, and copy-pasted the reset command without noticing the bootstrap server URL in his clipboard pointed to kafka-prod.internal instead of kafka-staging.internal. The command ran in under a second with no confirmation prompt.

2. Idempotency checked too late. Our webhook dispatcher had this flow:

BEFORE FIX (broken flow):

  Kafka Event
      │
      ▼
  Enrich payload
      │
      ▼
  Dispatch webhook  ← fires HTTP request HERE
      │
      ▼
  INSERT INTO payment_events_processed (event_id)
  ON CONFLICT DO NOTHING                          ← idempotency check AFTER dispatch
      │
      ▼
  Commit Kafka offset

AFTER FIX (correct flow):

  Kafka Event
      │
      ▼
  SELECT 1 FROM payment_events_processed
  WHERE event_id = ${eventId}
      │
  Already processed? → SKIP → Commit offset
      │
      ▼ (not processed)
  Enrich payload
      │
      ▼
  BEGIN TRANSACTION
    INSERT INTO payment_events_processed
    Dispatch webhook (within tx timeout)
  COMMIT
      │
      ▼
  Commit Kafka offset

The idempotency key was there, but it was written after the side effect, not checked before it. Under normal operation this never mattered because we never replayed events. The offset reset turned our theoretical weakness into a $183k incident.
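A toy, in-memory simulation makes the failure mode concrete (hypothetical names: a Set stands in for payment_events_processed, a counter for the webhook call). Replaying one event through the broken ordering dispatches twice; through the check-first ordering it dispatches once.

```typescript
// Toy model of the two orderings. "processed" stands in for the
// payment_events_processed table; "dispatched" counts webhook side effects.
type PaymentEventLite = { id: string };

function makeProcessor(checkFirst: boolean) {
  const processed = new Set<string>(); // idempotency keys
  let dispatched = 0;                  // externally visible side effects

  return {
    handle(event: PaymentEventLite): void {
      // Fixed ordering: consult the key BEFORE any side effect.
      if (checkFirst && processed.has(event.id)) return;
      dispatched++;            // webhook fires here
      processed.add(event.id); // key recorded (too late in the broken ordering)
    },
    get dispatchCount(): number {
      return dispatched;
    },
  };
}

// An offset reset replays the same event:
const broken = makeProcessor(false);
broken.handle({ id: "evt_1" });
broken.handle({ id: "evt_1" }); // key existed, but nothing checked it: second webhook fires

const fixed = makeProcessor(true);
fixed.handle({ id: "evt_1" });
fixed.handle({ id: "evt_1" }); // skipped before any side effect
```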


The Fix

We worked in parallel on two tracks: immediate remediation and structural hardening.

Immediate (night of incident):

  • Killed all consumer pods at 3:09 AM, stopping the bleeding at 2,340 duplicate events.
  • Manually reset offsets forward to the pre-incident position using the offset coordinates from our monitoring snapshot (we had per-partition offset metrics with 30-second granularity).
  • Restarted consumers at 3:31 AM — 44 minutes of downtime for the webhook service.
  • Worked with the payment gateway to void the 2,340 duplicate charges. All but 17 were successfully voided within 6 hours. The remaining 17 required manual customer refunds.
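Step two deserves a note: kafka-consumer-groups.sh accepts per-partition targets as --topic topic:partition together with --to-offset, so restoring twelve partitions means twelve precise invocations. A small helper along these lines (the snapshot shape is illustrative, not our actual metrics schema) turns a monitoring snapshot into those commands instead of hand-typing offsets at 3 AM:

```typescript
// Build per-partition offset reset commands from a monitoring snapshot.
// topic:partition targeting with --to-offset is supported by the stock tool;
// the snapshot shape here is a hypothetical stand-in.
type OffsetSnapshot = Record<number, number>; // partition -> committed offset

function buildResetCommands(
  bootstrap: string,
  group: string,
  topic: string,
  snapshot: OffsetSnapshot
): string[] {
  return Object.entries(snapshot).map(
    ([partition, offset]) =>
      `kafka-consumer-groups.sh --bootstrap-server ${bootstrap} ` +
      `--group ${group} --topic ${topic}:${partition} ` +
      `--reset-offsets --to-offset ${offset} --execute`
  );
}
```

Reviewing twelve generated commands before pasting them is far safer than editing one long command eleven times.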

Structural (next sprint):

```typescript
// idempotency check BEFORE dispatch
async function processPaymentEvent(event: PaymentEvent): Promise<void> {
  // Check FIRST — before any side effects
  const alreadyProcessed = await db.queryOne(
    'SELECT 1 FROM payment_events_processed WHERE event_id = $1',
    [event.id]
  );

  if (alreadyProcessed) {
    logger.info({ eventId: event.id }, 'Skipping duplicate event');
    return; // Kafka offset will still be committed
  }

  const payload = await enrichPayload(event);

  // Transactional: record + dispatch atomically
  await db.transaction(async (tx) => {
    await tx.execute(
      'INSERT INTO payment_events_processed (event_id, processed_at) VALUES ($1, NOW())',
      [event.id]
    );
    await dispatchWebhook(payload); // throws on failure → tx rolls back
  });
}
```

We also added three operational guards so a production offset reset can no longer happen without a deliberate, multi-step process: a confirmation wrapper around the reset command, a colour-coded production jump host, and pre-execution audit logging. The wrapper:

```bash
#!/usr/bin/env bash
# wrapper script: kafka-reset-offsets.sh
set -euo pipefail

ENVIRONMENT=${1:?Usage: kafka-reset-offsets.sh <environment> <group> <topic> <strategy>}
GROUP=${2:?}
TOPIC=${3:?}
STRATEGY=${4:?}  # e.g. "--to-offset 1847204" or "--to-datetime 2026-03-21T02:00:00.000"

if [[ "$ENVIRONMENT" == "prod" ]]; then
  echo "⚠️  PRODUCTION offset reset requested."
  echo "Group:    $GROUP"
  echo "Topic:    $TOPIC"
  echo "Strategy: $STRATEGY"
  echo ""
  read -r -p "Type 'RESET PRODUCTION $GROUP' to confirm: " CONFIRM
  if [[ "$CONFIRM" != "RESET PRODUCTION $GROUP" ]]; then
    echo "Confirmation failed. Aborting."
    exit 1
  fi
fi

kafka-consumer-groups.sh \
  --bootstrap-server "kafka-${ENVIRONMENT}.internal:9092" \
  --group "$GROUP" \
  --topic "$TOPIC" \
  --reset-offsets $STRATEGY \
  --execute
# $STRATEGY is intentionally unquoted: it may expand to a flag plus its value
```

Lessons Learned

1. Idempotency must be a pre-condition, not a post-record. If your idempotency key is written after the side effect fires, it only prevents duplicate storage — not duplicate action. The check must happen before any externally visible work. This is true for webhooks, emails, charge APIs, anything with observable side effects.

2. Kafka offset resets have no undo. Unlike most database operations, a Kafka offset reset is immediate and committed the moment the command runs. There is no transaction to roll back. If you don't have per-partition offset metrics stored externally, you may not even know where to reset back to. We now ship partition offset snapshots to our metrics store every 30 seconds specifically for this recovery path.
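The snapshotting itself is unglamorous: on a schedule, fetch each group's committed offsets from the Kafka admin API and ship them as gauge metrics. A sketch of the flattening step, assuming a kafkajs-style fetchOffsets response shape (the metric name and tag layout are illustrative):

```typescript
// Flatten an admin fetchOffsets-style response (shape assumed, kafkajs-like)
// into metric points shipped every 30s — the data that lets you reset forward.
type PartitionOffset = { partition: number; offset: string };
type GroupOffsets = { topic: string; partitions: PartitionOffset[] };
type MetricPoint = { name: string; tags: Record<string, string>; value: number };

function offsetMetrics(group: string, topics: GroupOffsets[]): MetricPoint[] {
  return topics.flatMap(({ topic, partitions }) =>
    partitions.map(({ partition, offset }) => ({
      name: "kafka.consumer.committed_offset",
      tags: { group, topic, partition: String(partition) },
      value: Number(offset), // broker reports offsets as strings
    }))
  );
}
```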

3. Multi-environment terminal sessions are accidents waiting to happen. We now enforce that production Kafka access requires going through a dedicated jump host with a red-coloured terminal prompt and a mandatory KAFKA_ENV=prod environment variable that all wrapper scripts validate. Staging uses green. You feel the difference.

4. Operational blast radius should be bounded by design. Our webhook consumer had no rate limiter. When it got 1,200 extra events in its backlog, it processed them at full speed — 847 events in 90 seconds. Had we had a configurable MAX_EVENTS_PER_MINUTE guard, an alert on abnormal throughput spikes could have fired before a single duplicate webhook left our system.
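A sketch of such a guard, with hypothetical names and a fixed one-minute window (a production version would likely want a smoother token bucket and an alert hook on denials):

```typescript
// Minimal fixed-window throughput guard. The limit and pause behaviour
// are illustrative, not our production values.
class ThroughputGuard {
  private windowStart = Number.NEGATIVE_INFINITY; // forces reset on first call
  private count = 0;

  constructor(private readonly maxPerMinute: number) {}

  // Returns true if the event may be processed now, false once the cap is hit.
  tryAcquire(now: number = Date.now()): boolean {
    if (now - this.windowStart >= 60_000) { // new minute: reset the window
      this.windowStart = now;
      this.count = 0;
    }
    if (this.count >= this.maxPerMinute) return false; // cap hit: pause consumption
    this.count++;
    return true;
  }
}

// In the consumer loop: a denied acquire pauses the partition and pages a
// human, instead of blasting 847 webhooks in 90 seconds.
const guard = new ThroughputGuard(120); // ~2× a normal peak of 60/min
```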

5. The audit trail gap cost us an hour. We spent 60 minutes debugging application code when the answer was a one-line Slack message. We now pipe all kafka-consumer-groups.sh invocations through a wrapper that logs to our incident management system with the operator's identity, the command, and a timestamp — before execution.

The $183k was recovered. The 17 manual refunds were embarrassing but manageable. The real cost was the 22 minutes between first alert and containment, and the three weeks it took to rebuild merchant trust for three integrations that had to handle the duplicate charge reversal on their end.

One command. Twenty-two minutes. Two operational antipatterns that had coexisted harmlessly for nine months until a tired engineer fat-fingered a bootstrap server URL at 2:44 AM. Write your idempotency checks first. Guard your offset resets. And for the love of everything, colour-code your production terminals.
