One kafka-consumer-groups.sh Command Sent $180k in Duplicate Payments
It was 2:47 AM when the first Slack alert fired. Then another. Then eleven more in under a minute. Our payments dashboard was showing duplicate charges — real money, real customers, real chaos. Within four minutes I was staring at a terminal trying to figure out how a system that had been running flawlessly for nine months had suddenly decided to reprocess every payment event from the past three hours.
Production Failure
The alert read: "Duplicate payment webhook detected — 847 events in 90 seconds."
Our payment processing service consumed events from a Kafka topic called payment.completed,
enriched them, and fired webhooks to merchant integrations. Under normal load we saw roughly
40–60 events per minute. Now we were seeing 847 in 90 seconds — about 565 per minute, nine to fourteen times our normal rate.
First instinct: the payment gateway was retrying. I checked their status page — all green. I checked our idempotency table — keys were present, so the DB layer was rejecting duplicates at the application level. But the webhooks had already fired. The merchant callbacks had already gone out. Stripe charges had already been attempted.
By 3:05 AM we had counted 2,340 duplicate webhook deliveries, representing $183,200 in double-charged transactions. We killed the consumer service at 3:09 AM — 22 minutes after the first alert.
False Assumptions
Our first hour was wasted chasing the wrong culprits:
- The payment gateway — clean status page, no retry storms on their end.
- Our idempotency layer — keys existed in payment_events_processed, but the check happened after the webhook was dispatched, not before. A race condition we'd never triggered under normal load.
- A bad deploy — no deploy had happened in 11 hours. Git blame was clean.
- The consumer crashing and re-subscribing — the pod logs showed no restarts, no exceptions, no panics. The consumer was healthy and happy, merrily processing events.
We were looking at the wrong layer entirely. The issue wasn't in the consumer code. It wasn't in the application logic. It was in the Kafka offset metadata — a layer most of the team never thought about at 3 AM.
Investigation
At 3:15 AM I pulled the consumer group lag metrics from our Kafka monitoring dashboard. What I saw made my stomach drop.
CONSUMER GROUP: payments-webhook-processor
TOPIC: payment.completed    PARTITIONS: 12

                 BEFORE INCIDENT    AT 3:00 AM
──────────────────────────────────────────────────────────
Partition 0      1,847,204          1,845,901  ← WENT BACK
Partition 1      1,849,012          1,845,644  ← WENT BACK
Partition 2      1,848,731          1,845,902  ← WENT BACK
...
Partition 11     1,847,890          1,846,003  ← WENT BACK
LAG (total)      0 messages         2,341 messages

EARLIEST OFFSET: 1,845,600 (approx)  ← all partitions reset to ~here
Every partition had its offset reset backwards to roughly 1,845,600 — between about 1,300 and 3,400 messages per partition. The consumer group didn't crash. It didn't reconnect. Someone had explicitly reset the offsets.
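The per-partition picture above came from our dashboard, but the stock CLI will show the same thing. A read-only sketch — `describe_group` is a hypothetical helper, and the invocation is guarded so it's a no-op on machines without the Kafka tools on PATH:

```shell
# Read-only: --describe prints CURRENT-OFFSET, LOG-END-OFFSET and LAG per partition.
describe_group() {
  kafka-consumer-groups.sh \
    --bootstrap-server "$1" \
    --group "$2" \
    --describe
}

# Guarded so this sketch is safe to paste anywhere.
if command -v kafka-consumer-groups.sh >/dev/null 2>&1; then
  describe_group kafka-prod.internal:9092 payments-webhook-processor
fi
```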
I searched the audit log for the Kafka admin UI we use internally. Nothing. I checked our
runbook Confluence pages for any scheduled maintenance. Nothing. Then I opened the
#infra-ops Slack channel and searched for "kafka". There it was — a message
from our platform engineer at 2:44 AM:
"Running offset reset on the payments-webhook-processor group to fix the lag on staging. Forgot I was in the prod context 😬"
He'd run the following command intending to target staging:
kafka-consumer-groups.sh \
  --bootstrap-server kafka-prod.internal:9092 \
  --group payments-webhook-processor \
  --topic payment.completed \
  --reset-offsets --to-earliest \
  --execute
--to-earliest reset every partition to the earliest retained offset, covering the last three hours of production traffic. The consumer picked up exactly where it was told to start and faithfully reprocessed everything. The cruel irony: kafka-consumer-groups.sh defaults to a dry run — omit --execute and it only prints the offsets it would set — but the pasted command carried --execute along with it.
Root Cause
Two failures compounded each other:
1. Wrong Kafka bootstrap server in the terminal context. Our engineer had
been debugging a lag issue on staging earlier that night. He closed his staging terminal
session, opened a new one, and copy-pasted the reset command without noticing the bootstrap
server URL in his clipboard pointed to kafka-prod.internal instead of
kafka-staging.internal. The command ran in under a second with no confirmation
prompt.
2. Idempotency checked too late. Our webhook dispatcher had this flow:
BEFORE FIX (broken flow):
Kafka Event
│
▼
Enrich payload
│
▼
Dispatch webhook ← fires HTTP request HERE
│
▼
INSERT INTO payment_events_processed (event_id)
ON CONFLICT DO NOTHING ← idempotency check AFTER dispatch
│
▼
Commit Kafka offset
AFTER FIX (correct flow):
Kafka Event
│
▼
SELECT 1 FROM payment_events_processed
WHERE event_id = ${eventId}
│
Already processed? → SKIP → Commit offset
│
▼ (not processed)
Enrich payload
│
▼
BEGIN TRANSACTION
INSERT INTO payment_events_processed
Dispatch webhook (within tx timeout)
COMMIT
│
▼
Commit Kafka offset
The idempotency key was there, but it was written after the side effect, not checked before it. Under normal operation this never mattered because we never replayed events. The offset reset turned our theoretical weakness into a $183k incident.
The Fix
We worked in parallel on two tracks: immediate remediation and structural hardening.
Immediate (night of incident):
- Killed all consumer pods at 3:09 AM, stopping the bleeding at 2,340 duplicate events.
- Manually reset offsets forward to the pre-incident position using the offset coordinates from our monitoring snapshot (we had per-partition offset metrics with 30-second granularity).
- Restarted consumers at 3:31 AM — 44 minutes of downtime for the webhook service.
- Worked with the payment gateway to void the 2,340 duplicate charges. All but 17 were voided within 6 hours; the remaining 17 required manual customer refunds.
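The manual forward reset can be sketched with the CLI's --from-file option, which takes a CSV of topic,partition,offset — useful when every partition needs a different target, as ours did. The offsets below are the pre-incident values from the table earlier; the file path is illustrative, and the actual reset is guarded so the sketch is harmless where the CLI is absent:

```shell
# Restore each partition to its pre-incident offset from the monitoring snapshot.
# CSV format expected by --from-file: topic,partition,offset (one line per partition).
cat > /tmp/restore-offsets.csv <<'EOF'
payment.completed,0,1847204
payment.completed,1,1849012
payment.completed,2,1848731
payment.completed,11,1847890
EOF

# Guarded: only runs where the Kafka CLI is actually installed.
if command -v kafka-consumer-groups.sh >/dev/null 2>&1; then
  kafka-consumer-groups.sh \
    --bootstrap-server kafka-prod.internal:9092 \
    --group payments-webhook-processor \
    --reset-offsets --from-file /tmp/restore-offsets.csv \
    --execute
fi
```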
Structural (next sprint):
async function processPaymentEvent(event: PaymentEvent): Promise<void> {
  // Check FIRST — before any side effects
  const alreadyProcessed = await db.queryOne(
    'SELECT 1 FROM payment_events_processed WHERE event_id = $1',
    [event.id]
  );
  if (alreadyProcessed) {
    logger.info({ eventId: event.id }, 'Skipping duplicate event');
    return; // Kafka offset will still be committed
  }

  const payload = await enrichPayload(event);

  // Transactional: record + dispatch atomically
  await db.transaction(async (tx) => {
    await tx.execute(
      'INSERT INTO payment_events_processed (event_id, processed_at) VALUES ($1, NOW())',
      [event.id]
    );
    await dispatchWebhook(payload); // throws on failure → tx rolls back
  });
}
We also added three operational guards to prevent the offset reset from being possible again without a deliberate multi-step process:
#!/usr/bin/env bash
set -euo pipefail
ENVIRONMENT=${1:?Usage: kafka-reset-offsets.sh <env> <group> <topic> <strategy>}
GROUP=${2:?}
TOPIC=${3:?}
STRATEGY=${4:?} # e.g. --to-offset 1847204 or --to-datetime 2026-03-21T02:00:00.000
if [[ "$ENVIRONMENT" == "prod" ]]; then
echo "⚠️ PRODUCTION offset reset requested."
echo "Group: $GROUP"
echo "Topic: $TOPIC"
echo "Strategy: $STRATEGY"
echo ""
read -p "Type 'RESET PRODUCTION $GROUP' to confirm: " CONFIRM
if [[ "$CONFIRM" != "RESET PRODUCTION $GROUP" ]]; then
echo "Confirmation failed. Aborting."
exit 1
fi
fi
# $STRATEGY is intentionally unquoted so multi-word strategies
# (e.g. "--to-offset 1847204") split into separate arguments.
kafka-consumer-groups.sh \
  --bootstrap-server "kafka-${ENVIRONMENT}.internal:9092" \
  --group "$GROUP" \
  --topic "$TOPIC" \
  --reset-offsets $STRATEGY \
  --execute
Lessons Learned
1. Idempotency must be a pre-condition, not a post-record. If your idempotency key is written after the side effect fires, it only prevents duplicate storage — not duplicate action. The check must happen before any externally visible work. This is true for webhooks, emails, charge APIs, anything with observable side effects.
2. Kafka offset resets have no undo. Unlike most database operations, a Kafka offset reset is immediate and committed the moment the command runs. There is no transaction to roll back. If you don't have per-partition offset metrics stored externally, you may not even know where to reset back to. We now ship partition offset snapshots to our metrics store every 30 seconds specifically for this recovery path.
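The snapshotting itself is mostly plumbing. A minimal sketch of the formatting step, assuming offsets arrive in the shape kafkajs's admin.fetchOffsets() returns (an array of { topic, partitions: [{ partition, offset }] }); the metric line format and the names TopicOffsets / offsetSnapshotLines are ours for illustration, not a standard:

```typescript
interface TopicOffsets {
  topic: string;
  partitions: { partition: number; offset: string }[];
}

// Turn one fetchOffsets()-style response into line-protocol datapoints.
// Shipped every 30 seconds, these answer "where was partition N at 2:44 AM?"
function offsetSnapshotLines(group: string, topics: TopicOffsets[], tsMs: number): string[] {
  const lines: string[] = [];
  for (const t of topics) {
    for (const p of t.partitions) {
      lines.push(
        `kafka_group_offset,group=${group},topic=${t.topic},partition=${p.partition} value=${p.offset} ${tsMs}`
      );
    }
  }
  return lines;
}
```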
3. Multi-environment terminal sessions are accidents waiting to happen.
We now enforce that production Kafka access requires going through a dedicated jump host with
a red-coloured terminal prompt and a mandatory KAFKA_ENV=prod environment
variable that all wrapper scripts validate. Staging uses green. You feel the difference.
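A sketch of the prompt side of that guard — the escape codes and labels are illustrative, and kafka_prompt is a hypothetical helper; KAFKA_ENV is the variable our wrappers validate:

```shell
# Pick a prompt prefix by environment: white-on-red for prod, black-on-green for staging.
kafka_prompt() {
  case "${KAFKA_ENV:-}" in
    prod)    printf '%s' '\[\e[41;97m\][PROD KAFKA]\[\e[0m\] ' ;;
    staging) printf '%s' '\[\e[42;30m\][staging]\[\e[0m\] ' ;;
    *)       printf '%s' '[KAFKA_ENV unset] ' ;;
  esac
}
PS1="$(kafka_prompt)\u@\h:\w\$ "
```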
4. Operational blast radius should be bounded by design.
Our webhook consumer had no rate limiter. When it got 1,200 extra events in its backlog, it
processed them at full speed — 847 events in 90 seconds. Had we had a configurable
MAX_EVENTS_PER_MINUTE guard, an alert on abnormal throughput spikes could have
fired before a single duplicate webhook left our system.
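A minimal sketch of such a guard — a fixed-window counter; EventRateGuard and the breach callback are hypothetical names, not our production code:

```typescript
// Fixed-window rate guard: counts events per 60s window, fires a breach
// callback once per window when the cap is exceeded.
class EventRateGuard {
  private windowStart: number;
  private count = 0;
  private readonly maxPerMinute: number;
  private readonly onBreach: (count: number) => void;

  constructor(maxPerMinute: number, onBreach: (count: number) => void, now: number = Date.now()) {
    this.maxPerMinute = maxPerMinute;
    this.onBreach = onBreach;
    this.windowStart = now;
  }

  // Returns true if the event may proceed; false (plus one alert) once the cap trips.
  tryAcquire(now: number = Date.now()): boolean {
    if (now - this.windowStart >= 60_000) {
      this.windowStart = now; // new window
      this.count = 0;
    }
    this.count++;
    if (this.count > this.maxPerMinute) {
      if (this.count === this.maxPerMinute + 1) this.onBreach(this.count);
      return false; // caller should pause the consumer, not drop the event
    }
    return true;
  }
}
```

On a false return the consumer should pause the partition rather than drop the event, so a replay backs up safely in Kafka instead of flooding merchants.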
5. The audit trail gap cost us an hour.
We spent 60 minutes debugging application code when the answer was a one-line Slack message.
We now pipe all kafka-consumer-groups.sh invocations through a wrapper that
logs to our incident management system with the operator's identity, the command, and a
timestamp — before execution.
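The logging side can be as small as a wrapper that records operator, command, and timestamp before running the real tool. A sketch — log_and_run and the AUDIT_LOG path are illustrative, and shipping the line to the incident system is left out:

```shell
# Append an audit line BEFORE running the wrapped command, then run it unchanged.
AUDIT_LOG="${AUDIT_LOG:-/tmp/kafka-admin-audit.log}"

log_and_run() {
  printf '%s %s ran: %s\n' "$(date -u +%FT%TZ)" "$(whoami)" "$*" >> "$AUDIT_LOG"
  "$@"   # run the real command with its original arguments
}

# Example: log_and_run kafka-consumer-groups.sh --describe --group payments-webhook-processor
```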
The $183k was recovered. The 17 manual refunds were embarrassing but manageable. The real cost was the 22 minutes between first alert and containment, and the three weeks it took to rebuild merchant trust for three integrations that had to handle the duplicate charge reversal on their end.
One command. Twenty-two minutes. Two operational antipatterns that had coexisted harmlessly for nine months until a tired engineer fat-fingered a bootstrap server URL at 2:44 AM. Write your idempotency checks first. Guard your offset resets. And for the love of everything, colour-code your production terminals.