One kafka-consumer-groups.sh Command Sent $180k in Duplicate Payments
It was 2:47 AM when the first Slack alert fired. Then another. Then eleven more in under a minute. Our payments dashboard was showing duplicate charges — real money, real customers, real chaos. Within four minutes I was staring at a terminal trying to figure out how a system that had been running flawlessly for nine months had suddenly decided to reprocess every payment event from the past three hours.
Production Failure
The alert read: "Duplicate payment webhook detected — 847 events in 90 seconds."
Our payment processing service consumed events from a Kafka topic called payment.completed,
enriched them, and fired webhooks to merchant integrations. Under normal load we saw roughly
40–60 events per minute. Now we were seeing 847 in 90 seconds — about 565 per minute, nine to fourteen times our normal rate.
First instinct: the payment gateway was retrying. I checked their status page — all green. I checked our idempotency table — keys were present, so the DB layer was rejecting duplicates at the application level. But the webhooks had already fired. The merchant callbacks had already gone out. Stripe charges had already been attempted.
By 3:05 AM we had counted 2,340 duplicate webhook deliveries, representing $183,200 in double-charged transactions. We killed the consumer service at 3:09 AM — 22 minutes after the first alert.
False Assumptions
Our first hour was wasted chasing the wrong culprits:
- The payment gateway — clean status page, no retry storms on their end.
- Our idempotency layer — keys existed in payment_events_processed, but the check happened after the webhook was dispatched, not before. A race condition we'd never triggered under normal load.
- A bad deploy — no deploy had happened in 11 hours. Git blame was clean.
- The consumer crashing and re-subscribing — the pod logs showed no restarts, no exceptions, no panics. The consumer was healthy and happy, merrily processing events.
We were looking at the wrong layer entirely. The issue wasn't in the consumer code. It wasn't in the application logic. It was in the Kafka offset metadata — a layer most of the team never thought about at 3 AM.
Investigation
At 3:15 AM I pulled the consumer group lag metrics from our Kafka monitoring dashboard. What I saw made my stomach drop.
CONSUMER GROUP: payments-webhook-processor
TOPIC: payment.completed    PARTITIONS: 12

                 BEFORE INCIDENT    AT 3:00 AM
──────────────────────────────────────────────────────────
Partition 0      1,847,204          1,845,901  ← WENT BACK
Partition 1      1,849,012          1,845,644  ← WENT BACK
Partition 2      1,848,731          1,845,902  ← WENT BACK
...
Partition 11     1,847,890          1,846,003  ← WENT BACK
LAG (total)      0 messages         2,341 messages

EARLIEST OFFSET: 1,845,600 (approx)  ← all partitions reset to ~here
Every partition had its offset reset backwards to roughly 1,845,600 — between about 1,300 and 3,400 messages per partition. The consumer group didn't crash. It didn't reconnect. Someone had explicitly reset the offsets.
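The per-partition picture above came from our dashboard, but the stock CLI will show the same thing. A read-only sketch — `describe_group` is a hypothetical helper, and the invocation is guarded so it's a no-op on machines without the Kafka tools on PATH:

```shell
# Read-only: --describe prints CURRENT-OFFSET, LOG-END-OFFSET and LAG per partition.
describe_group() {
  kafka-consumer-groups.sh \
    --bootstrap-server "$1" \
    --group "$2" \
    --describe
}

# Guarded so this sketch is safe to paste anywhere.
if command -v kafka-consumer-groups.sh >/dev/null 2>&1; then
  describe_group kafka-prod.internal:9092 payments-webhook-processor
fi
```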
I searched the audit log for the Kafka admin UI we use internally. Nothing. I checked our
runbook Confluence pages for any scheduled maintenance. Nothing. Then I opened the
#infra-ops Slack channel and searched for "kafka". There it was — a message
from our platform engineer at 2:44 AM:
"Running offset reset on the payments-webhook-processor group to fix the lag on staging. Forgot I was in the prod context 😬"
He'd run the following command intending to target staging:
kafka-consumer-groups.sh \
  --bootstrap-server kafka-prod.internal:9092 \
  --group payments-webhook-processor \
  --topic payment.completed \
  --reset-offsets --to-earliest \
  --execute
--to-earliest reset every partition to the earliest retained offset, covering the last three hours of production traffic. The consumer picked up exactly where it was told to start and faithfully reprocessed everything. The cruel irony: kafka-consumer-groups.sh defaults to a dry run — omit --execute and it only prints the offsets it would set — but the pasted command carried --execute along with it.
Root Cause
Two failures compounded each other:
1. Wrong Kafka bootstrap server in the terminal context. Our engineer had
been debugging a lag issue on staging earlier that night. He closed his staging terminal
session, opened a new one, and copy-pasted the reset command without noticing the bootstrap
server URL in his clipboard pointed to kafka-prod.internal instead of
kafka-staging.internal. The command ran in under a second with no confirmation
prompt.
2. Idempotency checked too late. Our webhook dispatcher had this flow:
BEFORE FIX (broken flow):
Kafka Event
│
▼
Enrich payload
│
▼
Dispatch webhook ← fires HTTP request HERE
│
▼
INSERT INTO payment_events_processed (event_id)
ON CONFLICT DO NOTHING ← idempotency check AFTER dispatch
│
▼
Commit Kafka offset
AFTER FIX (correct flow):
Kafka Event
│
▼
SELECT 1 FROM payment_events_processed
WHERE event_id = ${eventId}
│
Already processed? → SKIP → Commit offset
│
▼ (not processed)
Enrich payload
│
▼
BEGIN TRANSACTION
INSERT INTO payment_events_processed
Dispatch webhook (within tx timeout)
COMMIT
│
▼
Commit Kafka offset
The idempotency key was there, but it was written after the side effect, not checked before it. Under normal operation this never mattered because we never replayed events. The offset reset turned our theoretical weakness into a $183k incident.
The Fix
We worked in parallel on two tracks: immediate remediation and structural hardening.
Immediate (night of incident):
- Killed all consumer pods at 3:09 AM, stopping the bleeding at 2,340 duplicate events.
- Manually reset offsets forward to the pre-incident position using the offset coordinates from our monitoring snapshot (we had per-partition offset metrics with 30-second granularity).
- Restarted consumers at 3:31 AM — 44 minutes of downtime for the webhook service.
- Worked with the payment gateway to void the 2,340 duplicate charges. All but 17 were voided within 6 hours; the remaining 17 required manual customer refunds.
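The manual forward reset can be sketched with the CLI's --from-file option, which takes a CSV of topic,partition,offset — useful when every partition needs a different target, as ours did. The offsets below are the pre-incident values from the table earlier; the file path is illustrative, and the actual reset is guarded so the sketch is harmless where the CLI is absent:

```shell
# Restore each partition to its pre-incident offset from the monitoring snapshot.
# CSV format expected by --from-file: topic,partition,offset (one line per partition).
cat > /tmp/restore-offsets.csv <<'EOF'
payment.completed,0,1847204
payment.completed,1,1849012
payment.completed,2,1848731
payment.completed,11,1847890
EOF

# Guarded: only runs where the Kafka CLI is actually installed.
if command -v kafka-consumer-groups.sh >/dev/null 2>&1; then
  kafka-consumer-groups.sh \
    --bootstrap-server kafka-prod.internal:9092 \
    --group payments-webhook-processor \
    --reset-offsets --from-file /tmp/restore-offsets.csv \
    --execute
fi
```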
Structural (next sprint):
async function processPaymentEvent(event: PaymentEvent): Promise<void> {
  // Check FIRST — before any side effects
  const alreadyProcessed = await db.queryOne(
    'SELECT 1 FROM payment_events_processed WHERE event_id = $1',
    [event.id]
  );
  if (alreadyProcessed) {
    logger.info({ eventId: event.id }, 'Skipping duplicate event');
    return; // Kafka offset will still be committed
  }

  const payload = await enrichPayload(event);

  // Transactional: record + dispatch atomically
  await db.transaction(async (tx) => {
    await tx.execute(
      'INSERT INTO payment_events_processed (event_id, processed_at) VALUES ($1, NOW())',
      [event.id]
    );
    await dispatchWebhook(payload); // throws on failure → tx rolls back
  });
}
We also added three operational guards to prevent the offset reset from being possible again without a deliberate multi-step process:
#!/usr/bin/env bash
set -euo pipefail
ENVIRONMENT=${1:?Usage: kafka-reset-offsets.sh <env> <group> <topic> <strategy>}
GROUP=${2:?}
TOPIC=${3:?}
STRATEGY=${4:?} # e.g. --to-offset 1847204 or --to-datetime 2026-03-21T02:00:00.000
if [[ "$ENVIRONMENT" == "prod" ]]; then
echo "⚠️ PRODUCTION offset reset requested."
echo "Group: $GROUP"
echo "Topic: $TOPIC"
echo "Strategy: $STRATEGY"
echo ""
read -p "Type 'RESET PRODUCTION $GROUP' to confirm: " CONFIRM
if [[ "$CONFIRM" != "RESET PRODUCTION $GROUP" ]]; then
echo "Confirmation failed. Aborting."
exit 1
fi
fi
# $STRATEGY is intentionally unquoted so multi-word strategies
# (e.g. "--to-offset 1847204") split into separate arguments.
kafka-consumer-groups.sh \
  --bootstrap-server "kafka-${ENVIRONMENT}.internal:9092" \
  --group "$GROUP" \
  --topic "$TOPIC" \
  --reset-offsets $STRATEGY \
  --execute
Lessons Learned
1. Idempotency must be a pre-condition, not a post-record. If your idempotency key is written after the side effect fires, it only prevents duplicate storage — not duplicate action. The check must happen before any externally visible work. This is true for webhooks, emails, charge APIs, anything with observable side effects.
2. Kafka offset resets have no undo. Unlike most database operations, a Kafka offset reset is immediate and committed the moment the command runs. There is no transaction to roll back. If you don't have per-partition offset metrics stored externally, you may not even know where to reset back to. We now ship partition offset snapshots to our metrics store every 30 seconds specifically for this recovery path.
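The snapshotting itself is mostly plumbing. A minimal sketch of the formatting step, assuming offsets arrive in the shape kafkajs's admin.fetchOffsets() returns (an array of { topic, partitions: [{ partition, offset }] }); the metric line format and the names TopicOffsets / offsetSnapshotLines are ours for illustration, not a standard:

```typescript
interface TopicOffsets {
  topic: string;
  partitions: { partition: number; offset: string }[];
}

// Turn one fetchOffsets()-style response into line-protocol datapoints.
// Shipped every 30 seconds, these answer "where was partition N at 2:44 AM?"
function offsetSnapshotLines(group: string, topics: TopicOffsets[], tsMs: number): string[] {
  const lines: string[] = [];
  for (const t of topics) {
    for (const p of t.partitions) {
      lines.push(
        `kafka_group_offset,group=${group},topic=${t.topic},partition=${p.partition} value=${p.offset} ${tsMs}`
      );
    }
  }
  return lines;
}
```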
3. Multi-environment terminal sessions are accidents waiting to happen.
We now enforce that production Kafka access requires going through a dedicated jump host with
a red-coloured terminal prompt and a mandatory KAFKA_ENV=prod environment
variable that all wrapper scripts validate. Staging uses green. You feel the difference.
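A sketch of the prompt side of that guard — the escape codes and labels are illustrative, and kafka_prompt is a hypothetical helper; KAFKA_ENV is the variable our wrappers validate:

```shell
# Pick a prompt prefix by environment: white-on-red for prod, black-on-green for staging.
kafka_prompt() {
  case "${KAFKA_ENV:-}" in
    prod)    printf '%s' '\[\e[41;97m\][PROD KAFKA]\[\e[0m\] ' ;;
    staging) printf '%s' '\[\e[42;30m\][staging]\[\e[0m\] ' ;;
    *)       printf '%s' '[KAFKA_ENV unset] ' ;;
  esac
}
PS1="$(kafka_prompt)\u@\h:\w\$ "
```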
4. Operational blast radius should be bounded by design.
Our webhook consumer had no rate limiter. When it got 1,200 extra events in its backlog, it
processed them at full speed — 847 events in 90 seconds. Had we had a configurable
MAX_EVENTS_PER_MINUTE guard, an alert on abnormal throughput spikes could have
fired before a single duplicate webhook left our system.
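A minimal sketch of such a guard — a fixed-window counter; EventRateGuard and the breach callback are hypothetical names, not our production code:

```typescript
// Fixed-window rate guard: counts events per 60s window, fires a breach
// callback once per window when the cap is exceeded.
class EventRateGuard {
  private windowStart: number;
  private count = 0;
  private readonly maxPerMinute: number;
  private readonly onBreach: (count: number) => void;

  constructor(maxPerMinute: number, onBreach: (count: number) => void, now: number = Date.now()) {
    this.maxPerMinute = maxPerMinute;
    this.onBreach = onBreach;
    this.windowStart = now;
  }

  // Returns true if the event may proceed; false (plus one alert) once the cap trips.
  tryAcquire(now: number = Date.now()): boolean {
    if (now - this.windowStart >= 60_000) {
      this.windowStart = now; // new window
      this.count = 0;
    }
    this.count++;
    if (this.count > this.maxPerMinute) {
      if (this.count === this.maxPerMinute + 1) this.onBreach(this.count);
      return false; // caller should pause the consumer, not drop the event
    }
    return true;
  }
}
```

On a false return the consumer should pause the partition rather than drop the event, so a replay backs up safely in Kafka instead of flooding merchants.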
5. The audit trail gap cost us an hour.
We spent 60 minutes debugging application code when the answer was a one-line Slack message.
We now pipe all kafka-consumer-groups.sh invocations through a wrapper that
logs to our incident management system with the operator's identity, the command, and a
timestamp — before execution.
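The logging side can be as small as a wrapper that records operator, command, and timestamp before running the real tool. A sketch — log_and_run and the AUDIT_LOG path are illustrative, and shipping the line to the incident system is left out:

```shell
# Append an audit line BEFORE running the wrapped command, then run it unchanged.
AUDIT_LOG="${AUDIT_LOG:-/tmp/kafka-admin-audit.log}"

log_and_run() {
  printf '%s %s ran: %s\n' "$(date -u +%FT%TZ)" "$(whoami)" "$*" >> "$AUDIT_LOG"
  "$@"   # run the real command with its original arguments
}

# Example: log_and_run kafka-consumer-groups.sh --describe --group payments-webhook-processor
```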
The $183k was recovered. The 17 manual refunds were embarrassing but manageable. The real cost was the 22 minutes between first alert and containment, and the three weeks it took to rebuild merchant trust for three integrations that had to handle the duplicate charge reversal on their end.
One command. Twenty-two minutes. Two operational antipatterns that had coexisted harmlessly for nine months until a tired engineer fat-fingered a bootstrap server URL at 2:44 AM. Write your idempotency checks first. Guard your offset resets. And for the love of everything, colour-code your production terminals.