How a Missing Idempotency Key Charged 12,000 Users Twice in 4 Minutes
At 11:47 PM on a Friday, our on-call Slack channel lit up with a single message: "Users are getting charged twice." Within 4 minutes, 12,000 customers had been billed a second time. $47,000 in duplicate charges. Support tickets were flooding in before we could even open our dashboards.
Production Failure
We ran a subscription-based SaaS product with a mobile app (iOS and Android) handling in-app purchases and plan upgrades. The payment flow hit our FastAPI backend, which created a Stripe payment intent and confirmed the charge.
That Friday night, a brief AWS us-east-1 latency spike hit around 11:45 PM. Nothing catastrophic — just 8 seconds of elevated response times on our payment service. But those 8 seconds were enough to trigger our mobile client's retry logic, and what followed was a textbook API design failure we'd been one bad latency spike away from for two years.
Refunds were issued within 2 hours. The financial damage was recoverable. The trust damage took longer.
False Assumptions
The first theory was Stripe webhook replay. Our webhook processor handles payment_intent.succeeded events — maybe a duplicate event slipped through? We pulled the Stripe dashboard and ruled it out immediately: each affected user had two distinct payment intent IDs created seconds apart. Stripe wasn't replaying anything. We were creating the intents twice.
The second theory was a frontend double-submit bug. Maybe the mobile client was submitting the payment form twice on a rapid button tap? We checked the mobile release deployed that week — no changes to the payment flow, no button debouncing regressions. And the request timestamps didn't match a double-tap pattern. They were 15–30 seconds apart, every single time.
That gap — 15 to 30 seconds — was the clue we missed for 40 minutes.
Profiling the Request Trail
We pulled API gateway logs for POST /api/v1/payments/charge between 11:45 and 11:52 PM. The pattern was unmistakable:
API Gateway Log — POST /api/v1/payments/charge (11:46–11:49 PM)
user_id=83741 11:46:03.211 200 OK (22.4s response time)
user_id=83741 11:46:18.744 200 OK ( 8.1s response time) ← RETRY
user_id=91052 11:46:04.889 200 OK (19.8s response time)
user_id=91052 11:46:19.901 200 OK ( 7.2s response time) ← RETRY
user_id=77130 11:46:05.441 200 OK (21.1s response time)
user_id=77130 11:46:20.503 200 OK ( 6.4s response time) ← RETRY
Pattern: first request exceeded client timeout (15s threshold)
server still processed and charged — response never arrived
second request succeeded — creating a SECOND Stripe charge
The mobile client had a hardcoded 15-second network timeout. Under normal conditions our payment endpoint responded in 3–6 seconds. During the AWS latency spike, response times ballooned to 18–24 seconds — just enough to cross the timeout threshold for thousands of concurrent users.
The mobile HTTP library had automatic retry-on-timeout enabled by default, with a single retry attempt. The first request actually succeeded on the server (Stripe processed the charge), but the client abandoned the connection before the response arrived. The client retried. The server processed a second, independent charge with no knowledge of the first. Both charges succeeded, and the user was billed twice.
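The retry signature in the gateway log excerpt can also be picked out programmatically. The following is a minimal sketch, assuming log lines shaped exactly like the excerpt; `find_retry_pairs`, `LINE_RE`, and the 15 to 30 second window parameters are illustrative names, not part of our actual tooling.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical pattern for lines shaped like the gateway excerpt:
#   user_id=83741 11:46:03.211 200 OK (22.4s response time)
LINE_RE = re.compile(
    r"user_id=(?P<user>\d+)\s+(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3})\s+(?P<status>\d{3})"
)


def find_retry_pairs(lines, min_gap_s=15.0, max_gap_s=30.0):
    """Return (user_id, gap_seconds) for consecutive successful requests
    from the same user whose timestamps are min_gap_s..max_gap_s apart.
    Timestamps carry no date, so a window crossing midnight is not handled."""
    by_user = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match and match["status"] == "200":
            stamp = datetime.strptime(match["ts"], "%H:%M:%S.%f")
            by_user[match["user"]].append(stamp)

    pairs = []
    for user, stamps in by_user.items():
        stamps.sort()
        for first, second in zip(stamps, stamps[1:]):
            gap = (second - first).total_seconds()
            if min_gap_s <= gap <= max_gap_s:
                pairs.append((user, round(gap, 3)))
    return pairs


# The six lines from the excerpt, annotations removed.
SAMPLE_LINES = [
    "user_id=83741 11:46:03.211 200 OK (22.4s response time)",
    "user_id=83741 11:46:18.744 200 OK ( 8.1s response time)",
    "user_id=91052 11:46:04.889 200 OK (19.8s response time)",
    "user_id=91052 11:46:19.901 200 OK ( 7.2s response time)",
    "user_id=77130 11:46:05.441 200 OK (21.1s response time)",
    "user_id=77130 11:46:20.503 200 OK ( 6.4s response time)",
]

if __name__ == "__main__":
    for user, gap in find_retry_pairs(SAMPLE_LINES):
        print(f"user {user}: second charge request {gap}s after the first")
```

Every pair in the sample sits just past the 15-second mark, which is what points at a client timeout rather than a double tap.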
Without Idempotency Key — The Double-Charge Flow
Mobile Client          API Server             Stripe
|                           |                    |
|--- POST /charge --------->|                    |
|                           |--- CreateIntent -->|
| [15s timeout fires]       | [processing...]    |
| [client abandons] ✗       |                    |
|                           |<-- Intent OK ------|
|                           | [response lost]    |
|                           |                    |
|--- POST /charge (retry) ->|                    |
|                           |--- CreateIntent -->|  ← 2nd charge!
|                           |<-- Intent OK ------|
|<-- 200 OK ----------------|                    |
|                           |                    |
User sees 1 charge. Server processed 2. Stripe billed twice.
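The flow above can be reproduced in a few lines of asyncio. This is a minimal simulation under stated assumptions, not our production code: the timings are scaled down (0.2 s stands in for the 15-second client timeout, 0.4 s for the degraded server), and `simulate`, `charge_endpoint`, and `send` are hypothetical names.

```python
import asyncio


async def simulate(client_timeout_s=0.2, slow_latency_s=0.4, fast_latency_s=0.01):
    """Return the list of charges created for ONE user payment intent."""
    charges = []  # each entry stands for one charge created at the gateway

    async def charge_endpoint(latency_s):
        # Non-idempotent handler: every call creates a fresh charge.
        await asyncio.sleep(latency_s)
        charges.append(f"charge_{len(charges) + 1}")
        return charges[-1]

    async def send(latency_s):
        # The server task is independent of the client's patience: when the
        # client stops waiting, the task still runs to completion.
        task = asyncio.create_task(charge_endpoint(latency_s))
        done, _pending = await asyncio.wait({task}, timeout=client_timeout_s)
        return task, (task in done)

    first, answered = await send(slow_latency_s)  # request hits the latency spike
    if not answered:
        retry, _ = await send(fast_latency_s)     # automatic single retry
        await retry
    await first                                   # abandoned request still completes
    return charges


if __name__ == "__main__":
    print(asyncio.run(simulate()))
```

With the default arguments the client gives up before the slow request finishes, so two charges are recorded. Raising `client_timeout_s` above `slow_latency_s` removes the retry and leaves a single charge.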
Root Cause
The root cause was a non-idempotent POST endpoint on a state-mutating financial operation. Our /payments/charge handler had no mechanism to detect or reject duplicate requests. Every call created a fresh Stripe payment intent regardless of whether an identical request had been processed moments earlier.
Three compounding factors aligned that night:
- Timeout miscalibration: The mobile client timeout (15s) was shorter than our p99 payment latency under load (22s). This wasn't a known gap — we'd never measured p99 under concurrent load before.
- Silent retry: The HTTP library retried on timeout by default. There was no documentation in our codebase noting this behavior. A new engineer had integrated it 8 months earlier without flagging it.
- No deduplication layer: The FastAPI endpoint had no request fingerprinting, no idempotency key check, not even a basic client-generated request ID field in the schema. Every POST was treated as a novel intent.
We had been lucky for two years. Normal payment latency (3–6s) kept us well below the 15-second cliff. The AWS spike was the first time production conditions exposed the gap at scale.
Architecture Fix
The fix required coordinated changes at three layers: client generation, server enforcement, and infrastructure observability.
Layer 1 — Client: UUID per payment attempt. Before initiating any payment request, the mobile app now generates a UUID, stores it in local state, and sends it in the X-Idempotency-Key header. On retry, the same key is sent. The key is cleared only after a confirmed server success or an explicit user cancellation, never on timeout alone.
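The key lifecycle is the part that is easy to get wrong, so here is a minimal sketch of it. It is written in Python for consistency with the server code rather than in the mobile app's language, and `PaymentAttempt` and its method names are hypothetical.

```python
import uuid
from typing import Optional


class PaymentAttempt:
    """Holds one idempotency key for one user-initiated payment."""

    def __init__(self) -> None:
        self._key: Optional[str] = None

    def key(self) -> str:
        # Generate lazily, then reuse the same key for every retry.
        if self._key is None:
            self._key = str(uuid.uuid4())
        return self._key

    def headers(self) -> dict:
        return {"X-Idempotency-Key": self.key()}

    def on_timeout(self) -> None:
        # Deliberately a no-op: the outcome is unknown, so the key is kept.
        pass

    def on_confirmed_success(self) -> None:
        self._key = None  # the next payment gets a fresh key

    def on_user_cancelled(self) -> None:
        self._key = None


if __name__ == "__main__":
    attempt = PaymentAttempt()
    first = attempt.headers()
    attempt.on_timeout()
    print(first == attempt.headers())  # same key is sent on the retry
```

Keeping `on_timeout` as an explicit no-op documents the decision: a timeout says nothing about whether the server charged the card, so the retry must be recognizable as the same attempt.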
Layer 2 — Server: Redis deduplication before Stripe. The FastAPI endpoint checks Redis for the idempotency key before touching Stripe. If the key exists, it returns the cached response immediately — same payload, zero additional charges. If the key is new, it processes normally, stores the result with a 24-hour TTL, then responds.
import json
import uuid

from fastapi import Depends, Header, HTTPException
from redis.asyncio import Redis

# ChargeRequest, get_redis, and stripe_client are defined elsewhere in the
# service; route registration is omitted here.


def is_valid_uuid(value: str) -> bool:
    """Minimal assumed definition of the helper the handler calls."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


async def charge_payment(
    payload: ChargeRequest,
    x_idempotency_key: str = Header(...),
    redis: Redis = Depends(get_redis),
):
    if not is_valid_uuid(x_idempotency_key):
        raise HTTPException(400, "Invalid idempotency key format")

    cache_key = f"idem:{x_idempotency_key}"

    # Return the cached result for duplicate requests
    cached = await redis.get(cache_key)
    if cached:
        return json.loads(cached)

    # Forward the same idempotency key to Stripe: this prevents a double
    # charge even if our Redis write fails after Stripe succeeds
    result = await stripe_client.create_and_confirm_intent(
        amount=payload.amount_cents,
        currency=payload.currency,
        customer_id=payload.stripe_customer_id,
        idempotency_key=x_idempotency_key,  # critical
    )

    response = {
        "payment_intent_id": result.id,
        "status": result.status,
        "amount": result.amount,
    }

    # Cache with a 24-hour TTL, which covers all realistic retry windows
    await redis.setex(cache_key, 86400, json.dumps(response))
    return response
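The Redis lookup adds roughly 1–2 ms of overhead on the happy path, which is negligible against a payment flow that takes 3–6 seconds under normal conditions. We also forward the same idempotency key to Stripe, which adds a second deduplication layer at the gateway in case our Redis write ever fails after Stripe succeeds. The forwarded key matters for in-flight duplicates too: a retry that arrives while the first request is still being processed finds nothing in Redis yet, so it is the Stripe-side key that keeps that retry from creating a second charge.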
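To make the two layers concrete, here is a small in-memory sketch of the same check-then-cache logic. Everything in it is a stand-in: `FakeStripe` models only the "same key, same intent" behavior of a gateway that honors idempotency keys, `FlakyCache` is a dict pretending to be Redis, and swallowing the cache error in `charge` is an assumption of the sketch rather than something the handler above does.

```python
import json


class FakeStripe:
    """Stand-in for a gateway that honors idempotency keys."""

    def __init__(self):
        self.intents_by_key = {}

    def create_and_confirm_intent(self, amount, idempotency_key):
        # Same key returns the same intent; no new charge is created.
        if idempotency_key not in self.intents_by_key:
            self.intents_by_key[idempotency_key] = {
                "payment_intent_id": f"pi_test_{len(self.intents_by_key) + 1}",
                "status": "succeeded",
                "amount": amount,
            }
        return self.intents_by_key[idempotency_key]


class FlakyCache:
    """Dict-backed stand-in for Redis whose writes can be made to fail."""

    def __init__(self, fail_writes=False):
        self.data = {}
        self.fail_writes = fail_writes

    def get(self, key):
        return self.data.get(key)

    def setex(self, key, ttl_seconds, value):
        if self.fail_writes:
            raise ConnectionError("simulated cache write failure")
        self.data[key] = value  # TTL is ignored in this sketch


def charge(cache, stripe, key, amount_cents):
    cache_key = f"idem:{key}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)  # layer 1: API-level deduplication
    response = stripe.create_and_confirm_intent(amount_cents, idempotency_key=key)
    try:
        cache.setex(cache_key, 86400, json.dumps(response))
    except ConnectionError:
        pass  # layer 2: the gateway-level key still protects the retry
    return response


if __name__ == "__main__":
    stripe, cache = FakeStripe(), FlakyCache(fail_writes=True)
    first = charge(cache, stripe, "abc123", 1999)
    retry = charge(cache, stripe, "abc123", 1999)
    print(first == retry, len(stripe.intents_by_key))
```

Even with every cache write failing, the retry resolves to the same intent, which is the property the forwarded key is there to guarantee.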
Layer 3 — Observability: Alert before clients time out. We added a p99 latency alarm on the payment service that fires at 20 seconds, which is 25 seconds below the new 45-second client timeout (raised from the original 15 seconds). The goal is to catch degradation before retry conditions can occur, not just after the damage is done.
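The alarm logic itself is simple arithmetic. The sketch below assumes a window of latency samples in seconds and uses a nearest-rank percentile; `percentile` and `latency_alarm` are illustrative helpers, not the API of our monitoring stack.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_alarm(samples_s, alarm_threshold_s=20.0, client_timeout_s=45.0):
    """Evaluate one window of payment latencies against the alarm threshold."""
    p99 = percentile(samples_s, 99)
    return {
        "p99_s": p99,
        "alarm": p99 >= alarm_threshold_s,
        # Time left before clients would start timing out and retrying.
        "headroom_s": client_timeout_s - p99,
    }


if __name__ == "__main__":
    normal = [4.0] * 100
    degraded = [4.0] * 98 + [21.0, 22.0]
    print(latency_alarm(normal))
    print(latency_alarm(degraded))
```

The `headroom_s` value is the useful number during an incident: it says how far the slowest requests are from the point where retries begin.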
With Idempotency Key — Safe Retry Flow
Mobile Client          API Server                        Redis          Stripe
|                           |                              |               |
| X-Idempotency-Key: abc123 |                              |               |
|--- POST /charge --------->|                              |               |
|                           |-- GET idem:abc123 ---------->|               |
|                           |<-- (nil) --------------------|               |
|                           |--- CreateIntent (key=abc123) --------------->|
| [timeout, retry]          |                              | [processing]  |
|                           |<-- OK ---------------------------------------|
|                           |-- SET idem:abc123, result -->|               |
|                           |                              |               |
| X-Idempotency-Key: abc123 |                              |               |
|--- POST /charge (retry) ->|                              |               |
|                           |-- GET idem:abc123 ---------->|               |
|                           |<-- (cached!) ----------------|               |
|<-- 200 (cached response) -|                              |               |
|                           |                              |               |
One charge. One intent. Stripe billed once. Client satisfied.
Lessons Learned
- Measure p99 under load, not p50 at rest. Our 15-second timeout looked safe against a 3-second average. It was not safe against a 22-second p99 during a 2,000 req/s payment spike. We now benchmark p95 and p99 for every external-facing endpoint and document them alongside timeout configuration.
- Retry logic and idempotency are a package deal. Any HTTP client that retries on failure must send idempotency keys. We added a CI lint rule in our mobile codebase: any POST to a payment, notification, or order endpoint without an X-Idempotency-Key header is a build error.
- Forward idempotency keys to downstream services. Our Redis cache deduplicates at the API layer, but Stripe also supports idempotency keys natively. Forwarding the same key to Stripe adds a second safety net for the race condition where our Redis write fails after a successful Stripe charge.
- Alert below your timeout thresholds. A p99 latency alarm at 20 seconds (below the 45-second client timeout) gives us a 25-second window to intervene before retry storms become possible. Previously we had no payment latency alarm at all.
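The lint rule from the second lesson can be approximated as a check over request descriptions. This is a rough sketch of the idea only: the real rule lives in the mobile build, and the dict shape, `GUARDED_SEGMENTS`, and `missing_idempotency_key` are hypothetical.

```python
# Path segments that mark an endpoint as covered by the rule (assumed names).
GUARDED_SEGMENTS = ("payments", "notifications", "orders")


def missing_idempotency_key(requests):
    """Return the request descriptions that would fail the rule.

    Each request is a dict with "method", "path", and optional "headers".
    """
    violations = []
    for req in requests:
        is_post = req["method"].upper() == "POST"
        segments = req["path"].split("/")
        guarded = any(seg in segments for seg in GUARDED_SEGMENTS)
        header_names = {name.lower() for name in req.get("headers", {})}
        if is_post and guarded and "x-idempotency-key" not in header_names:
            violations.append(req)
    return violations


if __name__ == "__main__":
    calls = [
        {"method": "POST", "path": "/api/v1/payments/charge", "headers": {}},
        {
            "method": "POST",
            "path": "/api/v1/payments/charge",
            "headers": {"X-Idempotency-Key": "abc123"},
        },
        {"method": "GET", "path": "/api/v1/payments/history"},
    ]
    for bad in missing_idempotency_key(calls):
        print("build error:", bad["method"], bad["path"])
```

Matching on whole path segments rather than substrings keeps unrelated routes from tripping the rule.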
The refunds landed in 2 hours. The architectural fix shipped 48 hours later. The real lesson wasn't about idempotency keys — it was that two years of fast payment responses had masked a structural flaw in our API contract. One AWS hiccup was all it took.
— Built from a real production incident. Dollar figures and user counts are approximate but structurally accurate.