How AWS SQS Visibility Timeout Caused 847 Orders to Be Processed Twice
Monday morning, 7:42 AM. Our on-call engineer gets paged: "Multiple customers reporting duplicate charges." We pull up the dashboard. Order count is 3× what it should be. Stripe shows the same order IDs appearing multiple times. In under four hours, 847 orders had been processed twice — some customers billed three times. We had a queue. We had idempotency keys. We thought we were safe.
Production Failure
Our e-commerce checkout pipeline looked textbook clean. The checkout service pushed order messages to an AWS SQS
queue. A fleet of Node.js workers polled the queue, processed each order — database writes, Stripe charge,
confirmation email — then deleted the message. At-least-once delivery was handled with an idempotency key:
a unique order_id stored in PostgreSQL. Before touching Stripe, every worker ran:
async function processOrder(orderId: string, payload: OrderPayload) {
  // Idempotency check — skip if already processed
  const existing = await db.query(
    'SELECT processed FROM orders WHERE order_id = $1',
    [orderId]
  );
  if (existing.rows[0]?.processed) {
    console.log(`Order ${orderId} already processed, skipping`);
    return;
  }

  // Process: charge Stripe, write DB, send email (~35–45s total)
  const charge = await stripe.charges.create({ amount: payload.amount, currency: 'usd', source: payload.token });
  await db.query('INSERT INTO orders (order_id, charge_id, processed) VALUES ($1, $2, true)', [orderId, charge.id]);
  await sendConfirmationEmail(payload.email, orderId);

  // Delete the SQS message — but are we still within the visibility window?
  await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: payload.receiptHandle }).promise();
}
This looked safe. It wasn't. The bug wasn't in the code — it was in the interaction between our processing latency and a single SQS configuration value we'd never thought to question.
False Assumptions
We had made three compounding assumptions, each reasonable in isolation, lethal in combination:
- Idempotency check is sufficient. We assumed a pre-flight SELECT would catch duplicates. It would — if only one worker ever ran the check at a time.
- Workers are fast enough. In development, order processing took 8–12 seconds. We never profiled it under production load with real Stripe latency and a busy PostgreSQL cluster.
- SQS visibility timeout is a safety net, not a deadline. We set it to 30 seconds and forgot about it. It is, in fact, a hard deadline with a sharp edge.
The SQS documentation is explicit: if a consumer does not delete a message before the visibility timeout expires, SQS makes the message visible again for redelivery. We knew this. We just never connected it to our processing latency.
Reproducing the Bottleneck
Before diving into CloudWatch, we timed actual order processing in production over 200 samples. Production was 3–4× slower than staging:

- Stripe API: 8–15 seconds
- PostgreSQL (write-heavy transactions under real load): 10–12 seconds
- Confirmation email (SendGrid): 5–8 seconds

Total: routinely 40+ seconds — well past our 30-second timeout.
BEFORE: Visibility Timeout Race Condition
─────────────────────────────────────────────────────────────────────
SQS Queue
┌─────────────────────────────────────────────────────────┐
│ [Order #847] visibility_timeout = 30s │
└─────────────────────────────────────────────────────────┘
│
│ t=0s: Worker A pulls message, message hidden
▼
┌──────────────┐
│ Worker A │ t=0s: SELECT processed → false ✓
│ │ t=5s: Stripe charge begins
│ │ t=20s: Stripe responds
│ PROCESSING │ t=28s: DB write begins
│ │
└──────────────┘
·
· t=30s: VISIBILITY TIMEOUT EXPIRES
· Message becomes visible again!
·
▼
┌──────────────┐
│ Worker B │ t=31s: Pulls same message
│ │ t=31s: SELECT processed → false ✓ (Worker A hasn't committed yet)
│ │ t=36s: Stripe charge begins — DUPLICATE CHARGE
│ PROCESSING │
│ │
└──────────────┘
·
· t=38s: Worker A commits, deletes message
· t=42s: Worker B commits, deletes message (receipt handle still valid)
· Two Stripe charges. One customer. One very angry morning.
Investigation: The Sawtooth in CloudWatch
Once we knew what to look for, the CloudWatch metrics told the whole story. The
ApproximateNumberOfMessagesNotVisible metric — messages currently being processed —
showed a distinctive sawtooth pattern: flat for ~30 seconds, then a spike as messages re-emerged,
then flat again. Every 30 seconds, like clockwork.
We also pulled NumberOfMessagesSent vs NumberOfMessagesDeleted over the
four-hour incident window. Messages deleted exceeded messages sent by 847 — one extra delete per
duplicated order, as Worker A and Worker B each deleted the same message after processing it.
SQS silently accepts deletes for already-deleted messages (it's idempotent at the delete level,
just not at the consume level). Zero errors. Zero alarms. Just silent double-processing.
Root Cause: TOCTOU + Clock Mismatch
Two distinct failure modes compounded each other:
1. Time-of-Check / Time-of-Use (TOCTOU) race: The idempotency check
(SELECT processed WHERE order_id = ?) and the write (INSERT ... processed = true)
were not atomic. Two workers could both pass the check before either had committed its write.
Under normal circumstances this window is milliseconds. With a 30-second visibility timeout expiry,
the window was guaranteed to be hit every time processing ran long.
2. Latency budget never measured: We set the visibility timeout once, in a Terraform file, 18 months before the incident. Nobody had re-examined it as the order pipeline grew — more validation logic, a third-party fraud check, an email delivery confirmation step. The timeout was a number on a line nobody touched.
Architecture Fix
We needed two independent fixes. Either alone was insufficient — the timeout fix alone wouldn't prevent a race on worker restarts; the lock fix alone wouldn't prevent the timeout from triggering unnecessary redeliveries.
Fix 1 — Heartbeat: Extend visibility timeout mid-flight
While a worker is processing, a background interval calls ChangeMessageVisibility every 20 seconds, resetting the timeout so the message stays invisible until the worker explicitly finishes or crashes.
async function processOrderSafely(message: SQS.Message, payload: OrderPayload) {
  const orderId = payload.orderId;

  // --- FIX 1: Heartbeat — keep message invisible while we work ---
  const heartbeat = setInterval(async () => {
    try {
      await sqs.changeMessageVisibility({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
        VisibilityTimeout: 60, // reset the clock: invisible for another 60s from now
      }).promise();
    } catch (err) {
      // Message may have already been deleted — safe to ignore
      console.warn('Heartbeat extension failed (message already gone?)', err);
    }
  }, 20_000); // every 20s, well within the 60s window

  // An advisory *xact* lock lives only as long as its transaction, and
  // pool-level db.query() autocommits every statement, which would release
  // the lock immediately. All queries below must share one client and one
  // explicit transaction.
  const client = await db.connect();
  try {
    await client.query('BEGIN');

    // --- FIX 2: Atomic distributed lock via PostgreSQL advisory lock ---
    // pg_try_advisory_xact_lock takes a bigint; hash the orderId to one
    const lockKey = hashOrderId(orderId); // stable integer hash
    const lockResult = await client.query(
      'SELECT pg_try_advisory_xact_lock($1) AS acquired',
      [lockKey]
    );
    if (!lockResult.rows[0].acquired) {
      console.log(`Order ${orderId} locked by another worker — skipping`);
      await client.query('ROLLBACK');
      return; // another worker has the lock; let it finish and delete the message
    }

    // --- FIX 3: Atomic upsert — insert only if not already processed ---
    const insert = await client.query(`
      INSERT INTO orders (order_id, status, created_at)
      VALUES ($1, 'processing', NOW())
      ON CONFLICT (order_id) DO NOTHING
      RETURNING order_id
    `, [orderId]);
    if (insert.rowCount === 0) {
      // Row committed by a previous run — genuine duplicate, safe to drop
      console.log(`Order ${orderId} already exists — idempotent skip`);
      await client.query('ROLLBACK');
      await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle! }).promise();
      return;
    }

    // Safe to charge — lock held, row owned, no other worker can get here
    const charge = await stripe.charges.create(
      { amount: payload.amount, currency: 'usd', source: payload.token },
      { idempotencyKey: `order-${orderId}` } // Stripe-level idempotency as last resort
    );

    await client.query(
      'UPDATE orders SET status = $1, charge_id = $2 WHERE order_id = $3',
      ['completed', charge.id, orderId]
    );
    await client.query('COMMIT'); // publishes the row, releases the advisory lock
    await sendConfirmationEmail(payload.email, orderId);
    await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: message.ReceiptHandle! }).promise();
  } catch (err) {
    await client.query('ROLLBACK'); // undo the 'processing' row so a retry starts clean
    throw err;
  } finally {
    clearInterval(heartbeat); // always stop the heartbeat
    client.release();
  }
}
The three-layer defense now gives effectively-once processing (SQS delivery itself remains at-least-once) even under concurrent workers and long processing times:
AFTER: Heartbeat + Distributed Lock + Atomic Upsert
────────────────────────────────────────────────────────────────────
SQS Queue
┌──────────────────────────────────────────────────────────────┐
│ [Order #847] initial visibility_timeout = 30s │
└──────────────────────────────────────────────────────────────┘
│
│ t=0s: Worker A pulls message
▼
┌──────────────────────────────────────────────┐
│ Worker A │
│ │
│ t=0s: pg_try_advisory_xact_lock → true ✓ │
│ t=0s: INSERT ... ON CONFLICT DO NOTHING │
│ → 1 row inserted ✓ │
│ │
│ ┌── Heartbeat (every 20s) ──────────────┐ │
│ │ t=20s: extend visibility → +60s ✓ │ │
│ │ t=40s: extend visibility → +60s ✓ │ │
│ └───────────────────────────────────────┘ │
│ │
│ t=42s: Stripe charge complete │
│ t=44s: DB update, email sent │
│ t=44s: sqs.deleteMessage() ✓ │
│ t=44s: clearInterval(heartbeat) │
└──────────────────────────────────────────────┘
If Worker B somehow pulls the same message (e.g., on restart):
┌──────────────────────────────────────────────┐
│ Worker B │
│ pg_try_advisory_xact_lock → false ✗ │
│ → returns immediately, no Stripe call │
└──────────────────────────────────────────────┘
Result: Zero duplicate charges. Message processed exactly once.
Results After the Fix
Beyond the code changes, we added a CloudWatch alarm on the ratio of NumberOfMessagesDeleted to
NumberOfMessagesSent. Anything over 1.05 (5% tolerance for legitimate retries)
now pages the on-call engineer immediately.
Lessons Learned
- SQS visibility timeout is a hard deadline, not a suggestion. Treat it like a maximum processing budget. If you can't guarantee completion within it, you must use heartbeats. Rule of thumb: set initial timeout to 2× your p50 latency; extend via heartbeat for anything longer.
- Idempotency checks must be atomic. A SELECT followed by an INSERT is a TOCTOU race. Use INSERT ... ON CONFLICT DO NOTHING and check rowCount. If it's 0, another worker already claimed this work.
- At-least-once delivery requires true idempotency, not just a guard clause. "We have idempotency keys" is only true if the write that sets the key and the side effect it guards are in the same atomic operation — or protected by a distributed lock.
- Configuration values drift away from reality silently. That 30-second timeout was set when the pipeline took 10 seconds. Nobody re-examined it as we added fraud checks, email confirmations, and retry logic. Add latency budget reviews to your architecture change process, not just code review.
- Stripe-level idempotency is a last resort, not first-line defense. We added an idempotency key to our Stripe calls as a third layer. Stripe deduplicates identical keys within 24 hours. But don't rely on your payment provider to catch what your architecture should have prevented.
— Darshan Turakhia · Full Stack Engineer · March 2026