Our OpenAI Bill Went From $23 to $4,200 in 48 Hours — A Missing Stop Sequence Did It
On the Monday after a long weekend, our Head of Engineering opened the OpenAI billing dashboard and thought the page had loaded incorrectly. The previous month's bill was $23. The current month's running total was $4,218.40 — accumulated over 48 hours. We had processed 39.9 million tokens across 14,000 API calls while everyone was away. The model wasn't doing useful work. It was caught in a generation loop, producing output that fed back into itself, indefinitely, at $0.03 per 1,000 tokens.
The Pipeline: Feedback Categorisation at Scale
We'd built a background pipeline to process user feedback submissions — bug reports, feature requests, NPS responses — and automatically categorise them, extract action items, and route them to the right team channel. It used GPT-4 Turbo via the OpenAI API, ran as a queue consumer in AWS SQS, and had been working reliably for two months.
The prompt was roughly:
const prompt = `
You are a product feedback analyst. Given the following user feedback,
output a JSON object with:
- category: one of [bug, feature, question, complaint, praise]
- severity: one of [critical, high, medium, low]
- summary: a 1-2 sentence summary
- actionItems: array of specific action items for the product team
- sentiment: score from -1.0 to 1.0
User feedback:
${feedbackText}
Respond with only the JSON object, no markdown fencing.
`;
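Downstream, the consumer parsed the model's reply directly with JSON.parse and a shape check before routing. A minimal sketch of that step — the field names follow the prompt above, but the helper name and validation details are illustrative:

```javascript
// Sketch of the parse-and-validate step; it throws on anything malformed,
// which is what sent messages back to the queue in this incident.
function parseFeedbackAnalysis(raw) {
  const parsed = JSON.parse(raw); // throws SyntaxError on malformed output
  const required = ['category', 'severity', 'summary', 'actionItems', 'sentiment'];
  for (const field of required) {
    if (!(field in parsed)) {
      throw new Error(`missing field: ${field}`);
    }
  }
  return parsed;
}
```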
For two months, this worked perfectly. The JSON came back clean, parsed correctly, routed correctly. Then we shipped a change.
The Change That Broke Everything
A product manager asked for one addition: a suggestedResponse field —
a draft reply we could send back to the user. The prompt was updated to include it.
We tested it manually with 10 feedback samples. All 10 worked. We deployed on Friday
afternoon.
The new field was described as: "suggestedResponse: a friendly, empathetic response to send to the user, acknowledging their feedback and describing next steps."
What we didn't realise: for certain long, emotional feedback submissions — particularly
NPS detractor responses that ran to several paragraphs — the model would generate a
long suggestedResponse that itself contained user-like language. Our response
parser, which scanned for the closing } of the JSON object, couldn't find it because
the suggestedResponse value contained unescaped curly braces from template-style
placeholders in the model's draft reply text. The parse failed. The message went back
to the queue. The queue consumer retried it. GPT-4 was called again.
THE LOOP
─────────────────────────────────────────────────────────────
Long feedback message arrives in SQS
│
▼
GPT-4 called → generates suggestedResponse with {template} syntax
│
▼
JSON.parse() throws SyntaxError
(unescaped { } in suggestedResponse string value)
│
▼
Error handler: message returns to queue (visibility timeout: 30s)
│
▼
Consumer picks it up again after 30s
│
└──── back to top ──── (repeats forever)
SQS maxReceiveCount: not set (default: unlimited)
Dead letter queue: configured but wrong ARN (never received)
Alert threshold: $500 spend (never reached before weekend started)
14,000 API calls × avg 2,850 tokens each = 39.9M tokens
Cost: $4,218.40 over 48 hours
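The call volume falls straight out of the visibility timeout: a message that always fails is redelivered roughly every 30 seconds, so a couple of stuck messages are enough to generate five figures of API calls over a weekend. A back-of-envelope sketch — the 30s timeout, 48-hour window, and 14,000 calls are from the incident, the rest is arithmetic:

```javascript
// Back-of-envelope: how a 30s visibility timeout compounds over 48 hours.
const visibilityTimeoutSec = 30;
const windowHours = 48;

// One permanently failing message is redelivered once per visibility timeout.
const retriesPerMessage = (windowHours * 3600) / visibilityTimeoutSec; // 5,760

// ~14,000 observed API calls therefore implies only a handful of stuck messages.
const stuckMessages = Math.round(14000 / retriesPerMessage); // ~2
```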
Why Our Safeguards All Failed Simultaneously
We had three things in place that should have caught this. None of them did.
1. SQS Dead Letter Queue. We had configured a DLQ, but when we'd migrated queue infrastructure two months earlier, the DLQ ARN in our Terraform config pointed to the old environment. Messages never reached the DLQ. We had never verified this after the migration because the happy path worked.
2. Spend Alerts. We'd set a CloudWatch billing alert at $500. The spend started Friday evening. By the time the alert would have fired, it was Sunday morning — except AWS billing alerts have a known lag of 6–12 hours due to usage aggregation delays. The alert fired at $3,847 on Sunday afternoon. No one saw it until Monday.
3. Error Rate Monitoring. We tracked the error rate of the feedback
pipeline. But the consumer was catching the JSON.parse exception internally
and returning the message to the queue — it wasn't surfaced as an application error.
From our metrics' perspective, the consumer was healthy: receiving messages, processing
them, no uncaught exceptions.
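In hindsight, the consumer needed to treat a parse failure as a first-class error signal instead of swallowing it. One way to do that, sketched with a hypothetical emitMetric function standing in for whatever metrics client you use:

```javascript
// Sketch: surface swallowed parse failures to the metrics pipeline.
// emitMetric is hypothetical; substitute your own metrics client.
function processBody(body, emitMetric) {
  try {
    return JSON.parse(body);
  } catch (err) {
    emitMetric('feedback_pipeline.parse_failure', 1); // now visible as an error rate
    throw err; // let queue-level retry/DLQ policy decide what happens next
  }
}
```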
The Two Root Causes
When we did the post-mortem, we identified two independent root causes, either of which alone would have prevented the incident:
Root Cause 1: No stop sequence or max_tokens cap. Our API call had no
stop parameter and no max_tokens limit. The model could
generate arbitrarily long output. For the specific inputs that triggered the loop,
each call was generating ~4,200 tokens before the context window cut it off. With a
max_tokens: 800 cap (more than enough for our output), each failed call
would have cost roughly 80% less, and we'd have stayed under the alert threshold far longer.
Root Cause 2: Retry logic with no ceiling and a broken DLQ.
When the redrive policy is missing or misconfigured, SQS applies no
maxReceiveCount at all, so a message that perpetually fails to parse will perpetually retry.
The combination of infinite retry + broken DLQ + no per-message retry counter
meant there was no circuit breaker at any layer.
The Fixes
We shipped four changes the same day:
// 1. Always cap tokens
const response = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [{ role: 'user', content: prompt }],
  max_tokens: 800,                            // hard cap on output length
  stop: ['}\n'],                              // stop once the JSON object closes
  temperature: 0.2,                           // reduce variability for structured output
  response_format: { type: 'json_object' },   // enforce JSON mode
});
// 2. Track retry count per message, dead-letter after 3 attempts
const receiveCount = parseInt(message.Attributes?.ApproximateReceiveCount || '1', 10);
if (receiveCount > 3) {
  await sendToDeadLetter(message, 'max_retries_exceeded');
  await deleteFromQueue(message);
  return;
}
// 3. Validate DLQ ARN on startup
await validateQueueExists(process.env.DLQ_URL!);
// 4. Set spend alert at $50, not $500
// (done in AWS console + Terraform)
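The validateQueueExists check from fix 3 can be as simple as asking SQS to describe the queue at startup and crashing if it can't. A sketch, written against any client object that exposes a getQueueAttributes method (the AWS SDK in production, a stub in tests):

```javascript
// Sketch: fail fast at startup if the DLQ URL points at a queue that no
// longer exists (e.g. a torn-down environment after a migration).
async function validateQueueExists(sqsClient, queueUrl) {
  // Rejects (e.g. QueueDoesNotExist) for a stale or mistyped URL.
  await sqsClient.getQueueAttributes({
    QueueUrl: queueUrl,
    AttributeNames: ['QueueArn'],
  });
}
```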
The response_format: { type: 'json_object' } change was the most impactful.
OpenAI's JSON mode guarantees syntactically valid JSON output — no markdown fencing,
no trailing text, no malformed string values — provided the generation isn't truncated
by max_tokens. This alone would have prevented the entire incident. We hadn't
used it because it was a newer API feature that wasn't in the documentation we'd referenced
when we originally built the pipeline.
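One caveat we now code for: JSON mode only guarantees well-formed output when the model finishes on its own. If max_tokens cuts the generation off, the API reports finish_reason: 'length' and the body can end mid-string, so we check before parsing. A sketch:

```javascript
// Sketch: don't parse (or retry) a generation that hit the token cap.
function extractJson(choice) {
  if (choice.finish_reason === 'length') {
    // Truncated output will never parse, and retrying won't fix it either.
    throw new Error('output truncated at max_tokens: flag for human review');
  }
  return JSON.parse(choice.message.content);
}
```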
What OpenAI Did
We contacted OpenAI support the same day. They reviewed the usage logs, confirmed the pattern was consistent with a runaway retry loop, and refunded $3,400 as a one-time goodwill credit. We were grateful, but we're under no illusion that this was guaranteed — their terms don't require it. The financial exposure was real.
Lessons
- Always set max_tokens. Never let an LLM API call have unlimited output length, especially in automated pipelines. Calculate your maximum expected output and cap at 1.5x that.
- Use response_format: { type: 'json_object' } for structured output. It's not just a convenience — it's a safety mechanism. Parsing unstructured JSON out of free-form LLM output is a reliability anti-pattern.
- Verify your dead letter queue actually works. Send a test message that will always fail and confirm it reaches the DLQ. Do this after every infrastructure migration.
- Spend alerts lag by hours. Set them at 10% of your pain threshold, not at the threshold itself. A $50 alert would have fired Sunday morning when someone could still act.
- LLM API cost is unbounded by default. Unlike compute costs that are capped by instance size, token costs scale linearly with runaway loops. Treat every LLM call as a potential infinite cost if your retry logic is broken.
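One way to act on that last lesson is an in-process circuit breaker on cumulative token usage, tripped long before a billing alert could fire. A sketch, with the threshold purely illustrative:

```javascript
// Sketch: in-process token budget as a cost circuit breaker.
// Feed it response.usage after each API call; it throws once the
// cumulative total exceeds the budget, halting the pipeline.
function makeTokenBudget(maxTotalTokens) {
  let used = 0;
  return {
    record(usage) {
      used += usage.total_tokens;
      if (used > maxTotalTokens) {
        throw new Error(`token budget exceeded: ${used} > ${maxTotalTokens}`);
      }
    },
    get used() { return used; },
  };
}
```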