Our OpenAI Bill Went From $23 to $4,200 in 48 Hours — A Missing Stop Sequence Did It
On the Monday after a long weekend, our Head of Engineering opened the OpenAI billing dashboard and thought the page had loaded incorrectly. The previous month's bill was $23. The current month's running total was $4,218.40 — accumulated over 48 hours. We had processed 39.9 million tokens across 14,000 API calls while everyone was away. The model wasn't doing useful work. It was caught in a generation loop, producing output that fed back into itself, indefinitely, at $0.03 per 1,000 tokens.
The Pipeline: Feedback Categorisation at Scale
We'd built a background pipeline to process user feedback submissions — bug reports, feature requests, NPS responses — and automatically categorise them, extract action items, and route them to the right team channel. It used GPT-4 Turbo via the OpenAI API, ran as a queue consumer in AWS SQS, and had been working reliably for two months.
The prompt was roughly:
const prompt = `
You are a product feedback analyst. Given the following user feedback,
output a JSON object with:
- category: one of [bug, feature, question, complaint, praise]
- severity: one of [critical, high, medium, low]
- summary: a 1-2 sentence summary
- actionItems: array of specific action items for the product team
- sentiment: score from -1.0 to 1.0
User feedback:
${feedbackText}
Respond with only the JSON object, no markdown fencing.
`;
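Downstream, the consumer parsed the model's reply directly with JSON.parse and a shape check before routing. A minimal sketch of that step — the field names follow the prompt above, but the helper name and validation details are illustrative:

```javascript
// Sketch of the parse-and-validate step; it throws on anything malformed,
// which is what sent messages back to the queue in this incident.
function parseFeedbackAnalysis(raw) {
  const parsed = JSON.parse(raw); // throws SyntaxError on malformed output
  const required = ['category', 'severity', 'summary', 'actionItems', 'sentiment'];
  for (const field of required) {
    if (!(field in parsed)) {
      throw new Error(`missing field: ${field}`);
    }
  }
  return parsed;
}
```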
For two months, this worked perfectly. The JSON came back clean, parsed correctly, routed correctly. Then we shipped a change.
The Change That Broke Everything
A product manager asked for one addition: a suggestedResponse field —
a draft reply we could send back to the user. The prompt was updated to include it.
We tested it manually with 10 feedback samples. All 10 worked. We deployed on Friday
afternoon.
The new field was described as: "suggestedResponse: a friendly, empathetic response to send to the user, acknowledging their feedback and describing next steps."
What we didn't realise: for certain long, emotional feedback submissions — particularly
NPS detractor responses that ran to several paragraphs — the model would generate a
long suggestedResponse that itself contained user-like language. Our response
parser, which scanned for the closing } of the JSON object, couldn't find it because
the suggestedResponse value contained unescaped curly braces from template-style
placeholders in the model's draft reply text. The parse failed. The message went back
to the queue. The queue consumer retried it. GPT-4 was called again.
THE LOOP
─────────────────────────────────────────────────────────────
Long feedback message arrives in SQS
│
▼
GPT-4 called → generates suggestedResponse with {template} syntax
│
▼
JSON.parse() throws SyntaxError
(unescaped { } in suggestedResponse string value)
│
▼
Error handler: message returns to queue (visibility timeout: 30s)
│
▼
Consumer picks it up again after 30s
│
└──── back to top ──── (repeats forever)
SQS maxReceiveCount: not set (default: unlimited)
Dead letter queue: configured but wrong ARN (never received)
Alert threshold: $500 spend (never reached before weekend started)
14,000 API calls × avg 2,850 tokens each = 39.9M tokens
Cost: $4,218.40 over 48 hours
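The call volume falls straight out of the visibility timeout: a message that always fails is redelivered roughly every 30 seconds, so a couple of stuck messages are enough to generate five figures of API calls over a weekend. A back-of-envelope sketch — the 30s timeout, 48-hour window, and 14,000 calls are from the incident, the rest is arithmetic:

```javascript
// Back-of-envelope: how a 30s visibility timeout compounds over 48 hours.
const visibilityTimeoutSec = 30;
const windowHours = 48;

// One permanently failing message is redelivered once per visibility timeout.
const retriesPerMessage = (windowHours * 3600) / visibilityTimeoutSec; // 5,760

// ~14,000 observed API calls therefore implies only a handful of stuck messages.
const stuckMessages = Math.round(14000 / retriesPerMessage); // ~2
```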
Why Our Safeguards All Failed Simultaneously
We had three things in place that should have caught this. None of them did.
1. SQS Dead Letter Queue. We had configured a DLQ, but when we'd migrated queue infrastructure two months earlier, the DLQ ARN in our Terraform config pointed to the old environment. Messages never reached the DLQ. We had never verified this after the migration because the happy path worked.
2. Spend Alerts. We'd set a CloudWatch billing alert at $500. The spend started Friday evening. By the time the alert would have fired, it was Sunday morning — except AWS billing alerts have a known lag of 6–12 hours due to usage aggregation delays. The alert fired at $3,847 on Sunday afternoon. No one saw it until Monday.
3. Error Rate Monitoring. We tracked the error rate of the feedback
pipeline. But the consumer was catching the JSON.parse exception internally
and returning the message to the queue — it wasn't surfaced as an application error.
From our metrics' perspective, the consumer was healthy: receiving messages, processing
them, no uncaught exceptions.
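In hindsight, the consumer needed to treat a parse failure as a first-class error signal instead of swallowing it. One way to do that, sketched with a hypothetical emitMetric function standing in for whatever metrics client you use:

```javascript
// Sketch: surface swallowed parse failures to the metrics pipeline.
// emitMetric is hypothetical; substitute your own metrics client.
function processBody(body, emitMetric) {
  try {
    return JSON.parse(body);
  } catch (err) {
    emitMetric('feedback_pipeline.parse_failure', 1); // now visible as an error rate
    throw err; // let queue-level retry/DLQ policy decide what happens next
  }
}
```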
The Two Root Causes
When we did the post-mortem, we identified two independent root causes, either of which alone would have prevented the incident:
Root Cause 1: No stop sequence or max_tokens cap. Our API call had no
stop parameter and no max_tokens limit. The model could
generate arbitrarily long output. For the specific inputs that triggered the loop,
each call was generating ~4,200 tokens before the context window cut it off. With a
max_tokens: 800 cap (more than enough for our output), each failed call
would have cost roughly 80% less, and we'd have stayed under the alert threshold far longer.
Root Cause 2: Retry logic with no ceiling and a broken DLQ.
When the redrive policy is missing or misconfigured, SQS applies no
maxReceiveCount at all, so a message that perpetually fails to parse will perpetually retry.
The combination of infinite retry + broken DLQ + no per-message retry counter
meant there was no circuit breaker at any layer.
The Fixes
We shipped four changes the same day:
// 1. Always cap tokens
const response = await openai.chat.completions.create({
  model: 'gpt-4-turbo',
  messages: [{ role: 'user', content: prompt }],
  max_tokens: 800,                            // hard cap on output length
  stop: ['}\n'],                              // stop once the JSON object closes
  temperature: 0.2,                           // reduce variability for structured output
  response_format: { type: 'json_object' },   // enforce JSON mode
});
// 2. Track retry count per message, dead-letter after 3 attempts
const receiveCount = parseInt(message.Attributes?.ApproximateReceiveCount || '1', 10);
if (receiveCount > 3) {
  await sendToDeadLetter(message, 'max_retries_exceeded');
  await deleteFromQueue(message);
  return;
}
// 3. Validate DLQ ARN on startup
await validateQueueExists(process.env.DLQ_URL!);
// 4. Set spend alert at $50, not $500
// (done in AWS console + Terraform)
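The validateQueueExists check from fix 3 can be as simple as asking SQS to describe the queue at startup and crashing if it can't. A sketch, written against any client object that exposes a getQueueAttributes method (the AWS SDK in production, a stub in tests):

```javascript
// Sketch: fail fast at startup if the DLQ URL points at a queue that no
// longer exists (e.g. a torn-down environment after a migration).
async function validateQueueExists(sqsClient, queueUrl) {
  // Rejects (e.g. QueueDoesNotExist) for a stale or mistyped URL.
  await sqsClient.getQueueAttributes({
    QueueUrl: queueUrl,
    AttributeNames: ['QueueArn'],
  });
}
```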
The response_format: { type: 'json_object' } change was the most impactful.
OpenAI's JSON mode guarantees syntactically valid JSON output — no markdown fencing,
no trailing text, no malformed string values — provided the generation isn't truncated
by max_tokens. This alone would have prevented the entire incident. We hadn't
used it because it was a newer API feature that wasn't in the documentation we'd referenced
when we originally built the pipeline.
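One caveat we now code for: JSON mode only guarantees well-formed output when the model finishes on its own. If max_tokens cuts the generation off, the API reports finish_reason: 'length' and the body can end mid-string, so we check before parsing. A sketch:

```javascript
// Sketch: don't parse (or retry) a generation that hit the token cap.
function extractJson(choice) {
  if (choice.finish_reason === 'length') {
    // Truncated output will never parse, and retrying won't fix it either.
    throw new Error('output truncated at max_tokens: flag for human review');
  }
  return JSON.parse(choice.message.content);
}
```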
What OpenAI Did
We contacted OpenAI support the same day. They reviewed the usage logs, confirmed the pattern was consistent with a runaway retry loop, and refunded $3,400 as a one-time goodwill credit. We were grateful, but we're under no illusion that this was guaranteed — their terms don't require it. The financial exposure was real.
Lessons
- Always set max_tokens. Never let an LLM API call have unlimited output length, especially in automated pipelines. Calculate your maximum expected output and cap at 1.5x that.
- Use response_format: { type: 'json_object' } for structured output. It's not just a convenience — it's a safety mechanism. Parsing unstructured JSON out of free-form LLM output is a reliability anti-pattern.
- Verify your dead letter queue actually works. Send a test message that will always fail and confirm it reaches the DLQ. Do this after every infrastructure migration.
- Spend alerts lag by hours. Set them at 10% of your pain threshold, not at the threshold itself. A $50 alert would have fired Sunday morning when someone could still act.
- LLM API cost is unbounded by default. Unlike compute costs that are capped by instance size, token costs scale linearly with runaway loops. Treat every LLM call as a potential infinite cost if your retry logic is broken.
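One way to act on that last lesson is an in-process circuit breaker on cumulative token usage, tripped long before a billing alert could fire. A sketch, with the threshold purely illustrative:

```javascript
// Sketch: in-process token budget as a cost circuit breaker.
// Feed it response.usage after each API call; it throws once the
// cumulative total exceeds the budget, halting the pipeline.
function makeTokenBudget(maxTotalTokens) {
  let used = 0;
  return {
    record(usage) {
      used += usage.total_tokens;
      if (used > maxTotalTokens) {
        throw new Error(`token budget exceeded: ${used} > ${maxTotalTokens}`);
      }
    },
    get used() { return used; },
  };
}
```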