Our AI Documentation Bot Invented 14 API Routes That Never Existed — 6,000 Users Integrated Against Them
On a Tuesday afternoon, a developer from one of our largest enterprise customers opened a
support ticket that read: "Your POST /v2/webhooks/replay endpoint keeps
returning 404. Has it been deprecated?" We checked our route table.
POST /v2/webhooks/replay had never existed. Our AI documentation assistant
had invented it — described it in detail, with request/response examples, rate limit notes,
and error codes — and at least 6,000 developers had read that page.
This is the story of what happens when you deploy an LLM without a ground-truth validation layer, and what it costs when your documentation starts lying at scale.
The Setup: A Docs Bot That Seemed to Work Perfectly
We built an internal documentation assistant for our REST API — a RAG (Retrieval-Augmented Generation) system using GPT-4 Turbo. The idea was simple: developers ask questions in natural language, we retrieve relevant chunks from our OpenAPI spec and markdown docs, and GPT-4 synthesises a helpful answer.
In testing, it was genuinely impressive. It answered questions about auth flows, pagination patterns, webhook configurations — all correctly, all with accurate code examples. We ran 50 manual test cases. It passed 48. The two failures were minor phrasings, not factual errors. We shipped it.
ARCHITECTURE (what we built)
─────────────────────────────────────────────────────────────
Developer question
│
▼
Embedding model (text-embedding-3-small)
│
▼
Vector DB search → top-5 relevant doc chunks
│
▼
GPT-4 Turbo prompt:
"Answer using ONLY the context below. If unsure, say so."
│
▼
Response rendered in docs site
What we assumed: "If unsure, say so" would prevent hallucination
What actually happened: it didn't
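The retrieval step in the diagram above can be sketched as a few lines of plain Python. This is a minimal, self-contained illustration using toy 3-dimensional vectors and cosine similarity; the real system used text-embedding-3-small and a vector database, and the chunk texts and helper names here are illustrative only:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k_chunks(query_vec, chunks, k=5):
    # chunks: list of (embedding, text) pairs; returns the k most similar texts
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Toy "embeddings" standing in for real model output
docs = [
    ([0.9, 0.1, 0.0], "Webhooks retry up to 3 times on failure"),
    ([0.8, 0.2, 0.1], "Webhook events have an event_id field"),
    ([0.1, 0.9, 0.0], "Pagination uses cursor and limit parameters"),
]
query = [0.85, 0.15, 0.05]  # pretend embedding of "Can I replay a failed webhook?"
context = top_k_chunks(query, docs, k=2)
```

Note what this step cannot do: it returns the nearest chunks whether or not any of them actually answers the question, which is exactly the gap the model later filled by extrapolating.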
The Hallucinations Were Confident, Detailed, and Wrong
Three weeks after launch, a customer success manager noticed something odd while helping
a customer: the bot was describing a GET /v2/events/stream endpoint that didn't
match anything in our actual API. She flagged it internally. We pulled the conversation logs.
What we found was worse than a few vague answers. The model had generated 14 completely fabricated endpoints across 847 conversation threads. These weren't vague suggestions — they were fully formed:
- POST /v2/webhooks/replay: "replays a failed webhook delivery" (with request body schema, retry logic description, and example response)
- GET /v2/events/stream: "returns a Server-Sent Events stream for real-time event delivery" (with SSE format example)
- DELETE /v2/integrations/:id/cache: "clears cached integration state" (with a 204 response description)
- PATCH /v2/users/bulk: "batch-updates user attributes" (with pagination and rate limit notes)
Every one of these was plausible. They were exactly the kind of endpoints our API should have had. The model had interpolated from patterns in our existing routes and constructed logical, believable neighbours. This is precisely what makes LLM hallucination so dangerous: it doesn't make things up randomly — it makes things up reasonably.
Why "Answer Using Only the Context" Didn't Work
This is the part that genuinely surprised us. Our system prompt explicitly instructed the model to use only the retrieved context. We'd read the RAG playbooks. We'd followed the instructions. But the prompt instruction had a critical gap: it told the model to use the context, not to refuse to extrapolate beyond it.
The difference matters enormously. When a developer asked "Can I replay a failed webhook?" and our vector search returned chunks about webhook configuration and retry policies — but no chunk about a replay endpoint — the model faced a choice: say "I don't know" or synthesise a plausible answer from what it did know. GPT-4 is trained to be helpful. It synthesised.
THE FAILURE MODE
─────────────────────────────────────────────────────────────
User: "Can I replay a failed webhook?"

Retrieved context:
- Chunk 1: "Webhooks retry up to 3 times on failure"
- Chunk 2: "Webhook events have an event_id field"
- Chunk 3: "Webhook status can be: pending, delivered, failed"

No chunk: "Here is how to replay webhooks"

Model reasoning (inferred):
"Retries exist. Events have IDs. Failures are trackable.
Logically, a replay endpoint should exist."

Model output: "Yes! Use POST /v2/webhooks/replay with the event_id..."

Reality: endpoint does not exist, never did
The Blast Radius
By the time we caught it, the damage was already distributed. Our documentation bot's
responses were being shared in Slack threads, Stack Overflow answers, and internal wikis
at customer companies. We found the fabricated GET /v2/events/stream endpoint
referenced in a Medium article, two GitHub repos, and a YouTube tutorial about our platform.
The support impact was immediate:
- 31 support tickets opened in the first week after we disabled the bot, all referencing hallucinated endpoints
- 4 enterprise customers had already built partial integrations against the fake routes
- One customer had deployed production code that called POST /v2/webhooks/replay as a background job, silently failing every time it ran
We were now in an impossible position: build the endpoints the AI had promised, or tell customers the documentation they'd read was wrong. We chose a combination of both.
The Fix: Ground-Truth Validation Before Every Response
The core architectural change was adding a validation layer between the LLM response and the user. Every API endpoint mentioned in a bot response now gets checked against our OpenAPI spec before the response is served:
REVISED ARCHITECTURE
─────────────────────────────────────────────────────────────
LLM response (raw)
│
▼
Route Extractor
(regex: [A-Z]+ /v[0-9]+/[a-z/:{}_]+ )
│
▼
OpenAPI Spec Validator
- Check each extracted route against spec
- Flag any route not present in spec
│
├─ All routes valid → serve response as-is
│
└─ Unknown route found →
Option A: replace with disclaimer
Option B: regenerate with stricter prompt
Option C: surface for human review
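The extractor-plus-validator step can be sketched in a few dozen lines. This is a minimal illustration, assuming the OpenAPI spec has already been parsed into a dict; the spec fragment and route names below are hypothetical, and a production version would also need to normalise path parameters (OpenAPI's {id} vs. a bot's :id):

```python
import re

# Pattern mirroring the one in the diagram: HTTP verb + versioned path
ROUTE_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+(/v[0-9]+/[A-Za-z0-9/:{}_-]+)")

def known_routes(openapi_spec):
    # Build a set of (METHOD, path) pairs from the spec's "paths" section
    routes = set()
    for path, methods in openapi_spec.get("paths", {}).items():
        for method in methods:
            routes.add((method.upper(), path))
    return routes

def validate_response(text, openapi_spec):
    # Return every (method, path) mention NOT present in the spec
    valid = known_routes(openapi_spec)
    mentioned = [(m.group(1), m.group(2)) for m in ROUTE_RE.finditer(text)]
    return [r for r in mentioned if r not in valid]

# Hypothetical spec fragment and bot response
spec = {"paths": {"/v2/webhooks": {"get": {}, "post": {}}}}
reply = "Yes! Use POST /v2/webhooks/replay with the event_id."
flagged = validate_response(reply, spec)
# flagged is non-empty, so this response would be blocked, regenerated,
# or routed to human review per the options above
```

The key design choice is that validation happens on the rendered response, not the prompt: it does not matter why the model emitted a route, only whether that route exists in the source of truth.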
Also added to system prompt:
"If the retrieved context does not contain a specific endpoint,
say: 'I cannot confirm this endpoint exists. Please check the
official API reference at [URL].'"
We also changed the retrieval strategy. Instead of retrieving by semantic similarity alone, we added a hard filter: if a question contains an HTTP verb pattern, we only synthesise an answer if a retrieved chunk explicitly contains that exact route path. No match, no answer.
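A sketch of that hard filter, under the same assumptions as above (function names and the refusal policy are illustrative; a real version would want fuzzier verb detection than an uppercase-only match):

```python
import re

# Uppercase-only verb match avoids false positives like "get my data"
VERB_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\b")
PATH_RE = re.compile(r"/v[0-9]+/[A-Za-z0-9/:{}_-]+")

def should_answer(question, retrieved_chunks):
    # If the question names a concrete route, require that exact path
    # to appear verbatim in at least one retrieved chunk.
    if not VERB_RE.search(question):
        return True  # no HTTP-verb pattern: normal synthesis is allowed
    paths = PATH_RE.findall(question)
    if not paths:
        return True  # a verb but no concrete path: let retrieval decide
    return all(any(p in chunk for chunk in retrieved_chunks) for p in paths)

chunks = ["Webhooks retry up to 3 times on failure"]
ok = should_answer("Does POST /v2/webhooks/replay exist?", chunks)
# ok is False: no retrieved chunk contains the exact path, so the bot
# refuses instead of synthesising
```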
The prompt change that made the biggest difference was replacing:
// Before
"Answer using ONLY the context below. If unsure, say so."
// After
"Answer ONLY what is explicitly stated in the context below.
Do not infer, extrapolate, or suggest endpoints that are not
literally present in the retrieved text. If the context does
not directly answer the question, respond with exactly:
'I don't have enough context to answer this accurately.
Please refer to the API reference: [URL]'"
The word "explicitly" and the outright prohibition on inference cut the hallucination rate from ~1.6% of responses to 0.03% in post-fix evaluation over 30,000 conversations.
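Measuring a rate like that is straightforward once the validator exists: replay logged responses through it and count flags. A toy version of that evaluation loop, with a stand-in validator against a fixed allowlist (the real check runs against the OpenAPI spec, and the log lines here are invented):

```python
import re

def hallucination_rate(responses, validator):
    # responses: logged bot response strings
    # validator: returns the list of unverifiable routes in a response
    flagged = sum(1 for resp in responses if validator(resp))
    return flagged / len(responses)

# Stand-in validator: flags any route outside a fixed allowlist
ALLOWED = {"/v2/webhooks", "/v2/users"}
ROUTE = re.compile(r"/v[0-9]+/[A-Za-z0-9/_]+")

def toy_validator(text):
    return [p for p in ROUTE.findall(text) if p not in ALLOWED]

logs = [
    "Use GET /v2/webhooks to list webhooks.",
    "Use POST /v2/webhooks/replay to replay.",  # fabricated route
    "Pagination uses cursor parameters.",
]
rate = hallucination_rate(logs, toy_validator)  # 1 flagged out of 3
```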
The Routes We Actually Had to Build
For the 4 hallucinated endpoints that multiple enterprise customers had integrated against,
we made a pragmatic call: build them. POST /v2/webhooks/replay shipped three
weeks later. PATCH /v2/users/bulk was already on the roadmap — it moved up.
This is the uncomfortable reality of hallucination in developer tools: when your AI describes a feature confidently enough, customers will build against it. The hallucination becomes a de-facto product commitment.
Lessons
- "Use only the context" is not a hallucination prevention strategy. It is a preference instruction. Models will still extrapolate when context is adjacent but incomplete. You need validation, not just instruction.
- Plausible hallucinations are more dangerous than obvious ones. A model that invents nonsense is easy to catch. A model that invents reasonable API routes in your own naming convention will fool developers for weeks.
- Ground-truth validation must be domain-specific. For an API docs bot, validate every route against the spec. For a code-generation tool, run the code. For a schema assistant, validate the schema. The LLM's output must be checkable against a source of truth you control.
- Shared doc links spread hallucinations faster than you can patch them. Once a fabricated answer is copy-pasted into a customer's Confluence, it's out of your control. Monitoring conversation logs for hallucinations needs to happen in hours, not weeks.