We Upgraded Our Embedding Model and Our RAG Pipeline Returned Wrong Results for 6 Days
March 15, 2026 · AI · 10 min read

We upgraded our RAG pipeline's embedding model from text-embedding-ada-002 to text-embedding-3-large on a Tuesday afternoon. It was supposed to be a quality improvement — better semantic understanding, higher benchmark scores. For six days, it silently returned wrong answers to 180,000 user queries. Every response had HTTP 200. Every response contained real content from our corpus. Every response was wrong.

Production Failure

We run a B2B knowledge-base assistant — companies ingest their internal documentation and their support agents use it to answer customer questions in real time. Retrieval accuracy is the entire product. If the retrieved context is wrong, the LLM generates a confident, fluent, completely fabricated answer. The support agent sends it. The customer acts on it.

The first complaint arrived on day two: "Your assistant told our agent that the return window is 30 days. It's 14. We've already processed three refunds we shouldn't have." We assumed it was a hallucination issue — LLMs invent things sometimes, that's a known risk. We logged it, bumped the temperature down, and kept going.

Day four, three more tickets. One client flagged that their assistant was quoting pricing from a superseded document — a file they'd explicitly deleted six months ago. That shouldn't be possible. Deleted documents were removed from the index at deletion time. We checked. The document wasn't in the index. The assistant had quoted it anyway.

That's when I started digging.

False Assumptions

Our first assumption: the LLM was hallucinating. We increased the top_k retrieval count from 5 to 10 to give it more context, on the theory that better coverage would reduce fabrication. The wrong answers got worse.

Our second assumption: the vector index had corrupted state. We ran a full integrity check on our pgvector tables — row counts matched document counts, embeddings were non-null, IDs were consistent. Nothing broken.

Our third assumption: the problem was tenant-specific. We pulled retrieval logs for the complaining clients and compared them against a healthy baseline tenant. The retrieved document IDs looked plausible. We printed the actual document text and compared it to the query. That's when I went cold.

The retrieved documents were real. They were from the right tenant. They were indexed correctly. They just had nothing to do with the query.

Query: "What is the refund policy for digital downloads?"
Retrieved (rank 1, similarity: 0.81): "Q3 2024 infrastructure cost report — AWS Reserved Instance utilisation analysis..."

A similarity score of 0.81 is high. That document is nothing like the query. Something was deeply wrong with how similarity was being computed.

The Investigation

I wrote a debug script to compare embeddings directly. Take a known query, embed it, pull the top 5 nearest neighbours from pgvector, and print the cosine similarity alongside the document text. The numbers made no sense — high similarity scores (0.75–0.89) for documents that were semantically unrelated.

Then I checked a query against documents I knew should be highly similar — same paragraph, slight rewording. Similarity: 0.31. That's noise-level for text embeddings. Something was fundamentally broken in the vector space, not just an edge case.
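For intuition about these scores: cosine similarity is a few lines of arithmetic, and within a single embedding space, near-duplicates sit near 1.0 while unrelated vectors sit near 0. A toy sketch with hand-picked vectors (not real embeddings) shows the scale:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors (a near-paraphrase in a matched space): ~1.0
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))

# Orthogonal vectors (unrelated content): 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A same-paragraph rewording scoring 0.31 on this scale is the vector-space equivalent of a smoke alarm.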

I diff'd our deployment config across the upgrade window:

config diff — Tuesday deployment
# Before
OPENAI_EMBEDDING_MODEL=text-embedding-ada-002
EMBEDDING_DIMENSIONS=1536

# After
OPENAI_EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSIONS=3072

There it was. We'd updated the model used for query embedding at runtime. We had not re-embedded the 2.3 million documents already stored in pgvector. Every stored document vector was still in ada-002 space. Every incoming query vector was now in 3-large space. We were measuring the distance between points in two completely different geometric universes and treating the result as meaningful similarity.

Root Cause

Different embedding models don't just produce vectors of different dimensions — they produce fundamentally different representations. The numbers mean different things. A vector that points toward "refund policy" in ada-002's learned space points toward something entirely different in 3-large's learned space. Cosine similarity between them is valid arithmetic on a meaningless comparison — the operation runs, the score is a real number, and it tells you nothing.

BROKEN: Mixed embedding spaces in pgvector

  ada-002 vector space (stored docs, 1536-dim)
  ┌─────────────────────────────────────────────────────┐
  │                                                     │
  │   [refund policy docs] ●●●   ● [pricing docs]       │
  │         ↑ closely clustered                         │
  │                                                     │
  │   [infra reports] ●●       ●● [HR docs]             │
  │                                                     │
  └─────────────────────────────────────────────────────┘

  3-large vector space (query embeddings, 3072-dim)
  ┌─────────────────────────────────────────────────────┐
  │                                                     │
  │   "What is the refund policy?" → [0.21, -0.83, ...] │
  │                    ↑ this vector is projected into  │
  │                      ada-002 space for comparison   │
  │                      but it means something else    │
  │                      entirely in that space         │
  └─────────────────────────────────────────────────────┘

  pgvector computes cosine similarity between them:
  cos(3-large query, ada-002 doc) = 0.81  ← high score!
  Actual semantic relevance:               near zero

  The math is correct. The interpretation is nonsense.

The failure mode is insidious because the system behaves perfectly at a structural level. Cosine similarity runs without errors. The scores are valid floats between -1 and 1. The top-k results are real documents. The LLM receives real context and generates fluent responses. There is no exception to catch. There is no anomalous latency. There is no 5xx. The product simply lies, confidently and consistently, to every user who asks it anything.
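The only way to get an error signal out of a system like this is to manufacture one. A minimal sketch of the kind of guard that would have caught the mismatch — the exception class and model tags here are illustrative, not our production code:

```python
class EmbeddingSpaceMismatch(RuntimeError):
    """Raised when query and index vectors come from different models."""

def check_embedding_space(query_model: str, index_model: str) -> None:
    """Refuse to compare vectors across embedding spaces.

    Similarity between vectors from different models is valid arithmetic
    but meaningless as a relevance score, so fail loudly instead.
    """
    if query_model != index_model:
        raise EmbeddingSpaceMismatch(
            f"query embedded with {query_model!r}, "
            f"index embedded with {index_model!r}"
        )

# Matching models: passes silently
check_embedding_space("text-embedding-3-large", "text-embedding-3-large")

# Mismatched models: loud failure instead of silent wrong answers
try:
    check_embedding_space("text-embedding-3-large", "text-embedding-ada-002")
except EmbeddingSpaceMismatch as err:
    print(f"retrieval refused: {err}")
```

A hard exception here would have turned six days of confident lies into a page within minutes of the deploy.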

We measured retrospectively: across the six days the mismatch was active, 180,000 queries were served. Of those, we estimated roughly 40% involved a query where the semantically correct document was in the index but was ranked outside the top 5 because the wrong model was used to score it. Every one of those queries got a wrong answer. For the other 60%, we got lucky — either the correct document scored high by coincidence, or the mismatch was distributed enough that the LLM could still synthesise a reasonable answer from the noise.

The Fix

The immediate fix was to roll the query embedding model back to text-embedding-ada-002. Wrong answers stopped within minutes.

The correct fix was a full re-embedding of the corpus with the new model. We wrote a migration script that processed documents in batches of 200, sending each batch as a single embeddings API request to stay within rate limits. Progress tracking was built into the rows themselves — each update stamps the embedding_model column, so if the script crashed mid-run, re-running it resumed from the last committed batch.

scripts/reembed-corpus.py — key logic
import openai
import psycopg2
from datetime import datetime

BATCH_SIZE = 200
NEW_MODEL = "text-embedding-3-large"
NEW_DIM = 3072

def reembed_all(conn, model: str, batch_size: int) -> int:
    total = 0

    while True:
        with conn.cursor() as cursor:
            # Re-select each iteration: committed batches no longer match
            # the WHERE clause, so the loop naturally resumes after a crash.
            cursor.execute("""
                SELECT id, content
                FROM documents
                WHERE embedding_model IS DISTINCT FROM %s
                ORDER BY id
                LIMIT %s
            """, (model, batch_size))
            docs = cursor.fetchall()

            if not docs:
                return total

            ids = [d[0] for d in docs]
            texts = [d[1] for d in docs]

            response = openai.embeddings.create(
                model=model,
                input=texts,
                dimensions=NEW_DIM
            )
            embeddings = [e.embedding for e in response.data]

            # Atomic update: embedding + model tag in one transaction.
            # str(embedding) renders the '[...]' text format pgvector parses.
            for doc_id, embedding in zip(ids, embeddings):
                cursor.execute("""
                    UPDATE documents
                    SET embedding = %s::vector,
                        embedding_model = %s,
                        embedded_at = %s
                    WHERE id = %s
                """, (str(embedding), model, datetime.utcnow(), doc_id))

        conn.commit()
        total += len(docs)
        print(f"Migrated {total} documents...")

Re-embedding 2.3M documents took 4 hours and cost $47 in OpenAI API calls. We also added an embedding_model column to the documents table — every row now records which model produced its vector. This made the mismatch detectable in a future audit: a query against documents WHERE embedding_model != current_model() would immediately surface any stale vectors.
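With the provenance column in place, that audit is a single query. A sketch against a psycopg2-style cursor — the documents table and embedding_model column are the ones from the migration script above:

```python
def count_stale_embeddings(cursor, current_model: str) -> int:
    """Count rows whose stored vector was produced by a different model.

    Any non-zero result means the index and the query path disagree,
    and retrieval scores cannot be trusted until re-embedding finishes.
    """
    cursor.execute(
        """
        SELECT COUNT(*) FROM documents
        WHERE embedding_model IS DISTINCT FROM %s
        """,
        (current_model,),
    )
    return cursor.fetchone()[0]
```

IS DISTINCT FROM matters here: a plain != silently skips rows where embedding_model is NULL, which are exactly the untagged legacy rows you most need to count.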

What the Deployment Should Have Looked Like

The real problem wasn't that we upgraded the model. It was that we had no guard against using a new embedding model for queries while old embeddings still lived in the database. Here's what a safe embedding model migration looks like:

SAFE: Dual-model migration strategy

  Phase 1: Add new model column, start dual-writing
  ┌───────────────────────────────────────────────────┐
  │  New documents → embed with BOTH ada-002 + 3-large │
  │  Old documents → schedule background re-embedding  │
  │  Queries → STILL use ada-002 (no switch yet)       │
  └───────────────────────────────────────────────────┘
            ↓
  Phase 2: Monitor re-embedding progress
  ┌───────────────────────────────────────────────────┐
  │  Track: docs with 3-large embedding / total docs   │
  │  Gate: must be 100% before query model switches    │
  └───────────────────────────────────────────────────┘
            ↓
  Phase 3: Switch query model atomically
  ┌───────────────────────────────────────────────────┐
  │  Config flag: QUERY_EMBEDDING_MODEL=3-large        │
  │  Validation: assert 0 docs with embedding_model    │
  │              != 'text-embedding-3-large' in index  │
  └───────────────────────────────────────────────────┘
            ↓
  Phase 4: Drop old ada-002 embedding column

The critical insight: the query model and the stored embedding model must always match. Any deployment step that changes one without changing the other is incorrect by definition. This should be enforced at the application layer, not left to human discipline.

What We Added to Prevent Recurrence

Three concrete changes, deployed within a week of the incident:

  • Startup model consistency check: On every application boot, we query SELECT COUNT(*) FROM documents WHERE embedding_model != $1 with the configured query model. If the count is non-zero, the application refuses to start and logs an error: "Embedding model mismatch: N documents indexed with old model. Run reembed-corpus.py before switching query model." Hard boot failure beats silent wrong answers.
  • Embedding version in retrieval responses: Every retrieval result now includes an embedding_model field in the internal API response. Our evaluation harness checks that query model and document embedding model match for every result. A mismatch fails the evaluation run before it reaches the LLM.
  • Semantic relevance smoke test in CI: We added a suite of 50 hand-labelled (query, expected-document-slug) pairs. On every deploy, we embed the queries and assert that the expected documents appear in the top 3 results. This test runs in under 90 seconds and would have caught this bug on the first deploy.
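The harness around that smoke test is small. A sketch — the golden pairs and the retrieve(query, k) signature are illustrative stand-ins; the real suite has 50 hand-labelled pairs:

```python
# (query, expected document slug) — hand-labelled golden pairs
GOLDEN_PAIRS = [
    ("what is the refund policy for digital downloads", "refund-policy"),
    ("how do i rotate an api key", "api-key-rotation"),
]

def relevance_smoke_test(retrieve) -> list[str]:
    """Return the queries whose expected document missed the top 3.

    `retrieve(query, k)` is assumed to return ranked document slugs.
    An empty return value means the deploy gate passes.
    """
    return [
        query
        for query, expected_slug in GOLDEN_PAIRS
        if expected_slug not in retrieve(query, k=3)
    ]
```

Wired into CI, a non-empty failure list blocks the deploy — which is exactly the behaviour that would have stopped this incident at deploy one.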

The damage, in numbers: 6 days undetected, 180k queries affected, 2.3M documents re-embedded, $47 to fix.

Lessons Learned

The lesson I keep coming back to: retrieval quality has no natural error signal. A database query that returns wrong data throws an exception, or at least returns zero rows. A vector similarity search that returns semantically wrong results returns a perfectly valid list of floats. The system is structurally healthy while functionally broken.

This means RAG pipelines need their own quality monitoring layer — one that doesn't exist in the infrastructure stack and doesn't get created by default. You have to build it. Some things that should be in every production RAG system:

  • Embedding model provenance on every row. You need to know what model produced each vector. Without it, you can't audit for staleness, you can't migrate safely, and you can't debug relevance regressions.
  • Relevance sampling in production. Log a sample of (query, retrieved-docs) pairs and run periodic relevance scoring — even a simple BM25 keyword overlap score is enough to flag gross mismatches. Ours showed retrieval quality drop from 0.87 to 0.31 relevance score within hours of the bad deploy. We just weren't looking at it.
  • Semantic regression tests gating deploys. If you wouldn't deploy a backend API change without a test that checks the response body, you shouldn't deploy an embedding model change without a test that checks retrieval quality. They're both query-response contracts. Treat them the same way.
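That crude relevance score really is a few lines. A sketch of a keyword-overlap scorer — not BM25, just enough to flag gross mismatches like the ones in this incident:

```python
def keyword_overlap(query: str, document: str) -> float:
    """Fraction of query terms that appear in the document (0.0-1.0).

    Crude, but a healthy retrieval pipeline scores well above zero on
    average; an embedding-space mismatch drags the average down fast.
    """
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

# Relevant retrieval: "refund" and "policy" overlap → 0.4
print(keyword_overlap(
    "what is the refund policy",
    "our refund policy allows returns within 14 days",
))

# Space-mismatch retrieval: no overlap at all → 0.0
print(keyword_overlap(
    "what is the refund policy",
    "q3 2024 infrastructure cost report",
))
```

Averaged over a sample of production queries, even this score would have shown the cliff the moment the bad deploy went out.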

The RAG pattern has gone from research curiosity to production infrastructure in two years. The tooling is maturing fast — pgvector, Pinecone, Weaviate all handle the storage and retrieval mechanics well. The operational discipline around managing that infrastructure — migration safety, quality monitoring, model version consistency — is still mostly learned the hard way.

We learned it the hard way. Hopefully this saves you the same tuition.

Rey, writing for Darshan Turakhia · March 2026