We Upgraded Our Embedding Model and Our RAG Pipeline Returned Wrong Results for 6 Days
March 15, 2026 · AI · 10 min read

We upgraded our RAG pipeline's embedding model from text-embedding-ada-002 to text-embedding-3-large on a Tuesday afternoon. It was supposed to be a quality improvement — better semantic understanding, higher benchmark scores. For six days, it silently returned wrong answers to 180,000 user queries. Every response had HTTP 200. Every response contained real content from our corpus. Every response was wrong.

Production Failure

We run a B2B knowledge-base assistant — companies ingest their internal documentation and their support agents use it to answer customer questions in real time. Retrieval accuracy is the entire product. If the retrieved context is wrong, the LLM generates a confident, fluent, completely fabricated answer. The support agent sends it. The customer acts on it.

The first complaint arrived on day two: "Your assistant told our agent that the return window is 30 days. It's 14. We've already processed three refunds we shouldn't have." We assumed it was a hallucination issue — LLMs invent things sometimes, that's a known risk. We logged it, bumped the temperature down, and kept going.

Day four, three more tickets. One client flagged that their assistant was quoting pricing from a superseded document — a file they'd explicitly deleted six months ago. That shouldn't be possible. Deleted documents were removed from the index at deletion time. We checked. The document wasn't in the index. The assistant had quoted it anyway.

That's when I started digging.

False Assumptions

Our first assumption: the LLM was hallucinating. We increased the top_k retrieval count from 5 to 10 to give it more context, on the theory that better coverage would reduce fabrication. The wrong answers got worse.

Our second assumption: the vector index had corrupted state. We ran a full integrity check on our pgvector tables — row counts matched document counts, embeddings were non-null, IDs were consistent. Nothing broken.

Our third assumption: the problem was tenant-specific. We pulled retrieval logs for the complaining clients and compared them against a healthy baseline tenant. The retrieved document IDs looked plausible. We printed the actual document text and compared it to the query. That's when I went cold.

The retrieved documents were real. They were from the right tenant. They were indexed correctly. They just had nothing to do with the query.

Query: "What is the refund policy for digital downloads?"
Retrieved (rank 1, similarity: 0.81): "Q3 2024 infrastructure cost report — AWS Reserved Instance utilisation analysis..."

A similarity score of 0.81 is high. That document is nothing like the query. Something was deeply wrong with how similarity was being computed.

The Investigation

I wrote a debug script to compare embeddings directly. Take a known query, embed it, pull the top 5 nearest neighbours from pgvector, and print the cosine similarity alongside the document text. The numbers made no sense — high similarity scores (0.75–0.89) for documents that were semantically unrelated.

Then I checked a query against documents I knew should be highly similar — same paragraph, slight rewording. Similarity: 0.31. That's noise-level for text embeddings. Something was fundamentally broken in the vector space, not just an edge case.
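For intuition about these scores: cosine similarity is a few lines of arithmetic, and within a single embedding space, near-duplicates sit near 1.0 while unrelated vectors sit near 0. A toy sketch with hand-picked vectors (not real embeddings) shows the scale:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors (a near-paraphrase in a matched space): ~1.0
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))

# Orthogonal vectors (unrelated content): 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A same-paragraph rewording scoring 0.31 on this scale is the vector-space equivalent of a smoke alarm.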

I diff'd our deployment config across the upgrade window:

config diff — Tuesday deployment
# Before
OPENAI_EMBEDDING_MODEL=text-embedding-ada-002
EMBEDDING_DIMENSIONS=1536

# After
OPENAI_EMBEDDING_MODEL=text-embedding-3-large
EMBEDDING_DIMENSIONS=3072

There it was. We'd updated the model used for query embedding at runtime. We had not re-embedded the 2.3 million documents already stored in pgvector. Every stored document vector was still in ada-002 space. Every incoming query vector was now in 3-large space. We were measuring the distance between points in two completely different geometric universes and treating the result as meaningful similarity.

Root Cause

Different embedding models don't just produce vectors of different dimensions — they produce fundamentally different representations. The numbers mean different things. A vector that points toward "refund policy" in ada-002's learned space points toward something entirely different in 3-large's learned space. Cosine similarity between them is valid arithmetic on a meaningless comparison — the operation runs, the score is a real number, and it tells you nothing.

BROKEN: Mixed embedding spaces in pgvector

  ada-002 vector space (stored docs, 1536-dim)
  ┌─────────────────────────────────────────────────────┐
  │                                                     │
  │   [refund policy docs] ●●●   ● [pricing docs]       │
  │         ↑ closely clustered                         │
  │                                                     │
  │   [infra reports] ●●       ●● [HR docs]             │
  │                                                     │
  └─────────────────────────────────────────────────────┘

  3-large vector space (query embeddings, 3072-dim)
  ┌─────────────────────────────────────────────────────┐
  │                                                     │
  │   "What is the refund policy?" → [0.21, -0.83, ...] │
  │                    ↑ this vector is projected into  │
  │                      ada-002 space for comparison   │
  │                      but it means something else    │
  │                      entirely in that space         │
  └─────────────────────────────────────────────────────┘

  pgvector computes cosine similarity between them:
  cos(3-large query, ada-002 doc) = 0.81  ← high score!
  Actual semantic relevance:               near zero

  The math is correct. The interpretation is nonsense.

The failure mode is insidious because the system behaves perfectly at a structural level. Cosine similarity runs without errors. The scores are valid floats between -1 and 1. The top-k results are real documents. The LLM receives real context and generates fluent responses. There is no exception to catch. There is no anomalous latency. There is no 5xx. The product simply lies, confidently and consistently, to every user who asks it anything.
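The only way to get an error signal out of a system like this is to manufacture one. A minimal sketch of the kind of guard that would have caught the mismatch — the exception class and model tags here are illustrative, not our production code:

```python
class EmbeddingSpaceMismatch(RuntimeError):
    """Raised when query and index vectors come from different models."""

def check_embedding_space(query_model: str, index_model: str) -> None:
    """Refuse to compare vectors across embedding spaces.

    Similarity between vectors from different models is valid arithmetic
    but meaningless as a relevance score, so fail loudly instead.
    """
    if query_model != index_model:
        raise EmbeddingSpaceMismatch(
            f"query embedded with {query_model!r}, "
            f"index embedded with {index_model!r}"
        )

# Matching models: passes silently
check_embedding_space("text-embedding-3-large", "text-embedding-3-large")

# Mismatched models: loud failure instead of silent wrong answers
try:
    check_embedding_space("text-embedding-3-large", "text-embedding-ada-002")
except EmbeddingSpaceMismatch as err:
    print(f"retrieval refused: {err}")
```

A hard exception here would have turned six days of confident lies into a page within minutes of the deploy.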

We measured retrospectively: across the six days the mismatch was active, 180,000 queries were served. Of those, we estimated roughly 40% involved a query where the semantically correct document was in the index but was ranked outside the top 5 because the wrong model was used to score it. Every one of those queries got a wrong answer. For the other 60%, we got lucky — either the correct document scored high by coincidence, or the mismatch was distributed enough that the LLM could still synthesise a reasonable answer from the noise.

The Fix

The immediate fix was to roll the query embedding model back to text-embedding-ada-002. Wrong answers stopped within minutes.

The correct fix was a full re-embedding of the corpus with the new model. We wrote a migration script that processed documents in batches of 200, sending each batch as a single embeddings API request to stay within rate limits. Progress tracking was built into the rows themselves — each update stamps the embedding_model column, so if the script crashed mid-run, re-running it resumed from the last committed batch.

scripts/reembed-corpus.py — key logic
import openai
import psycopg2
from datetime import datetime

BATCH_SIZE = 200
NEW_MODEL = "text-embedding-3-large"
NEW_DIM = 3072

def reembed_all(conn, model: str, batch_size: int) -> int:
    total = 0

    while True:
        with conn.cursor() as cursor:
            # Re-select each iteration: committed batches no longer match
            # the WHERE clause, so the loop naturally resumes after a crash.
            cursor.execute("""
                SELECT id, content
                FROM documents
                WHERE embedding_model IS DISTINCT FROM %s
                ORDER BY id
                LIMIT %s
            """, (model, batch_size))
            docs = cursor.fetchall()

            if not docs:
                return total

            ids = [d[0] for d in docs]
            texts = [d[1] for d in docs]

            response = openai.embeddings.create(
                model=model,
                input=texts,
                dimensions=NEW_DIM
            )
            embeddings = [e.embedding for e in response.data]

            # Atomic update: embedding + model tag in one transaction.
            # str(embedding) renders the '[...]' text format pgvector parses.
            for doc_id, embedding in zip(ids, embeddings):
                cursor.execute("""
                    UPDATE documents
                    SET embedding = %s::vector,
                        embedding_model = %s,
                        embedded_at = %s
                    WHERE id = %s
                """, (str(embedding), model, datetime.utcnow(), doc_id))

        conn.commit()
        total += len(docs)
        print(f"Migrated {total} documents...")

Re-embedding 2.3M documents took 4 hours and cost $47 in OpenAI API calls. We also added an embedding_model column to the documents table — every row now records which model produced its vector. This made the mismatch detectable in a future audit: a query against documents WHERE embedding_model != current_model() would immediately surface any stale vectors.
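With the provenance column in place, that audit is a single query. A sketch against a psycopg2-style cursor — the documents table and embedding_model column are the ones from the migration script above:

```python
def count_stale_embeddings(cursor, current_model: str) -> int:
    """Count rows whose stored vector was produced by a different model.

    Any non-zero result means the index and the query path disagree,
    and retrieval scores cannot be trusted until re-embedding finishes.
    """
    cursor.execute(
        """
        SELECT COUNT(*) FROM documents
        WHERE embedding_model IS DISTINCT FROM %s
        """,
        (current_model,),
    )
    return cursor.fetchone()[0]
```

IS DISTINCT FROM matters here: a plain != silently skips rows where embedding_model is NULL, which are exactly the untagged legacy rows you most need to count.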

What the Deployment Should Have Looked Like

The real problem wasn't that we upgraded the model. It was that we had no guard against using a new embedding model for queries while old embeddings still lived in the database. Here's what a safe embedding model migration looks like:

SAFE: Dual-model migration strategy

  Phase 1: Add new model column, start dual-writing
  ┌───────────────────────────────────────────────────┐
  │  New documents → embed with BOTH ada-002 + 3-large │
  │  Old documents → schedule background re-embedding  │
  │  Queries → STILL use ada-002 (no switch yet)       │
  └───────────────────────────────────────────────────┘
            ↓
  Phase 2: Monitor re-embedding progress
  ┌───────────────────────────────────────────────────┐
  │  Track: docs with 3-large embedding / total docs   │
  │  Gate: must be 100% before query model switches    │
  └───────────────────────────────────────────────────┘
            ↓
  Phase 3: Switch query model atomically
  ┌───────────────────────────────────────────────────┐
  │  Config flag: QUERY_EMBEDDING_MODEL=3-large        │
  │  Validation: assert 0 docs with embedding_model    │
  │              != 'text-embedding-3-large' in index  │
  └───────────────────────────────────────────────────┘
            ↓
  Phase 4: Drop old ada-002 embedding column

The critical insight: the query model and the stored embedding model must always match. Any deployment step that changes one without changing the other is incorrect by definition. This should be enforced at the application layer, not left to human discipline.

What We Added to Prevent Recurrence

Three concrete changes, deployed within a week of the incident:

  • Startup model consistency check: On every application boot, we query SELECT COUNT(*) FROM documents WHERE embedding_model != $1 with the configured query model. If the count is non-zero, the application refuses to start and logs an error: "Embedding model mismatch: N documents indexed with old model. Run reembed-corpus.py before switching query model." Hard boot failure beats silent wrong answers.
  • Embedding version in retrieval responses: Every retrieval result now includes an embedding_model field in the internal API response. Our evaluation harness checks that query model and document embedding model match for every result. A mismatch fails the evaluation run before it reaches the LLM.
  • Semantic relevance smoke test in CI: We added a suite of 50 hand-labelled (query, expected-document-slug) pairs. On every deploy, we embed the queries and assert that the expected documents appear in the top 3 results. This test runs in under 90 seconds and would have caught this bug on the first deploy.
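The harness around that smoke test is small. A sketch — the golden pairs and the retrieve(query, k) signature are illustrative stand-ins; the real suite has 50 hand-labelled pairs:

```python
# (query, expected document slug) — hand-labelled golden pairs
GOLDEN_PAIRS = [
    ("what is the refund policy for digital downloads", "refund-policy"),
    ("how do i rotate an api key", "api-key-rotation"),
]

def relevance_smoke_test(retrieve) -> list[str]:
    """Return the queries whose expected document missed the top 3.

    `retrieve(query, k)` is assumed to return ranked document slugs.
    An empty return value means the deploy gate passes.
    """
    return [
        query
        for query, expected_slug in GOLDEN_PAIRS
        if expected_slug not in retrieve(query, k=3)
    ]
```

Wired into CI, a non-empty failure list blocks the deploy — which is exactly the behaviour that would have stopped this incident at deploy one.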

The damage, in numbers: 6 days undetected, 180k queries affected, 2.3M documents re-embedded, $47 to fix.

Lessons Learned

The lesson I keep coming back to: retrieval quality has no natural error signal. A database query that returns wrong data throws an exception, or at least returns zero rows. A vector similarity search that returns semantically wrong results returns a perfectly valid list of floats. The system is structurally healthy while functionally broken.

This means RAG pipelines need their own quality monitoring layer — one that doesn't exist in the infrastructure stack and doesn't get created by default. You have to build it. Some things that should be in every production RAG system:

  • Embedding model provenance on every row. You need to know what model produced each vector. Without it, you can't audit for staleness, you can't migrate safely, and you can't debug relevance regressions.
  • Relevance sampling in production. Log a sample of (query, retrieved-docs) pairs and run periodic relevance scoring — even a simple BM25 keyword overlap score is enough to flag gross mismatches. Ours showed retrieval quality drop from 0.87 to 0.31 relevance score within hours of the bad deploy. We just weren't looking at it.
  • Semantic regression tests gating deploys. If you wouldn't deploy a backend API change without a test that checks the response body, you shouldn't deploy an embedding model change without a test that checks retrieval quality. They're both query-response contracts. Treat them the same way.
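That crude relevance score really is a few lines. A sketch of a keyword-overlap scorer — not BM25, just enough to flag gross mismatches like the ones in this incident:

```python
def keyword_overlap(query: str, document: str) -> float:
    """Fraction of query terms that appear in the document (0.0-1.0).

    Crude, but a healthy retrieval pipeline scores well above zero on
    average; an embedding-space mismatch drags the average down fast.
    """
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

# Relevant retrieval: "refund" and "policy" overlap → 0.4
print(keyword_overlap(
    "what is the refund policy",
    "our refund policy allows returns within 14 days",
))

# Space-mismatch retrieval: no overlap at all → 0.0
print(keyword_overlap(
    "what is the refund policy",
    "q3 2024 infrastructure cost report",
))
```

Averaged over a sample of production queries, even this score would have shown the cliff the moment the bad deploy went out.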

The RAG pattern has gone from research curiosity to production infrastructure in two years. The tooling is maturing fast — pgvector, Pinecone, Weaviate all handle the storage and retrieval mechanics well. The operational discipline around managing that infrastructure — migration safety, quality monitoring, model version consistency — is still mostly learned the hard way.

We learned it the hard way. Hopefully this saves you the same tuition.

Rey, writing for Darshan Turakhia · March 2026