The RAG Pipeline That Ran Fine in Dev and Fell Apart in Prod
A field report from a RAG deployment that looked perfect in staging and then degraded quietly in production for three weeks before anyone noticed. What broke, why it was invisible, and what we changed.
The retrieval quality metrics looked fine. The latency was acceptable. Users were getting answers. It was only when a client's team started cross-referencing AI responses against their source documents that they realised the system had been confidently returning stale, partially incorrect context for 22 days.
This is a field report. Names and domain details are anonymised. The failure modes are real.
What the System Was
A standard RAG setup: document ingestion pipeline, OpenAI embeddings, pgvector for retrieval, GPT-4o for generation. The document corpus was a living knowledge base — updated by the client's ops team daily. In dev, we tested against a static snapshot. In production, the corpus was live.
Failure Mode 1: Embedding Drift on Document Updates
When a source document was updated, the ingestion pipeline re-chunked and re-embedded the new version — but did not delete the old chunk vectors. Both the old and new versions of the same document existed in the vector store simultaneously. Retrieval would sometimes return the superseded version because its embedding was marginally closer to the query vector.
The fix: every chunk carries a document ID and version hash. On ingest, all chunks with the same document ID are deleted before new chunks are inserted. Atomic upsert, not append.
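The delete-then-insert pattern can be sketched as follows. All names here are hypothetical (`chunks` table, `atomic_upsert`, the record fields); an in-memory list stands in for the pgvector table so the replace-not-append semantics are easy to see.

```python
# In production the same two steps run inside one Postgres transaction,
# roughly (illustrative schema, not the actual DDL):
#
#   BEGIN;
#   DELETE FROM chunks WHERE document_id = %s;
#   INSERT INTO chunks (document_id, version_hash, chunk_index, text, embedding)
#        VALUES (...);
#   COMMIT;

def atomic_upsert(store, document_id, version_hash, chunk_texts):
    """Replace every stored chunk for `document_id` with the new version."""
    # Delete all existing chunks for this document, whatever their version.
    store[:] = [c for c in store if c["document_id"] != document_id]
    # Insert the new version's chunks, each tagged with the version hash.
    for i, text in enumerate(chunk_texts):
        store.append({
            "document_id": document_id,
            "version_hash": version_hash,
            "chunk_index": i,
            "text": text,
        })
```

Because the delete is keyed on document ID alone, a stale chunk from a superseded version can never coexist with the current version, which is exactly the invariant the append-only pipeline violated.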
Failure Mode 2: Chunking Strategy That Broke Semantic Units
We had chunked by token count (512 tokens, 50-token overlap). This worked on well-structured documents. The production corpus included procedure documents where a critical 'exception' clause appeared three paragraphs after the rule it modified. The rule would be retrieved; the exception would not. The AI would answer as if the exception did not exist.
Chunking strategy is a domain problem, not a technical one. The right chunk boundary depends on how humans actually structure knowledge in that corpus.
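One heuristic in that spirit, sketched for a procedure corpus: never let a paragraph that opens an exception clause start a new chunk, so the exception stays attached to the text it modifies. The `Exception:` marker and the word budget (a rough stand-in for a token budget) are illustrative assumptions; the real boundaries come out of the review with the subject matter expert.

```python
def chunk_procedure_doc(text, max_words=200):
    """Greedy paragraph grouping that keeps exception clauses attached.

    A paragraph starting with "Exception" is always appended to the
    current chunk, even when the chunk is already over budget, so a rule
    and its exception cannot be split across retrieval units.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        is_exception = para.lower().startswith("exception")
        # Start a new chunk only at a safe boundary: budget exceeded AND
        # the next paragraph is not an exception clause.
        if current and count + words > max_words and not is_exception:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

This is deliberately simple; the point is that the boundary rule encodes a fact about how this corpus structures knowledge, not a token count.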
Failure Mode 3: No Retrieval Quality Monitoring
We had latency monitoring. We had error rate monitoring. We had zero monitoring of retrieval quality: the cosine similarity scores of retrieved chunks, the hit rate of top-1 versus top-5 retrieval, whether retrieved chunks were actually relevant to the query.
Now every RAG deployment we run logs the top-k similarity scores per query, tracks mean retrieval similarity over a rolling 24-hour window, and alerts when it drops more than 15% from baseline. This is the canary that catches embedding drift, corpus degradation, and chunking problems before users notice.
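A minimal version of that canary can be sketched as below. `RetrievalMonitor`, the baseline value, and the thresholds are assumptions for illustration, not the production observability code; in practice the alert feeds whatever paging system is already in place.

```python
import time
from collections import deque

class RetrievalMonitor:
    """Rolling-window mean of top-k similarity scores with a drop alert.

    Alerts when the rolling mean falls more than `threshold` (default
    15%) below a fixed baseline established at deployment time.
    """

    def __init__(self, baseline, window_seconds=24 * 3600, threshold=0.15):
        self.baseline = baseline
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.samples = deque()  # (timestamp, mean top-k similarity) pairs

    def record(self, topk_scores, now=None):
        """Log one query's top-k similarity scores and evict old samples."""
        now = time.time() if now is None else now
        self.samples.append((now, sum(topk_scores) / len(topk_scores)))
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()

    def should_alert(self):
        """True when the rolling mean has dropped past the threshold."""
        if not self.samples:
            return False
        rolling_mean = sum(s for _, s in self.samples) / len(self.samples)
        return rolling_mean < self.baseline * (1 - self.threshold)
```

The key design choice is that the signal is cheap and per-query: it catches silent degradation (stale chunks, drifting corpus, bad re-chunking) without needing ground-truth relevance labels.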
What We Ship Now as Standard
Atomic document upsert — delete old chunks, insert new, in a transaction
Domain-aware chunking — reviewed with a subject matter expert before go-live
Retrieval quality metrics in the observability stack from day one
A weekly offline evaluation run against a golden query set with known expected chunks
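The weekly golden-set run can be as simple as a recall computation over queries with known expected chunks. `evaluate_golden_set` and the `retrieve(query, k)` signature are illustrative names, not the actual harness; the useful part is that the score is comparable week over week.

```python
def evaluate_golden_set(golden_set, retrieve, k=5):
    """Offline recall@k over a golden query set.

    `golden_set` maps each query to the set of chunk IDs it is expected
    to surface; `retrieve(query, k)` returns the ranked chunk IDs from
    the live retrieval path. Returns the fraction of expected chunks
    found in the top k across all queries.
    """
    hits = total = 0
    for query, expected_ids in golden_set.items():
        retrieved = set(retrieve(query, k))
        hits += len(retrieved & expected_ids)
        total += len(expected_ids)
    return hits / total if total else 0.0
```

Run weekly against the production index, the score drops the moment an ingest bug, chunking change, or embedding model update hurts retrieval, even if no user has complained yet.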
RAG is not hard to build. It is hard to keep working. Treat it like infrastructure — monitor it like infrastructure.