The gap between a RAG proof-of-concept and a production system that actually works is wider than most teams expect.
In my experience, the failures tend to cluster around the same handful of mistakes.
The retrieval unit is wrong
Fixed-size chunking is the default because it’s easy — split every 500 tokens, done. But documents don’t respect token boundaries. A chunk that cuts a table in half, or separates a heading from its body, produces retrievals that are syntactically present but semantically useless.
Semantic chunking — splitting on meaningful boundaries like paragraphs, sections, or logical units — takes more effort upfront but dramatically improves retrieval quality.
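As a rough illustration, here is a minimal sketch of structure-aware chunking in Python. It assumes plain text with blank lines between paragraphs. The names `looks_like_heading` and `chunk_by_structure`, the `max_chars` budget, and the heading heuristic are all made up for the example and deliberately crude.

```python
import re


def looks_like_heading(paragraph: str) -> bool:
    # Assumed heuristic: a short single line with no closing punctuation.
    return (
        "\n" not in paragraph
        and len(paragraph) < 80
        and not paragraph.endswith((".", "?", "!", ":"))
    )


def chunk_by_structure(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks, never separating a heading from its body."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Step 1: glue each heading to the paragraph that follows it.
    units: list[str] = []
    pending_headings: list[str] = []
    for para in paragraphs:
        if looks_like_heading(para):
            pending_headings.append(para)
        else:
            units.append("\n\n".join(pending_headings + [para]))
            pending_headings = []
    if pending_headings:  # trailing heading with no body
        units.append("\n\n".join(pending_headings))

    # Step 2: pack units into chunks up to the (approximate) size budget.
    # A single unit larger than max_chars becomes its own oversized chunk.
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for unit in units:
        if current and size + len(unit) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(unit)
        size += len(unit)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Real documents need more than this (tables, lists, nested sections), but the principle carries over: decide what a meaningful unit is first, then apply the size budget to whole units rather than to raw tokens.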
Nobody evaluated retrieval separately from generation
Teams often evaluate the end-to-end system (“does the final answer look right?”) without ever measuring whether the right context was retrieved in the first place. This makes debugging almost impossible.
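Measure retrieval relevance (did the retriever return the right context?) and generation faithfulness (did the answer stay grounded in that context?) as independent signals. A bad answer caused by bad retrieval requires a completely different fix than a bad answer caused by bad generation.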
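To make the retrieval half concrete, here is a small sketch of two standard retrieval metrics, recall@k and reciprocal rank, computed without touching the generator at all. It assumes you have a labelled set mapping each query to the IDs of the chunks that should come back. The function names and the ID format are illustrative.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```

If these numbers are poor, no amount of prompt tuning will rescue the answer. If they are good and the answer is still wrong, the problem is on the generation side.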
The embedding model doesn’t match the domain
General-purpose embedding models work well for general-purpose text. Legal documents, medical literature, and internal technical documentation each have vocabulary and structure that can undermine off-the-shelf embeddings. Fine-tuning an embedding model — or at minimum evaluating domain-specific alternatives — is often the highest-leverage improvement available.
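One way to evaluate alternatives is to run each candidate model over the same small labelled set from your own domain and compare hit rates. The sketch below assumes each model is wrapped as a plain `embed(text) -> list of floats` callable. That wrapper, the `hit_rate_at_k` name, and the brute-force cosine search are assumptions made for illustration, not any particular library's API.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hit_rate_at_k(
    embed: Embedder,
    corpus: dict[str, str],                    # chunk_id -> chunk text
    labelled_queries: list[tuple[str, str]],   # (query, expected chunk_id)
    k: int = 3,
) -> float:
    """Share of queries whose expected chunk lands in the top-k by cosine similarity."""
    doc_vectors = {chunk_id: embed(text) for chunk_id, text in corpus.items()}
    hits = 0
    for query, expected_id in labelled_queries:
        query_vector = embed(query)
        # Brute-force ranking is fine for a small evaluation set.
        ranked = sorted(
            doc_vectors,
            key=lambda chunk_id: cosine(query_vector, doc_vectors[chunk_id]),
            reverse=True,
        )
        hits += expected_id in ranked[:k]
    return hits / len(labelled_queries)
```

Calling `hit_rate_at_k` once per candidate model with the same `corpus` and `labelled_queries` gives a like-for-like comparison. Even a few dozen labelled queries is often enough to see whether a domain-specific model is worth the switch.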
No reranking
Approximate nearest-neighbour search is fast but imprecise. Adding a reranker as a second-pass filter, which runs a more expensive relevance model over only the top-k candidates, typically improves precision without blowing up latency budgets, because the expensive model scores k candidates rather than the whole corpus.
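The two-stage shape is simple enough to sketch. In the snippet below, `score_pair` stands in for whatever relevance model you use (typically a cross-encoder), and `token_overlap` is a toy scorer I made up so the example runs without any dependencies. Neither is a real library call.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-score first-stage candidates with a slower model and keep the best top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def token_overlap(query: str, text: str) -> float:
    # Toy stand-in for a real relevance model: Jaccard overlap of lowercase tokens.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0
```

The cost is bounded by the number of candidates you pass in, so the usual tuning knob is how many first-stage results to rerank. More candidates give the reranker a better chance of finding the right one, at the price of extra latency per query.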
Evaluation was a one-time event
RAG systems degrade silently. The underlying knowledge base changes, the query distribution shifts, and model providers update their embedding models. Without continuous evaluation in production, you won’t know until a user complains.
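The simplest version of continuous evaluation is a scheduled job that re-runs a fixed golden set of queries and compares the scores against a stored baseline. Here is a minimal sketch. The `detect_regression` name, the per-query score lists, and the 0.05 tolerance are assumptions for illustration, and the right threshold depends on how noisy your metric is.

```python
from statistics import mean


def detect_regression(
    baseline_scores: list[float],
    current_scores: list[float],
    tolerance: float = 0.05,
) -> dict:
    """Flag a regression when the mean score drops by more than `tolerance`."""
    baseline = mean(baseline_scores)
    current = mean(current_scores)
    return {
        "baseline": baseline,
        "current": current,
        "regressed": baseline - current > tolerance,
    }
```

Wire the `regressed` flag into whatever alerting you already have. The scores can be any per-query metric you trust, such as recall@k for retrieval or a faithfulness score for generation, as long as the baseline and the current run use the same golden set.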

