RAG Systems in Production: What Works and What Doesn't
What Is RAG and Why Does It Matter?
Retrieval-Augmented Generation (RAG) is an architecture that lets you query a large language model with context from your own documents. Instead of relying on a model's training data, you retrieve relevant passages from your knowledge base at query time and include them in the prompt. The model answers based on what you retrieved, not on what it was trained on.
This solves two major LLM limitations: knowledge cutoffs (the model doesn't know about recent documents) and hallucination on specific facts (the model makes things up when it doesn't know something specific). With RAG, the model is grounded in your actual documents.
The use cases are extensive: customer support over product documentation, internal knowledge base Q&A, contract analysis, research synthesis, code documentation querying. Any situation where you need accurate, up-to-date, document-grounded answers.
The Three Core Components
1. Document ingestion and chunking. You take your source documents (PDFs, markdown files, web pages, database records) and break them into chunks. The right chunk size depends on your documents—typically 200-500 tokens per chunk, with some overlap to preserve context at boundaries. Each chunk is stored with metadata: source document, section, date, etc.
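The mechanics can be sketched in a few lines. This is a minimal illustration, not a production chunker: it approximates tokens with whitespace words (a real pipeline would use the model's actual tokenizer), and the `start_token` field stands in for the richer metadata described above.

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[dict]:
    """Split text into overlapping fixed-size chunks with provenance metadata.

    Whitespace tokens approximate model tokens here; a real system would
    count tokens with the embedding model's own tokenizer.
    """
    tokens = text.split()
    step = chunk_size - overlap  # each chunk starts `step` tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "start_token": start,  # provenance: lets citations point back into the source
        })
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

The overlap means the last 50 tokens of one chunk reappear at the start of the next, so a sentence split at a boundary is still intact in at least one chunk.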
2. Embedding and vector storage. Each chunk is converted to an embedding—a numerical representation that captures semantic meaning—using an embedding model. These embeddings are stored in a vector database (Chroma, Pinecone, Weaviate, pgvector). The vector DB enables similarity search: given a query, find the chunks whose embeddings are most similar.
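To make the similarity-search mechanics concrete, here is a toy version with no external dependencies. A bag-of-words `Counter` stands in for a learned embedding, and a linear scan stands in for the vector database's index; real systems use an embedding model and an approximate-nearest-neighbor index, but the ranking logic is the same shape.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use a model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: str, store: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose vectors are most similar to the query."""
    q = embed(query)
    return sorted(store, key=lambda c: cosine(q, c["vec"]), reverse=True)[:k]
```

A vector DB replaces the `sorted` scan with an index so retrieval stays fast at millions of chunks, but conceptually it is doing exactly this.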
3. Retrieval and generation. At query time: embed the user's query, retrieve the K most similar chunks from the vector DB, include those chunks in the LLM prompt, ask the model to answer based on the retrieved context.
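The final prompt-assembly step might look like the sketch below. The instruction wording and numbered-source format are illustrative choices, not a fixed convention; numbering the chunks is what later makes citations possible.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble an LLM prompt from retrieved chunks, numbered for citation."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk['text']}" for i, chunk in enumerate(chunks)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The "say so" instruction matters: without an explicit out, models tend to answer from training data when retrieval misses, which defeats the grounding that RAG exists to provide.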
Where Production RAG Systems Break Down
Most RAG tutorials show the happy path—clean documents, straightforward questions, matching chunks. Production systems deal with everything else.
Chunking mismatch. The right chunk size varies dramatically by document type. A 400-token chunk from a legal contract might split mid-clause, losing the context that makes it meaningful. A 400-token chunk from a FAQ might contain multiple unrelated Q&A pairs. There's no universal answer—you need to experiment with your actual documents.
Retrieval quality. Semantic similarity doesn't always match query relevance. A question about "cancellation policy" might retrieve chunks about "account cancellation" and "subscription management" instead of the "order cancellation" section you actually need. Hybrid search (combining semantic similarity with keyword search) typically outperforms pure semantic retrieval for domain-specific queries.
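One common way to combine the two retrievers is reciprocal rank fusion (RRF): run semantic and keyword search separately, then merge their ranked result lists by rank position rather than by raw scores (which live on incompatible scales). A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of doc IDs into one, by reciprocal rank.

    A document ranked highly in any list gets a large contribution; k=60 is
    the conventional damping constant from the RRF literature.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Because RRF only looks at positions, it sidesteps the problem that a cosine similarity of 0.8 and a BM25 score of 12.3 cannot be meaningfully added together.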
Context window limits. Retrieving too many chunks can overflow your context window. Retrieving too few misses relevant information. Re-ranking retrieved chunks and summarizing less-relevant ones before inclusion is a production pattern that helps.
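The budget side of this is a simple greedy pack: given chunks already sorted by relevance, keep taking them until the token budget is exhausted. This sketch uses a whitespace word count as a rough token estimate; a real implementation would count with the target model's tokenizer.

```python
def pack_context(chunks: list[dict], budget: int = 3000) -> list[dict]:
    """Greedily fill a token budget with chunks, assumed sorted by relevance.

    Word count is a rough stand-in for a real token count.
    """
    packed, used = [], 0
    for chunk in chunks:
        cost = len(chunk["text"].split())
        if used + cost > budget:
            continue  # skip chunks that don't fit; a later, smaller one might
        packed.append(chunk)
        used += cost
    return packed
```

Skipping rather than stopping at the first overflow lets a small highly-ranked chunk still make it in after a large one was rejected, at the cost of occasionally reordering relevance slightly.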
Document freshness. When source documents update, your vector database is stale until you re-index. For frequently changing content, real-time or near-real-time ingestion pipelines become necessary. Most production RAG systems have a freshness problem they don't fully solve.
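A common first step toward solving it is incremental re-indexing driven by content hashes: store a hash of each document alongside its chunks, and at ingestion time re-embed only documents whose hash changed. A sketch of the change-detection piece:

```python
import hashlib

def changed_docs(docs: dict[str, str], index_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or modified since last indexing.

    `docs` maps doc ID -> current text; `index_hashes` maps doc ID -> the
    SHA-256 hex digest recorded when the document was last embedded.
    """
    stale = []
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if index_hashes.get(doc_id) != digest:
            stale.append(doc_id)  # new doc, or content changed since indexing
    return stale
```

This avoids re-embedding an entire corpus on every sync, which matters once embedding cost and indexing latency become real numbers.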
Citation accuracy. Users often ask "where did this come from?" Providing accurate citations requires tracking chunk provenance carefully and returning source references with each answer. The architecture is straightforward, but it is easy to get wrong in practice: if any stage of the pipeline drops or mangles chunk metadata, the citations become guesses.
Production Architecture Patterns
Pipeline architecture. Separate ingestion (document → chunk → embed → store) from serving (query → retrieve → generate). This lets you update the knowledge base without affecting serving, and scale each component independently.
Metadata filtering. Vector search alone can return irrelevant results from other departments, outdated documents, or wrong languages. Filter by metadata before or during retrieval: only search documents tagged "active", only search the user's team's documents, only search English-language content.
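The filtering step itself is simple; the work is in tagging documents consistently at ingestion. A minimal pre-filter over in-memory chunks (vector DBs like Chroma, Pinecone, and pgvector expose equivalent filter parameters on their query APIs):

```python
def metadata_filter(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata matches every key/value in `filters`.

    Run before (or alongside) similarity search so irrelevant partitions
    of the corpus never compete for the top-K slots.
    """
    return [
        chunk for chunk in chunks
        if all(chunk.get("meta", {}).get(key) == value
               for key, value in filters.items())
    ]
```

Filtering before retrieval, rather than after, matters: if stale or wrong-team chunks are allowed into the candidate set, they can crowd relevant chunks out of the top K before the filter ever runs.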
Query expansion. Short user queries often miss relevant documents. Expand the query before retrieval: use an LLM to generate synonyms, related terms, or multiple phrasings of the question. Retrieve for all variants, deduplicate, and select the best.
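The retrieve-merge-deduplicate loop can be sketched as below. Here a hand-built synonym map stands in for the LLM-generated variants described above, and `retrieve` is any function returning chunk IDs for a query; both are illustrative placeholders.

```python
from typing import Callable

def expand_and_retrieve(
    query: str,
    retrieve: Callable[[str], list[str]],
    synonyms: dict[str, list[str]],
) -> list[str]:
    """Retrieve for the query plus simple synonym variants, deduplicating results.

    `synonyms` maps a word to substitutes; a production system would have an
    LLM generate the variant phrasings instead.
    """
    variants = [query] + [
        query.replace(word, sub)
        for word, subs in synonyms.items() if word in query
        for sub in subs
    ]
    seen: set[str] = set()
    merged: list[str] = []
    for variant in variants:
        for chunk_id in retrieve(variant):
            if chunk_id not in seen:  # keep first occurrence; order favors the original query
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```

Ordering the original query first means its hits rank ahead of variant hits, which is usually the right bias.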
Evaluation at every layer. Build evaluation datasets with representative questions and expected answers. Measure retrieval precision (are the right chunks retrieved?) and generation accuracy (are answers correct?) separately. Running evals before and after each change tells you whether it actually helped.
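The retrieval side of this reduces to standard ranking metrics. Precision@K and recall@K against a labeled set of relevant chunk IDs per question:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)
```

Measuring these separately from answer accuracy is the point: if precision@5 is low, no amount of prompt engineering on the generation side will fix the system.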
Human review for edge cases. For high-stakes domains (medical, legal, financial), add a human review step for low-confidence answers. Track confidence signals (retrieval score thresholds, model self-reported uncertainty) to route to review.
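The routing decision can be as simple as a threshold check. The threshold value and the signal names here are placeholders; real thresholds should be calibrated against your eval set.

```python
def route_answer(
    answer: str,
    retrieval_score: float,
    threshold: float = 0.75,  # illustrative; calibrate against your eval data
) -> tuple[str, str]:
    """Route an answer to auto-delivery or human review based on confidence.

    Returns (destination, answer), where destination is 'auto' or 'human_review'.
    """
    if retrieval_score < threshold:
        return ("human_review", answer)  # low retrieval confidence: a human checks first
    return ("auto", answer)
```

In practice you would combine several signals (top-chunk similarity, score gap between chunks, model self-reported uncertainty) rather than a single score, but the routing structure stays the same.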
Tools and Frameworks
LangChain and LlamaIndex are the two dominant orchestration frameworks. Both handle chunking, embedding, retrieval, and generation. LangChain is broader but more complex; LlamaIndex is more focused on RAG and often easier for document-heavy use cases.
For vector databases: Chroma for local/development, pgvector if you're already on Postgres, Pinecone or Weaviate for managed production. The choice matters less than indexing quality and retrieval tuning.
The Honest Assessment
RAG works. For many document-grounded Q&A use cases, it's the right architecture. But production RAG is not a solved problem. Retrieval quality, chunk design, document freshness, and evaluation are all ongoing engineering challenges. Teams that treat RAG as a plug-and-play system discover its limitations quickly. Teams that invest in evaluation, iteration, and the architecture details above build systems that reliably deliver value.