RAG Pipeline Architecture: What Developers Need to Know
What Is RAG and Why Does It Matter?
RAG stands for Retrieval-Augmented Generation. It's a technique for giving AI models access to external information without retraining them. Instead of relying solely on a model's training data (which may be outdated or incomplete), you feed the model current information when you ask a question.
The Problem RAG Solves
LLMs have a training cutoff: they do not reliably know about events, APIs, or internal documents that appeared after that date. Ask about very recent releases or private data and the model may guess. Your company's internal documentation was never in the training set. A library that shipped last week may be missing or wrong in the model's answers.
RAG solves this: fetch relevant information, then pass it to the model along with the question. The model answers using current, relevant data.
The Result
You get a chatbot that knows your company's docs, a code assistant that understands your codebase, a search engine that finds answers in your knowledge base. All powered by a single LLM, without retraining.
The Five Components of a RAG Pipeline
1. Document Ingestion
You start with sources of truth: PDFs, web pages, code repositories, databases, markdown files, anything with information.
What happens: Documents are loaded from their source. A PDF file is read. A GitHub repo is cloned. A database is queried.
Implementation: Loaders for different formats.
- PDF loaders (PyPDF, pdfplumber)
- Web scrapers (Beautiful Soup, Selenium)
- Code loaders (read from GitHub API or local filesystem)
- Database readers (SQL queries)
- Office docs (python-docx for Word, openpyxl for Excel)
Example:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
# Now you have a list of Document objects, each with
# page_content (the text) and metadata (e.g. {"source": "..."})
Result: Raw documents in memory, ready for processing.
2. Chunking
Documents are large: a 50-page PDF, thousands of lines of code. You can't send a whole document to an LLM every time someone asks a question. So you split documents into smaller pieces called chunks.
What happens: A document is split into overlapping pieces. A paragraph becomes its own chunk. Code is split by function or class. The overlap helps preserve context.
Why overlap matters: if relevant text falls on a chunk boundary, overlap means at least one chunk still contains it in full.
Implementation: Text splitters with configurable chunk size and overlap.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters per chunk
chunk_overlap=200 # Overlap for context
)
chunks = splitter.split_documents(documents)
Result: A list of chunks of at most 1000 characters each, with metadata about their source.
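The mechanics of overlap can be sketched in a few lines. This is a naive fixed-size splitter for illustration; real splitters like RecursiveCharacterTextSplitter also try to break on paragraph and sentence boundaries:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-size splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```

Note how the last 200 characters of each chunk reappear at the start of the next one.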
3. Embedding
Chunks are text. Vector databases work with numbers. Embedding converts text to a vector (list of numbers) that captures meaning.
What happens: Each chunk is sent to an embedding model. The model reads "The sky is blue" and outputs [0.2, -0.5, 0.8, ...]. Similar meanings get similar vectors.
Why this matters: You can now use math to find similar chunks. If a user asks "What color is the sky?", you embed that question and search for chunks with similar vectors. You don't need exact keyword matches.
Implementation: Embedding models.
- OpenAI embeddings (high quality, costs money)
- Sentence-Transformers (local, free, good quality)
- Cohere Embed (cloud-based)
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector = embeddings.embed_query("The sky is blue")
# Output: a 384-dimensional list of floats, e.g. [0.02, -0.41, ...]
Result: Each chunk has a vector representation.
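The "math" used to compare vectors is usually cosine similarity. Here is a minimal sketch with toy 3-dimensional vectors; real embeddings have hundreds of dimensions, and the numbers below are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sky = [0.2, -0.5, 0.8]         # toy vector for "The sky is blue"
question = [0.25, -0.4, 0.75]  # toy vector for "What color is the sky?"
weather = [-0.7, 0.6, 0.1]     # toy vector for an unrelated sentence

print(cosine_similarity(sky, question))  # close to 1.0: similar meaning
print(cosine_similarity(sky, weather))   # much lower: different meaning
```

Retrieval is then just "find the stored vectors with the highest cosine similarity to the query vector."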
4. Vector Storage
Vectors are stored in a vector database. When a user asks a question, you embed the question, search the database for similar vectors, and retrieve the corresponding chunks.
What happens: Chunks and their vectors are stored in a database optimized for similarity search. You query by vector, not by keywords.
Vector databases (tools that store and search vectors efficiently):
- Pinecone (cloud-hosted, managed)
- Weaviate (self-hosted or cloud)
- Chroma (simple, lightweight, good for development)
- Milvus (open-source, scalable)
- Supabase (PostgreSQL with pgvector extension)
- Qdrant (self-hosted, fast)
from langchain.vectorstores import Chroma
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
Result: Chunks and vectors are stored. You can search them by similarity.
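To make the storage step concrete, here is a toy in-memory store doing brute-force cosine-similarity search; the class and example texts are invented for illustration. Real vector databases use approximate-nearest-neighbor indexes to stay fast at scale:

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store: brute-force cosine-similarity search."""

    def __init__(self):
        self.entries = []  # list of (vector, chunk_text) pairs

    def add(self, vector, chunk_text):
        self.entries.append((vector, chunk_text))

    def similarity_search(self, query_vector, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[0], query_vector),
                        reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([0.9, 0.1], "Vacation policy: 20 days per year")
store.add([0.1, 0.9], "Office address: 12 Main Street")
print(store.similarity_search([0.8, 0.2], k=1))  # the vacation chunk ranks first
```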
5. Retrieval and Generation
When a user asks a question, you:
- Embed the question
- Search the vector database for similar chunks
- Retrieve the top K chunks (e.g., top 5)
- Send the question + chunks to the LLM
- The LLM generates an answer using the retrieved information
Implementation:
query = "What's the company vacation policy?"
# Embed the query (shown for illustration; similarity_search below
# embeds the query internally, so this step is handled for you)
query_vector = embeddings.embed_query(query)
# Search for similar chunks (retrieval)
retrieved_chunks = vector_store.similarity_search(query, k=5)
# Format as context
context = "\n".join(chunk.page_content for chunk in retrieved_chunks)
# Send to LLM with context (llm is any LLM client; the exact call
# depends on your library)
prompt = f"""Answer the question using the provided context.
Context:
{context}
Question: {query}
"""
answer = llm.generate(prompt)
Result: The LLM answers based on current, relevant information from your documents.
Putting It Together: A Simple RAG Architecture
Documents
|
v
Ingestion (load PDFs, web pages, code)
|
v
Chunking (split into 1000-char pieces)
|
v
Embedding (convert to vectors)
|
v
Vector Database (store chunks + vectors)
|
v
User asks a question
|
v
Embed question + Search (find 5 similar chunks)
|
v
Retrieve chunks + Format as context
|
v
LLM (question + context -> answer)
|
v
Return answer to user
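The query-time half of the diagram collapses into a single function. The sketch below uses stand-in callables (fake_embed, fake_search, and fake_llm are all hypothetical) so it runs without any models or services:

```python
def answer(question, embed, search, llm, k=5):
    """End-to-end RAG query path: embed the question, retrieve chunks,
    build a prompt from question + context, and generate an answer."""
    chunks = search(embed(question), k)
    context = "\n".join(chunks)
    prompt = (
        "Answer the question using only the provided context.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)

# Hypothetical stand-ins so the sketch runs on its own:
def fake_embed(text):
    return [float(len(text))]  # a real embedder returns a semantic vector

def fake_search(vector, k):
    return ["Employees get 20 vacation days per year."]  # a real store searches by vector

def fake_llm(prompt):
    return "20 days, per the handbook."  # a real LLM would read the prompt

print(answer("What's the vacation policy?", fake_embed, fake_search, fake_llm))
```

Swapping the stand-ins for a real embedding model, vector store, and LLM client gives you the production version of the same flow.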
Common RAG Architectures
Simple RAG
One vector store, straightforward retrieval, one LLM. Good for basic use cases like a company documentation chatbot.
Hierarchical RAG
Chunks are organized in a hierarchy. Short chunks summarize larger sections. Retrieval starts at a high level and drills down for details. Good for large, structured documents.
Multi-Vector RAG
Each document has multiple vector representations (one for each section, one for the summary, etc.). Retrieval uses the best representation for the question. Better accuracy, more complex.
Hybrid Search
Combines vector search (semantic) with keyword search (exact matches). A question might match semantically similar chunks and also match exact keywords. Results are merged. Good for both conceptual and factual questions.
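A common way to merge the two result lists is reciprocal rank fusion. A minimal sketch, not tied to any framework (the document IDs are invented):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists into one ranking.
    Each item scores 1/(k + rank) per list it appears in; k=60 is a
    common default that dampens the influence of any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_results = ["doc_b", "doc_d", "doc_a"]  # keyword/BM25-style ranking
print(reciprocal_rank_fusion([vector_results, keyword_results]))
# doc_b comes out on top: it ranks high in both lists
```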
Agent-Based RAG
The LLM decides whether to search, what to search for, whether to search again, etc. It becomes an agent that orchestrates retrieval. More flexible but more complex.
When to Use RAG vs. Alternatives
RAG Is Right When:
- You have a corpus of documents (company docs, knowledge base, codebase)
- Information changes frequently (news, logs, user-generated content)
- You need to cite sources
- Cost matters (cheaper than fine-tuning)
- You want quick setup (no model training)
Fine-Tuning Is Better When:
- You need to change the model's behavior fundamentally (writing style, domain knowledge)
- You have high volumes of queries (cost of fine-tuning amortizes)
- You want to teach the model new facts permanently
- Your knowledge is stable enough that occasional retraining is acceptable
Long Context Windows Are Better When:
- All relevant context fits in the model's context window (e.g., 200K tokens for Claude models, 1M+ for recent Gemini models)
- You don't need dynamic retrieval
- Your data is private and you don't want a vector database
Popular RAG Frameworks
Don't build RAG from scratch. Use a framework:
- LangChain - Comprehensive, chains components together, works with many models and vector stores
- LlamaIndex - Focused on RAG specifically, strong on data indexing and retrieval
- Haystack - Production-ready, modular, good documentation
- LiteLLM - If you just need a simpler wrapper for LLM calls
Challenges and Gotchas
Chunk Size
Too small: chunks lack context and retrieval is scattered. Too large: each retrieved chunk is mostly irrelevant text, diluting the context you pass to the LLM and defeating the purpose of chunking. The sweet spot is usually 500-2000 characters with 100-400 characters of overlap.
Relevance
Vector search can retrieve irrelevant chunks (semantic noise). Use reranking: after retrieval, rescore the results with a cross-encoder model or an LLM and keep the best. Or use hybrid search, combining keyword and semantic matching.
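A reranker is just a second, more accurate scoring pass over the retrieved chunks. The sketch below uses a trivial word-overlap scorer as a stand-in for the cross-encoder or LLM judgment you would use in practice (all names and texts are invented):

```python
def rerank(question, chunks, score):
    """Second-pass reranking: reorder retrieved chunks by a better scorer."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)

def overlap_score(question, chunk):
    # Trivial stand-in scorer: count shared words. In practice this
    # would be a cross-encoder model or an LLM relevance judgment.
    return len(set(question.lower().split()) & set(chunk.lower().split()))

chunks = ["The cafeteria opens at 9am.", "Vacation policy is 20 days"]
print(rerank("what is the vacation policy", chunks, overlap_score))
# the vacation chunk moves to the front
```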
Citation
You want the LLM to cite which documents it used. Include source metadata in chunks and ask the LLM to include citations in its answer.
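One low-tech way to enable citations: prepend each chunk's source metadata when formatting the context, so the LLM can quote the tag back. The file names and page numbers below are invented:

```python
# Retrieved chunks with their source metadata (invented examples)
retrieved = [
    {"page_content": "Employees get 20 vacation days per year.",
     "metadata": {"source": "handbook.pdf", "page": 12}},
    {"page_content": "Unused vacation days roll over to the next year.",
     "metadata": {"source": "policy.md", "page": 1}},
]

# Prefix each chunk with a citation tag the LLM can echo back
context = "\n".join(
    f"[{c['metadata']['source']}, p.{c['metadata']['page']}] {c['page_content']}"
    for c in retrieved
)
print(context)
```

Then instruct the LLM: "Cite sources using the bracketed tags, e.g. [handbook.pdf, p.12]."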
Hallucination
Even with context, LLMs sometimes ignore the provided information and make things up. Mitigate by asking the LLM to refuse to answer if the context doesn't support the answer.
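A simple mitigation is a grounding instruction with an explicit refusal phrase. A sketch of such a prompt; the wording is one plausible choice, not a canonical formula:

```python
context = "Employees get 20 vacation days per year."  # retrieved context (invented)
question = "What is the parental leave policy?"       # not answered by the context

grounded_prompt = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply exactly:\n"
    '"I can\'t answer that from the provided documents."\n\n'
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(grounded_prompt)
```

Giving the model an approved way to say "I don't know" makes it measurably less likely to invent an answer.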
Building Your First RAG System
- Pick a framework (LangChain or LlamaIndex)
- Choose a vector database (Chroma for development, Pinecone for production)
- Load your documents
- Split and embed them
- Build a retrieval chain
- Test with sample questions
- Measure accuracy and iterate on chunk size, embedding model, and retrieval parameters
Start simple. Add complexity (reranking, hybrid search, hierarchical retrieval) only when you need it.