RAG Pipeline Architecture: What Developers Need to Know
What Is RAG and Why Does It Matter?
RAG stands for Retrieval-Augmented Generation. It's a technique for giving AI models access to external information without retraining them. Instead of relying solely on a model's training data (which may be outdated or incomplete), you feed the model current information when you ask a question.
The Problem RAG Solves
LLMs have a training cutoff: they do not reliably know about events, APIs, or internal documents that appeared after that date. Ask about very recent releases or private data and the model may guess. Your company's internal documentation was never in the training set. A library that shipped last week may be missing or wrong in the model's answers.
RAG solves this: fetch relevant information, then pass it to the model along with the question. The model answers using current, relevant data.
The Result
You get a chatbot that knows your company's docs, a code assistant that understands your codebase, a search engine that finds answers in your knowledge base. All powered by a single LLM, without retraining.
The Five Components of a RAG Pipeline
1. Document Ingestion
You start with sources of truth: PDFs, web pages, code repositories, databases, markdown files, anything with information.
What happens: Documents are loaded from their source. A PDF file is read. A GitHub repo is cloned. A database is queried.
Implementation: Loaders for different formats.
- PDF loaders (PyPDF, pdfplumber)
- Web scrapers (Beautiful Soup, Selenium)
- Code loaders (read from GitHub API or local filesystem)
- Database readers (SQL queries)
- Office docs (python-docx for Word, openpyxl for Excel)
Example:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
# Now you have a list of Document objects, each with
# page_content (the text) and metadata (e.g. {"source": "..."})
Result: Raw documents in memory, ready for processing.
2. Chunking
Documents are large: a 50-page PDF, thousands of lines of code. You can't send a whole document to an LLM every time someone asks a question. So you split documents into smaller pieces called chunks.
What happens: A document is split into overlapping pieces. A paragraph becomes its own chunk. Code is split by function or class. The overlap helps preserve context.
Why overlap matters: if relevant text falls on a chunk boundary, overlap means at least one chunk still contains it in full.
Implementation: Text splitters with configurable chunk size and overlap.
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Characters per chunk
chunk_overlap=200 # Overlap for context
)
chunks = splitter.split_documents(documents)
Result: A list of chunks of at most 1000 characters each, with metadata about their source.
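The mechanics of overlap can be sketched in a few lines. This is a naive fixed-size splitter for illustration; real splitters like RecursiveCharacterTextSplitter also try to break on paragraph and sentence boundaries:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-size splitter: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share
    chunk_overlap characters of context."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```

Note how the last 200 characters of each chunk reappear at the start of the next one.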
3. Embedding
Chunks are text. Vector databases work with numbers. Embedding converts text to a vector (list of numbers) that captures meaning.
What happens: Each chunk is sent to an embedding model. The model reads "The sky is blue" and outputs [0.2, -0.5, 0.8, ...]. Similar meanings get similar vectors.
Why this matters: You can now use math to find similar chunks. If a user asks "What color is the sky?", you embed that question and search for chunks with similar vectors. You don't need exact keyword matches.
Implementation: Embedding models.
- OpenAI embeddings (high quality, costs money)
- Sentence-Transformers (local, free, good quality)
- Cohere Embed (cloud-based)
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vector = embeddings.embed_query("The sky is blue")
# Output: a 384-dimensional list of floats, e.g. [0.02, -0.41, ...]
Result: Each chunk has a vector representation.
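The "math" used to compare vectors is usually cosine similarity. Here is a minimal sketch with toy 3-dimensional vectors; real embeddings have hundreds of dimensions, and the numbers below are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sky = [0.2, -0.5, 0.8]         # toy vector for "The sky is blue"
question = [0.25, -0.4, 0.75]  # toy vector for "What color is the sky?"
weather = [-0.7, 0.6, 0.1]     # toy vector for an unrelated sentence

print(cosine_similarity(sky, question))  # close to 1.0: similar meaning
print(cosine_similarity(sky, weather))   # much lower: different meaning
```

Retrieval is then just "find the stored vectors with the highest cosine similarity to the query vector."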
4. Vector Storage
Vectors are stored in a vector database. When a user asks a question, you embed the question, search the database for similar vectors, and retrieve the corresponding chunks.
What happens: Chunks and their vectors are stored in a database optimized for similarity search. You query by vector, not by keywords.
Vector databases (tools that store and search vectors efficiently):
- Pinecone (cloud-hosted, managed)
- Weaviate (self-hosted or cloud)
- Chroma (simple, lightweight, good for development)
- Milvus (open-source, scalable)
- Supabase (PostgreSQL with pgvector extension)
- Qdrant (self-hosted, fast)
from langchain.vectorstores import Chroma
vector_store = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db"
)
Result: Chunks and vectors are stored. You can search them by similarity.
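To make the storage step concrete, here is a toy in-memory store doing brute-force cosine-similarity search; the class and example texts are invented for illustration. Real vector databases use approximate-nearest-neighbor indexes to stay fast at scale:

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store: brute-force cosine-similarity search."""

    def __init__(self):
        self.entries = []  # list of (vector, chunk_text) pairs

    def add(self, vector, chunk_text):
        self.entries.append((vector, chunk_text))

    def similarity_search(self, query_vector, k=5):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)

        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[0], query_vector),
                        reverse=True)
        return [text for _, text in ranked[:k]]

store = TinyVectorStore()
store.add([0.9, 0.1], "Vacation policy: 20 days per year")
store.add([0.1, 0.9], "Office address: 12 Main Street")
print(store.similarity_search([0.8, 0.2], k=1))  # the vacation chunk ranks first
```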
5. Retrieval and Generation
When a user asks a question, you:
- Embed the question
- Search the vector database for similar chunks
- Retrieve the top K chunks (e.g., top 5)
- Send the question + chunks to the LLM
- The LLM generates an answer using the retrieved information
Implementation:
query = "What's the company vacation policy?"
# Embed the query (shown for illustration; similarity_search below
# embeds the query internally, so this step is handled for you)
query_vector = embeddings.embed_query(query)
# Search for similar chunks (retrieval)
retrieved_chunks = vector_store.similarity_search(query, k=5)
# Format as context
context = "\n".join(chunk.page_content for chunk in retrieved_chunks)
# Send to LLM with context (llm is any LLM client; the exact call
# depends on your library)
prompt = f"""Answer the question using the provided context.
Context:
{context}
Question: {query}
"""
answer = llm.generate(prompt)
Result: The LLM answers based on current, relevant information from your documents.
Putting It Together: A Simple RAG Architecture
Documents
|
v
Ingestion (load PDFs, web pages, code)
|
v
Chunking (split into 1000-char pieces)
|
v
Embedding (convert to vectors)
|
v
Vector Database (store chunks + vectors)
|
v
User asks a question
|
v
Embed question + Search (find 5 similar chunks)
|
v
Retrieve chunks + Format as context
|
v
LLM (question + context -> answer)
|
v
Return answer to user
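The query-time half of the diagram collapses into a single function. The sketch below uses stand-in callables (fake_embed, fake_search, and fake_llm are all hypothetical) so it runs without any models or services:

```python
def answer(question, embed, search, llm, k=5):
    """End-to-end RAG query path: embed the question, retrieve chunks,
    build a prompt from question + context, and generate an answer."""
    chunks = search(embed(question), k)
    context = "\n".join(chunks)
    prompt = (
        "Answer the question using only the provided context.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)

# Hypothetical stand-ins so the sketch runs on its own:
def fake_embed(text):
    return [float(len(text))]  # a real embedder returns a semantic vector

def fake_search(vector, k):
    return ["Employees get 20 vacation days per year."]  # a real store searches by vector

def fake_llm(prompt):
    return "20 days, per the handbook."  # a real LLM would read the prompt

print(answer("What's the vacation policy?", fake_embed, fake_search, fake_llm))
```

Swapping the stand-ins for a real embedding model, vector store, and LLM client gives you the production version of the same flow.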
Common RAG Architectures
Simple RAG
One vector store, straightforward retrieval, one LLM. Good for basic use cases like a company documentation chatbot.
Hierarchical RAG
Chunks are organized in a hierarchy. Short chunks summarize larger sections. Retrieval starts at a high level and drills down for details. Good for large, structured documents.
Multi-Vector RAG
Each document has multiple vector representations (one for each section, one for the summary, etc.). Retrieval uses the best representation for the question. Better accuracy, more complex.
Hybrid Search
Combines vector search (semantic) with keyword search (exact matches). A question might match semantically similar chunks and also match exact keywords. Results are merged. Good for both conceptual and factual questions.
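A common way to merge the two result lists is reciprocal rank fusion. A minimal sketch, not tied to any framework (the document IDs are invented):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists into one ranking.
    Each item scores 1/(k + rank) per list it appears in; k=60 is a
    common default that dampens the influence of any single list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["doc_a", "doc_b", "doc_c"]   # semantic ranking
keyword_results = ["doc_b", "doc_d", "doc_a"]  # keyword/BM25-style ranking
print(reciprocal_rank_fusion([vector_results, keyword_results]))
# doc_b comes out on top: it ranks high in both lists
```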
Agent-Based RAG
The LLM decides whether to search, what to search for, whether to search again, etc. It becomes an agent that orchestrates retrieval. More flexible but more complex.
When to Use RAG vs. Alternatives
RAG Is Right When:
- You have a corpus of documents (company docs, knowledge base, codebase)
- Information changes frequently (news, logs, user-generated content)
- You need to cite sources
- Cost matters (cheaper than fine-tuning)
- You want quick setup (no model training)
Fine-Tuning Is Better When:
- You need to change the model's behavior fundamentally (writing style, domain knowledge)
- You have high volumes of queries (cost of fine-tuning amortizes)
- You want to teach the model new facts permanently
- Your knowledge is stable enough that occasional retraining is acceptable
Long Context Windows Are Better When:
- All relevant context fits in the model's context window (e.g., 200K tokens for Claude models, 1M+ for recent Gemini models)
- You don't need dynamic retrieval
- Your data is private and you don't want a vector database
Popular RAG Frameworks
Don't build RAG from scratch. Use a framework:
- LangChain - Comprehensive, chains components together, works with many models and vector stores
- LlamaIndex - Focused on RAG specifically, strong on data indexing and retrieval
- Haystack - Production-ready, modular, good documentation
- LiteLLM - If you just need a simpler wrapper for LLM calls
Challenges and Gotchas
Chunk Size
Too small: chunks lack context and retrieval is scattered. Too large: each retrieved chunk is mostly irrelevant text, diluting the context you pass to the LLM and defeating the purpose of chunking. The sweet spot is usually 500-2000 characters with 100-400 characters of overlap.
Relevance
Vector search can retrieve irrelevant chunks (semantic noise). Use reranking: after retrieval, rescore the results with a cross-encoder model or an LLM and keep the best. Or use hybrid search, combining keyword and semantic matching.
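A reranker is just a second, more accurate scoring pass over the retrieved chunks. The sketch below uses a trivial word-overlap scorer as a stand-in for the cross-encoder or LLM judgment you would use in practice (all names and texts are invented):

```python
def rerank(question, chunks, score):
    """Second-pass reranking: reorder retrieved chunks by a better scorer."""
    return sorted(chunks, key=lambda c: score(question, c), reverse=True)

def overlap_score(question, chunk):
    # Trivial stand-in scorer: count shared words. In practice this
    # would be a cross-encoder model or an LLM relevance judgment.
    return len(set(question.lower().split()) & set(chunk.lower().split()))

chunks = ["The cafeteria opens at 9am.", "Vacation policy is 20 days"]
print(rerank("what is the vacation policy", chunks, overlap_score))
# the vacation chunk moves to the front
```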
Citation
You want the LLM to cite which documents it used. Include source metadata in chunks and ask the LLM to include citations in its answer.
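One low-tech way to enable citations: prepend each chunk's source metadata when formatting the context, so the LLM can quote the tag back. The file names and page numbers below are invented:

```python
# Retrieved chunks with their source metadata (invented examples)
retrieved = [
    {"page_content": "Employees get 20 vacation days per year.",
     "metadata": {"source": "handbook.pdf", "page": 12}},
    {"page_content": "Unused vacation days roll over to the next year.",
     "metadata": {"source": "policy.md", "page": 1}},
]

# Prefix each chunk with a citation tag the LLM can echo back
context = "\n".join(
    f"[{c['metadata']['source']}, p.{c['metadata']['page']}] {c['page_content']}"
    for c in retrieved
)
print(context)
```

Then instruct the LLM: "Cite sources using the bracketed tags, e.g. [handbook.pdf, p.12]."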
Hallucination
Even with context, LLMs sometimes ignore the provided information and make things up. Mitigate by asking the LLM to refuse to answer if the context doesn't support the answer.
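A simple mitigation is a grounding instruction with an explicit refusal phrase. A sketch of such a prompt; the wording is one plausible choice, not a canonical formula:

```python
context = "Employees get 20 vacation days per year."  # retrieved context (invented)
question = "What is the parental leave policy?"       # not answered by the context

grounded_prompt = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply exactly:\n"
    '"I can\'t answer that from the provided documents."\n\n'
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(grounded_prompt)
```

Giving the model an approved way to say "I don't know" makes it measurably less likely to invent an answer.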
Building Your First RAG System
- Pick a framework (LangChain or LlamaIndex)
- Choose a vector database (Chroma for development, Pinecone for production)
- Load your documents
- Split and embed them
- Build a retrieval chain
- Test with sample questions
- Measure accuracy and iterate on chunk size, embedding model, and retrieval parameters
Start simple. Add complexity (reranking, hybrid search, hierarchical retrieval) only when you need it.