What Is RAG and Why Does It Matter
The Core Problem RAG Solves
An LLM has a knowledge cutoff. It cannot answer questions about documents it has never seen. It cannot tell you what is in your company's internal wiki, your product documentation, last quarter's financial report, or a paper published last month.
You could fine-tune a model on that content. But fine-tuning is expensive, slow, and does not help with content that changes frequently. You would need to fine-tune again every time your documentation updates.
You could stuff all your documents into the context window. But context windows have hard limits, and large contexts are slow and expensive to process. A large document collection simply will not fit.
Retrieval-Augmented Generation solves this differently. Instead of baking knowledge into the model, you retrieve the most relevant pieces of knowledge at query time and give them to the model as context.
How RAG Works
RAG has two phases: ingestion and retrieval-generation.
Ingestion (offline):
- Load your documents (PDFs, HTML pages, plain text, code, etc.)
- Split them into smaller chunks
- Convert each chunk to a vector embedding
- Store the embeddings in a vector database, indexed for fast search
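The ingestion steps above can be sketched in a few lines of Python. Everything here is illustrative: the hash-based embed function is a stand-in for a real embedding model, and the in-memory list is a stand-in for a vector database with an index.

```python
import hashlib
import math

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (one simple chunking strategy)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text, dim=64):
    """Toy embedding: hash each word into a slot of a fixed-size vector.
    A real system would call an embedding model here instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-normalize for cosine similarity

# "Vector database": an in-memory list of (chunk, embedding) pairs.
index = []
for doc in ["RAG retrieves relevant chunks at query time.",
            "Fine-tuning bakes knowledge into model weights."]:
    for chunk in chunk_text(doc):
        index.append((chunk, embed(chunk)))
```

In production, each stand-in is replaced by real infrastructure: a document loader, an embedding model, and a vector database that indexes the vectors for fast approximate search.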
Retrieval-generation (online, at query time):
- A user asks a question
- Convert the question to an embedding using the same model
- Search the vector database for the chunks most similar to the question
- Assemble the retrieved chunks into a context block
- Send the question plus context to the LLM
- The LLM generates an answer grounded in the retrieved content
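The query-time steps can be sketched the same way. This is a minimal, self-contained illustration: the toy hash-based embedding stands in for a real embedding model, the list stands in for a vector database, and the final prompt string is what you would send to an LLM API.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy hash-based embedding; a real system would call an embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

chunks = [
    "RAG retrieves relevant chunks at query time.",
    "Fine-tuning bakes knowledge into model weights.",
    "Context windows have hard token limits.",
]
index = [(c, embed(c)) for c in chunks]

# 1. A user asks a question; 2. embed it with the same model.
question = "How does RAG get facts at query time?"
q_vec = embed(question)

# 3. Retrieve the top-k most similar chunks.
top_k = sorted(index, key=lambda pair: cosine(q_vec, pair[1]), reverse=True)[:2]

# 4-5. Assemble the context block and the prompt to send to the LLM.
context = "\n".join(chunk for chunk, _ in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

The grounding instruction in the prompt ("using only this context") is what pushes the model to answer from the retrieved content rather than from its parametric memory.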
The key insight: the LLM does not need to remember the facts. It just needs to read the relevant facts when it is time to answer.
When RAG Is the Right Choice
RAG is well-suited when:
- Your knowledge base changes frequently (news, documentation, tickets, emails)
- You need the model to cite specific sources
- Your document collection is too large to fit in a context window
- You need answers grounded in specific, authoritative content
- You need to control what information the model has access to
RAG is not the right choice when:
- Your content is stable and small enough to fit in a context window (just put it in the prompt)
- The task requires deep behavioral change in the model (fine-tuning is better)
- You need the model to learn a new domain-specific reasoning style, not just facts
- Latency requirements are so tight that the extra retrieval step adds unacceptable overhead
RAG vs. Fine-Tuning vs. Long Context
These three approaches are not mutually exclusive. They address different problems.
Long context (prompt stuffing): Put the relevant documents directly in the prompt. Works well when the document set is small (under a few hundred pages) and you can afford the token cost of large contexts.
Fine-tuning: Train the model on your domain data. Best for style, tone, and domain-specific reasoning. Poor at keeping up with factual updates.
RAG: Retrieve relevant facts at query time. Best for large, frequently updated knowledge bases. Requires a retrieval infrastructure.
Many production systems combine approaches: a fine-tuned model with a RAG pipeline on top of it.
What You Will Learn in This Course
This course covers how to build a RAG system from scratch and make it work reliably in production. You will learn:
- Document ingestion and chunking strategies
- Embedding models and vector databases
- Retrieval design: vector search, hybrid search, and reranking
- Context assembly and generation prompting
- Retrieval evaluation and quality measurement
- Production observability and cost optimization
By the end, you will have the skills to build a RAG system that is not just functional in a demo, but defensible in production.
What You Need Before This Course
This course assumes:
- Experience with Python or a similar language
- Familiarity with calling LLM APIs
- Basic understanding of HTTP APIs and JSON
No ML background is required. No experience with vector databases or embedding models is assumed.