What RAG actually is
RAG stands for Retrieval-Augmented Generation. The pattern is simple: before you ask the LLM a question, you fetch the most relevant documents from your own data and paste them into the prompt. The model then answers based on real source material instead of guessing from its training data.
That is the whole idea. Retrieve, then generate. Everything else (vector databases, embeddings, chunking strategies) is just plumbing to make the retrieval step work well.
Why RAG exists
LLMs hallucinate when they do not have the facts. Ask a generic model about your company's refund policy and it will invent something plausible. RAG fixes this by grounding the answer in an authoritative source you control. If your knowledge base says returns are accepted within 30 days, the model sees that text in the prompt and uses it directly.
This is the difference between a chatbot that sounds confident and one that is actually correct.
The simplest RAG (no vector DB)
You do not need a vector database to do RAG. For small knowledge bases, a Python dict and a substring match is enough:
```python
KB = {
    "refunds": "Returns within 30 days of delivery. Original packaging required.",
    "shipping": "Standard 3-5 business days, expedited 1-2 days.",
    "warranty": "1-year limited warranty on all electronics.",
    # ...
}

def answer(question):
    # naive keyword match
    for topic, content in KB.items():
        if topic in question.lower():
            context = content
            break
    else:
        context = "(no relevant info)"
    prompt = f"Context: {context}\n\nQuestion: {question}"
    # then call LLM with context...
```

This works fine for around 50 documents. The keyword match is dumb, but if your topics are well-named and your users ask reasonable questions, you will get correct answers most of the time. Ship this first. Add complexity only when it breaks.
When you actually need a vector DB
Three signals tell you the naive approach has run out of road:
- Your knowledge base has more than around 100 documents and substring matching starts missing relevant content.
- Queries are descriptive (a user asks "how long do I wait for a refund") instead of keyword-based ("refund policy").
- Your KB updates frequently and you need a system that handles re-indexing without manual work.
At that point you need embeddings. An embedding is a vector (a list of numbers) that represents the meaning of a piece of text. Two pieces of text with similar meaning produce similar vectors, even when they share no words. A vector database stores these vectors and lets you find the closest matches to a query vector quickly.
Picking a vector DB
| DB | When to pick it | Hosted? |
|---|---|---|
| pgvector | You already use Postgres, KB up to 1M chunks | Self-host or any managed Postgres |
| Pinecone | Want fully managed, scale beyond 1M chunks, low ops | Hosted only |
| Turbopuffer | Cost-sensitive at scale, blob-storage backed | Hosted |
| Weaviate | Want hybrid search and rich filtering, open-source | Both |
| Chroma | Local dev, quick prototyping | Local mostly |
The honest default is pgvector. If you already run Postgres, adding the extension is one command and your operational surface does not grow. Most teams never outgrow it. Reach for Pinecone or Turbopuffer when you have a real scale or latency reason, not because the docs look nice.
The retrieval pattern in code
Here is what real retrieval looks like once you have embeddings stored:
```python
from openai import OpenAI

client = OpenAI()

def embed(text):
    # One embedding per call; batch the input when indexing a whole KB.
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    ).data[0].embedding

def cosine(a, b):
    # Cosine similarity: dot product normalized by both vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def retrieve(question, kb_chunks, kb_embeddings, k=3):
    # Score every chunk against the question, keep the k closest.
    q_emb = embed(question)
    scores = [cosine(q_emb, e) for e in kb_embeddings]
    top_indices = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return [kb_chunks[i] for i in top_indices]

def answer(question, kb_chunks, kb_embeddings):
    context = retrieve(question, kb_chunks, kb_embeddings)
    prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {question}"
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```

In production you replace the in-memory list with a vector DB query, but the shape stays the same: embed the question, find the top k closest chunks, paste them into the prompt.
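Swapping in pgvector keeps that shape intact. Here is a minimal sketch, assuming Postgres with the pgvector extension available and psycopg as the driver; the connection string and the `chunks` table are illustrative, and `embed()` is the function from the snippet above:

```python
import psycopg  # assumes: pip install psycopg

conn = psycopg.connect("postgresql://localhost/kb", autocommit=True)  # illustrative DSN
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)  -- dimension of text-embedding-3-small
    )
""")

def retrieve_pg(question, k=3):
    q = "[" + ",".join(map(str, embed(question))) + "]"  # pgvector literal
    # <=> is pgvector's cosine distance operator: smaller means closer
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (q, k),
    ).fetchall()
    return [row[0] for row in rows]
```

For real data you would also add an index (pgvector supports HNSW and IVFFlat), but the query shape above is the whole integration.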
Hybrid retrieval (lexical plus semantic)
Pure semantic search misses queries where exact keywords matter. "Return policy for shoes" needs both an exact match on "shoes" (lexical) and a meaning match on "return policy" (semantic). Pure cosine similarity on embeddings can rank a generic "returns" document above the shoe-specific one because semantically they look similar.
The fix is hybrid retrieval: run a keyword search (BM25 is the standard) and a vector search in parallel, then merge the results. Most modern vector DBs support this natively. Turn it on when you see your top-k results miss keyword-specific queries on your eval set.
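If your database does not do the merge for you, reciprocal rank fusion (RRF) is a common, simple way to combine the two ranked lists. A minimal sketch; the inputs are assumed to be lists of chunk IDs ordered best-first, and `k=60` is the conventional RRF constant:

```python
def rrf_merge(lexical_ids, semantic_ids, k=60, top_n=3):
    # Reciprocal rank fusion: each list votes 1 / (k + rank) per document.
    # Documents that rank well in either list float to the top.
    scores = {}
    for ids in (lexical_ids, semantic_ids):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. merged = rrf_merge(bm25_search(question), vector_search(question))
```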
RAG quality is a chunking problem
The single biggest lever in RAG quality is not which database you pick. It is how you split your documents into chunks before embedding them.
Chunks that are too small lose context. Chunks that are too large dilute the embedding and waste prompt tokens on irrelevant text. A good starting point: 500 to 1000 tokens per chunk with 50 to 100 tokens of overlap between adjacent chunks. The overlap matters because important sentences often sit on the boundary between two chunks.
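As a concrete version of that starting point, here is a fixed-window token chunker. It uses tiktoken for token counting; the 500-token window and 75-token overlap are just midpoints of the ranges above, not a recommendation for your data:

```python
import tiktoken  # assumes: pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk(text, size=500, overlap=75):
    # Slide a fixed-size token window over the document; adjacent
    # windows share `overlap` tokens so boundary sentences appear twice.
    tokens = enc.encode(text)
    step = size - overlap
    return [
        enc.decode(tokens[i : i + size])
        for i in range(0, len(tokens), step)
    ]
```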
Tune chunk size against your eval set. If retrieval recall is low, try smaller chunks. If the model answers correctly but cites the wrong source, try larger chunks with more overlap.
The decision tree
A rough guide for where to land based on your data size:
- KB up to 30 docs: stuff everything into the prompt. No retrieval needed.
- KB 30 to 100 docs: naive keyword or substring match. Ship it.
- KB 100+ docs: real RAG with embeddings and a vector DB. Default to pgvector.
- Recall below 80% on your golden eval: add hybrid retrieval (BM25 plus vector).
Do not skip steps. A team running pgvector with 80% recall will ship faster and debug easier than a team running Pinecone with 60% recall and three reranking layers.
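Measuring that recall number takes a dozen lines, not a framework. A minimal sketch, assuming you maintain a hand-built golden set of question-to-chunk-ID pairs and a retriever that returns chunk IDs (both names here are hypothetical):

```python
def recall_at_k(golden_set, retrieve_ids, k=3):
    # golden_set: list of (question, expected_chunk_id) pairs
    # retrieve_ids(question, k): your retriever, returning chunk IDs
    hits = sum(
        expected in retrieve_ids(question, k)
        for question, expected in golden_set
    )
    return hits / len(golden_set)

# Below 0.8? Try hybrid retrieval or different chunking before adding layers.
```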
Tracing is non-negotiable
Retrieval failures are the number one cause of bad RAG answers. The model is rarely the problem. The chunks you fed it were wrong, or missing, or buried under irrelevant ones.
You cannot debug this without seeing the retrieved context for every query in production. Log it. Trace it. Build evals on it. We covered this in Chapter 1.4: LLM workflows and tracing, and it applies twice as much to RAG systems.
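The exact logging format matters less than having one. Here is a minimal sketch of a traced version of the `answer()` function from earlier, writing one JSON line per query; swap the `print` for whatever tracing backend you already run:

```python
import json
import time
import uuid

def answer_traced(question, kb_chunks, kb_embeddings):
    # Log the retrieved context alongside the answer, so a bad response
    # can be traced back to bad retrieval instead of blamed on the model.
    context = retrieve(question, kb_chunks, kb_embeddings)  # retrieve() from above
    prompt = f"Context:\n{chr(10).join(context)}\n\nQuestion: {question}"
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    print(json.dumps({
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "question": question,
        "retrieved": context,
        "answer": reply,
    }))
    return reply
```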
Next
In the next chapter we look at the other side of grounding a model in your data, Chapter 2.4: Should You Fine-Tune an LLM? After that, we put it all together in Chapter 2.5: Choosing the Right Stack.
