RAG systems promise grounded, accurate LLM responses — but poorly evaluated pipelines deliver hallucinations with false confidence. Retrieval relevance failures, chunking misalignments, and generation-retrieval disconnects silently degrade answer quality. This checklist provides RAG engineers with a systematic framework to evaluate every stage of the retrieval-augmented generation pipeline.
Measure the percentage of relevant documents your retrieval system finds from the total relevant documents in your corpus. Use a labeled evaluation set of at least 100 queries with annotated relevant passages. Track recall@k for k=3, 5, 10, and 20.
Calculate the proportion of retrieved documents that are actually relevant to the query. High recall with low precision floods the context window with noise, degrading generation quality. Target precision@5 of at least 0.7 for production systems.
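Both metrics above reduce to small set operations once you have ranked results and labeled relevant IDs. A minimal sketch, assuming retrieval returns lists of document IDs and your labels are per-query sets of relevant IDs:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant documents that appear in the top-k results.
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are actually relevant.
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return len(set(top_k) & set(relevant)) / len(top_k)
```

Average both over the full evaluation set and report each k separately; a single blended number hides the recall/precision tradeoff you are trying to tune.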
Benchmark dense vector retrieval against BM25/keyword search on your specific dataset. Neither approach universally wins; hybrid retrieval often outperforms both. Test on 100+ queries spanning simple lookups and complex semantic searches.
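One common way to combine the two retrievers is reciprocal rank fusion (RRF), which needs only the ranked lists, not comparable scores. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked doc-id lists from each retriever (e.g. BM25 and dense).
    # k=60 is the constant from the original RRF paper; it rarely needs tuning.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks alone, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.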
Evaluate whether your query processing correctly handles abbreviations, synonyms, and implicit intent. Test with queries like 'ROI of ML' where the system needs to understand domain abbreviations. Measure retrieval improvement from query expansion.
If using a reranker, measure its impact on retrieval quality versus direct embedding similarity. Cross-encoders should meaningfully improve precision@k, otherwise they add latency without value. Compare MRR and NDCG before and after reranking.
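Both ranking metrics are straightforward to compute yourself; a sketch assuming binary relevance sets for MRR and graded relevance labels for NDCG:

```python
import math

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    # 1/rank of the first relevant hit, averaged over queries (0 if no hit).
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def ndcg_at_k(ranked, gains, k):
    # gains: doc_id -> graded relevance label (missing docs count as 0).
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0
```

Run both on the pre-rerank and post-rerank orderings of the same retrieval set; if the deltas are within noise, the cross-encoder is pure latency.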
Test retrieval for questions requiring information from multiple documents. Single-chunk retrieval often fails for comparative or aggregation queries. Evaluate whether your system surfaces all necessary chunks for multi-hop questions.
Verify that recently ingested documents appear in retrieval results within your SLA window. Stale indices cause the RAG system to miss new information. Test ingestion-to-retrieval latency with timestamped test documents.
Benchmark your embedding model on domain-specific similarity tasks. General-purpose embeddings often underperform on specialized vocabulary. Create a domain-specific evaluation set and compare at least 3 embedding models.
Test with queries that have no relevant documents in the corpus. The system should return low-confidence results or no results rather than irrelevant passages. Measure false positive retrieval rate on 50+ out-of-scope queries.
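If your retriever exposes a similarity score, the false positive rate is just the share of out-of-scope queries whose best hit still clears your confidence threshold. A minimal sketch, assuming one best score per query:

```python
def false_positive_rate(best_scores, threshold):
    # best_scores: top similarity score returned for each out-of-scope query.
    # A query is a false positive when its best score still clears the threshold.
    if not best_scores:
        return 0.0
    return sum(s >= threshold for s in best_scores) / len(best_scores)
```

Sweep the threshold on this set alongside recall on in-scope queries to pick an operating point, rather than choosing a cutoff by eye.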
Measure end-to-end retrieval latency including embedding generation, vector search, and reranking under realistic load. P50 should be under 200ms and P99 under 500ms for interactive applications. Profile each stage independently.
Test at least 3 chunk sizes (256, 512, 1024 tokens) on your specific corpus and measure downstream answer quality. Optimal chunk size varies by document type and query patterns. There is no universal best chunk size.
Verify that chunks do not break mid-sentence, mid-paragraph, or mid-table. Broken chunks lose context and produce fragmented retrieval results. Audit 100+ random chunks for boundary quality.
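A cheap first-pass audit is a punctuation heuristic: well-bounded prose chunks should end at sentence-final punctuation. This is only a heuristic (it misflags headings and list items), but it surfaces the worst offenders fast:

```python
import re

# Heuristic: a cleanly bounded chunk ends with sentence-final punctuation,
# optionally followed by a closing quote, bracket, or parenthesis.
SENTENCE_END = re.compile(r"[.!?][\"')\]]?$")

def broken_boundary_rate(chunks):
    # Share of chunks whose text appears to stop mid-sentence.
    if not chunks:
        return 0.0
    broken = sum(1 for c in chunks if not SENTENCE_END.search(c.strip()))
    return broken / len(chunks)
```

Sample 100+ chunks, score them, and manually review the flagged ones; tables and code blocks will need their own checks.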
Test different chunk overlap sizes (0, 50, 100, 200 tokens) and measure their impact on retrieval recall. Overlap prevents information loss at boundaries but increases index size. Find the minimum overlap that maintains recall.
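The overlap sweep is easiest when the chunker takes overlap as a parameter. A sliding-window sketch, assuming the document is already tokenized into a list:

```python
def chunk_tokens(tokens, size, overlap):
    # Sliding window: consecutive chunks share `overlap` tokens, so sweeping
    # overlap in {0, 50, 100, 200} is a one-argument change.
    assert 0 <= overlap < size
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

Re-index with each overlap setting and rerun your recall@k suite; the right answer is the smallest overlap where recall stops improving.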
Evaluate whether different document types (PDFs, HTML, Markdown, slides) require different chunking strategies. Tables, code blocks, and lists often need special handling. Test chunking quality per document type.
Verify that chunks carry useful metadata (source document, section heading, page number, date) that aids retrieval and citation. Test whether metadata filters improve retrieval precision. Missing metadata degrades user trust in answers.
Evaluate parent-child chunk relationships where summaries or section headings are indexed alongside detail chunks. Hierarchical approaches can improve both recall and context quality. Compare against flat chunking on 50+ queries.
Test chunking and retrieval of tabular data, JSON structures, and other non-prose content. Standard text chunkers destroy table semantics. Verify that table-dependent queries retrieve intact, interpretable data.
Audit your index for duplicate or near-duplicate chunks that waste context window space and skew relevance scoring. Deduplicate at the chunk level and measure retrieval quality improvement. Even 5% duplication impacts quality.
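For a spot audit, pairwise Jaccard similarity over word shingles catches near-duplicates without any infrastructure. A sketch; the O(n²) loop is fine for samples, but use MinHash/LSH (e.g. the datasketch library) for a full index:

```python
def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicate_pairs(chunks, threshold=0.8):
    # Pairwise Jaccard over word shingles; flags pairs above the threshold.
    sigs = [shingles(c) for c in chunks]
    return [(i, j)
            for i in range(len(chunks))
            for j in range(i + 1, len(chunks))
            if jaccard(sigs[i], sigs[j]) >= threshold]
```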
Test your process for updating chunks when source documents change. Stale chunks, orphaned embeddings, and version conflicts all degrade quality. Verify that updates propagate correctly within your SLA.
Implement per-chunk quality metrics like retrieval frequency, click-through rate, and contribution to correct answers. Identify low-quality chunks that are frequently retrieved but rarely useful. Use analytics to prioritize re-chunking efforts.
Measure whether generated answers are faithful to the retrieved context — not introducing information absent from the provided chunks. Use automated faithfulness metrics like RAGAS or manual annotation. Faithfulness scores below 0.85 indicate serious grounding issues.
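As a smoke test before investing in RAGAS or NLI-based scoring, a crude lexical check can flag blatant grounding failures: answer sentences whose content words mostly never appear in the retrieved context. This proxy misses paraphrase and entailment entirely, so treat it as a tripwire, not a faithfulness score:

```python
def unsupported_sentence_rate(answer_sentences, context, min_overlap=0.5):
    # Crude lexical proxy: flag answer sentences whose content words (len > 3)
    # mostly do not appear anywhere in the retrieved context.
    if not answer_sentences:
        return 0.0
    context_words = set(context.lower().split())
    flagged = 0
    for sentence in answer_sentences:
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged += 1
    return flagged / len(answer_sentences)
```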
Evaluate whether the generated answer actually addresses the user's question rather than merely parroting retrieved content. A faithfully grounded but irrelevant answer is still a failure. Score relevance on a 1-5 scale across 100+ test queries.

Analyze how much of the retrieved context the model actually uses in its response. Under-utilization suggests retrieval noise; over-reliance on a single chunk suggests fragile retrieval. Track context coverage metrics per response.
Verify that inline citations and source references point to the correct chunks and that quoted text matches the source. Incorrect citations are worse than no citations because they create false trust. Audit 50+ cited responses.
Test whether the model appropriately says 'I don't know' when retrieved context is insufficient, rather than generating plausible but unsupported answers. Create 30+ test cases with intentionally sparse retrieval. Measure abstention accuracy.
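Abstention accuracy can be scored automatically if your prompt instructs the model to use a recognizable refusal phrasing. A sketch; the phrase list is an assumption you must align with your own prompt template:

```python
# Adjust to whatever refusal wording your generation prompt mandates.
ABSTAIN_PHRASES = ("i don't know", "cannot answer", "not enough information")

def abstention_accuracy(answers, should_abstain):
    # should_abstain: True when the retrieved context was intentionally sparse.
    correct = 0
    for answer, expected in zip(answers, should_abstain):
        abstained = any(p in answer.lower() for p in ABSTAIN_PHRASES)
        correct += int(abstained == expected)
    return correct / len(answers)
```

Include both directions in the test set: sparse-context cases where the model must abstain, and well-supported cases where abstaining would itself be a failure.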
Categorize hallucinations into intrinsic (contradicting retrieved context) and extrinsic (adding information not in context). Each type requires different mitigation strategies. Track the ratio and trend of each hallucination type.
Test at least 3 variations of your generation prompt template and measure their impact on faithfulness, relevance, and answer quality. Small wording changes can significantly shift grounding behavior. Document optimal prompts per query type.
For responses longer than 200 words, evaluate structural coherence, logical flow, and absence of contradictions within the response. Long-form generation often includes internal inconsistencies. Use human evaluation for responses exceeding 500 words.
Test queries that require comparing or synthesizing information from multiple retrieved chunks. The model should present a coherent synthesis, not a disjointed summary of each chunk. Evaluate on 20+ comparison queries.
Benchmark answer quality across temperature settings (0.0, 0.3, 0.7) for your specific use case. Lower temperatures improve faithfulness but may reduce fluency. Find the optimal tradeoff for your quality requirements.
Measure final answer correctness on a labeled dataset of at least 200 question-answer pairs specific to your domain. This is the north star metric that captures the cumulative effect of retrieval, chunking, and generation quality. Track weekly.
When the final answer is wrong, determine whether the failure originated in retrieval (wrong documents), chunking (lost context), or generation (model hallucination). This attribution guides where to invest improvement effort. Annotate 50+ failure cases.
Profile the latency contribution of each pipeline stage: query processing, embedding, retrieval, reranking, and generation. Set per-stage budgets that sum to your user-facing SLA. Identify which stages are latency bottlenecks.
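Per-stage profiling needs little more than a timing context manager wrapped around each stage call. A minimal sketch:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(list)  # stage name -> list of durations in seconds

@contextmanager
def stage(name):
    # Wrap each pipeline stage to record its wall-clock duration.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name].append(time.perf_counter() - start)

def p50(samples):
    ordered = sorted(samples)
    return ordered[len(ordered) // 2]
```

Wrap calls like `with stage("vector_search"): ...` around each stage (the stage names here are placeholders), then compare each stage's p50/p99 against its share of the latency budget.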
Calculate the fully loaded cost per query including embedding calls, vector DB queries, reranker inference, and LLM generation tokens. Break down costs by query complexity tier. Identify the most expensive query patterns.
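The fully loaded cost is a per-component sum. A sketch with illustrative default rates; every price below is a made-up placeholder, so substitute your providers' actual pricing:

```python
def cost_per_query(embed_tokens, prompt_tokens, completion_tokens,
                   embed_rate=0.0001, prompt_rate=0.003, completion_rate=0.015,
                   vector_db_cost=0.0001, reranker_cost=0.0005):
    # Token rates are per 1K tokens and purely illustrative; the per-call
    # vector DB and reranker costs are likewise placeholders.
    return (embed_tokens / 1000 * embed_rate
            + prompt_tokens / 1000 * prompt_rate
            + completion_tokens / 1000 * completion_rate
            + vector_db_cost
            + reranker_cost)
```

Log the token counts per query so you can aggregate this by complexity tier rather than reporting a single blended average.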
Verify that your A/B testing infrastructure correctly measures quality differences between pipeline configurations. Ensure statistical significance calculations are correct and that you have sufficient sample sizes. Test with synthetic known-difference experiments.
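For binary outcomes like answer correctness, the standard check is a two-proportion z-test. A minimal sketch you can run against your A/B counts:

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    # z statistic for the difference in correct-answer rates between two
    # pipeline variants; |z| > 1.96 is roughly significant at the 5% level.
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se
```

Note in the test below that a 10-point accuracy gap on only 100 queries per arm is not significant; this is exactly the sample-size trap the checklist item warns about.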
Maintain a curated set of 100+ test cases that cover known failure modes, edge cases, and critical business queries. Run this suite before every deployment. Add new cases whenever a production issue is discovered.
Implement thumbs up/down feedback on RAG responses and correlate feedback with retrieval and generation metrics. User feedback is one of the most reliable quality signals you can collect. Achieve a feedback rate of at least 5% of queries.
If supporting multiple languages, test the full pipeline for each language independently. Retrieval quality, chunking behavior, and generation quality often vary dramatically across languages. Benchmark each language separately.
Test pipeline performance under concurrent load to identify bottlenecks that only appear at scale. Vector databases, rerankers, and LLM APIs all have different scaling characteristics. Profile at 10x, 50x, and 100x normal load.
Implement structured logging and dashboards that track retrieval scores, generation quality metrics, latency, and costs in real time. Set alerts for anomalies. Without observability, quality degrades silently in production.
Review the quality of documents in your knowledge base for accuracy, recency, and completeness. Garbage in, garbage out applies forcefully to RAG systems. Establish a document quality scoring rubric and audit 10% of your corpus quarterly.
Map user queries to knowledge base content to identify coverage gaps — topics users ask about that have no corresponding documents. Coverage gaps guarantee hallucinations or unhelpful responses. Analyze query logs monthly for gaps.
Track the age of documents in your knowledge base and flag stale content that may contain outdated information. Set freshness SLAs per document category. Automate alerts when documents exceed their freshness threshold.
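Freshness flagging is a simple age comparison once each document carries a category and a last-updated timestamp. A sketch; the category names and SLA windows below are hypothetical examples, not recommendations:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-category freshness SLAs; set these from your own content review.
FRESHNESS_SLA = {
    "pricing": timedelta(days=30),
    "policy": timedelta(days=90),
    "default": timedelta(days=365),
}

def stale_documents(docs, now=None):
    # docs: iterable of (doc_id, category, last_updated) with tz-aware datetimes.
    now = now or datetime.now(timezone.utc)
    sla = lambda cat: FRESHNESS_SLA.get(cat, FRESHNESS_SLA["default"])
    return [doc_id for doc_id, category, updated in docs
            if now - updated > sla(category)]
```

Run this on a schedule and wire the output into your alerting so stale content is a ticket, not a quarterly surprise.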
Test that your document processing pipeline (parsing, cleaning, chunking, embedding) handles all document formats correctly without data loss. Run a golden dataset of known documents through the pipeline and verify output integrity.
Identify and handle duplicate content that exists across multiple source documents. Duplicates waste index space and can cause contradictory retrieval when versions differ. Implement source-level deduplication in your ingestion pipeline.
Verify that the RAG system respects document-level access controls and does not surface restricted content to unauthorized users. Test with user personas that have different permission levels. This is a security-critical check.
Monitor for embedding quality degradation over time, especially after embedding model updates or large corpus changes. Compare retrieval quality on a stable test set before and after any embedding changes. Set drift thresholds.
Test your ability to roll back knowledge base content to a previous version if a bad update degrades quality. Version your chunks and embeddings alongside source documents. Verify rollback restores previous retrieval quality.
If your corpus includes images, diagrams, or charts, evaluate whether these are properly processed and made searchable. Text-only pipelines miss critical information in visual content. Test retrieval for visually dependent queries.
Test retrieval quality and latency as your corpus grows 2x, 5x, and 10x from current size. Some vector databases and chunking strategies degrade at scale. Plan capacity and architecture changes proactively.
Respan provides automated evaluation for every stage of your RAG pipeline — retrieval relevance, chunk quality, generation faithfulness, and end-to-end accuracy. Get component-level quality scores and pinpoint exactly where your pipeline needs improvement.
Try Respan free