Trace and evaluate a RAG pipeline
Set up Respan
- Sign up: create an account at platform.respan.ai
- Create an API key: generate one on the API keys page
- Add credits or a provider key: add credits on the Credits page, or connect your own provider key on the Integrations page
Overview
RAG is not a single step; it's a three-step pipeline. A user query flows through a routing decision, a retrieval step, and a generation step before producing a final answer. Each step has its own failure modes, its own levers for improvement, and its own evaluation criteria.
Treating RAG as a black box and measuring only the final answer makes the system nearly impossible to improve. A bad response could mean the LLM made the wrong routing decision, the retriever returned irrelevant context, or the generator ignored what was retrieved. The fix is different in each case, and you can't tell them apart from the output alone.
How Respan sees your RAG pipeline
Everything in Respan is built on spans. A span captures a single unit of work: its input, output, latency, and metadata. When your RAG pipeline runs, Respan captures it as three sequential spans and groups them as a single trace.
This trace structure matters because it makes the intermediate states visible. You can see exactly what routing decision the LLM made, exactly what context came back from retrieval, and exactly what the generator received before producing its answer.
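To make the idea concrete, the three-span trace can be sketched as plain data. The field names below are illustrative, not the Respan SDK's actual span schema:

```python
# A trace is an ordered list of spans; each span records one unit of work
# with its input and output. Field names here are assumptions for illustration.
trace = {
    "trace_id": "trace-001",
    "spans": [
        {"name": "llm_routing",    "input": "What is the refund policy?", "output": {"use_rag": True}},
        {"name": "rag_retrieval",  "input": "What is the refund policy?", "output": ["Refunds are issued within 30 days."]},
        {"name": "llm_generation", "input": {"query": "What is the refund policy?", "context": ["Refunds are issued within 30 days."]}, "output": "Refunds are issued within 30 days."},
    ],
}

# Every intermediate state is inspectable: the routing decision, the
# retrieved context, and exactly what the generator received.
for span in trace["spans"]:
    print(span["name"])
```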
Evaluation maps directly to spans
Each span in the trace can have multiple evaluators attached to it: an LLM judge, a human reviewer, or a code function. The mechanism is the same regardless of which step you're evaluating.
Trace your RAG pipeline
Instrument your pipeline with the Respan SDK and run your queries; traces are captured automatically and can be sampled into a dataset.
Set your environment variables:
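For example (the variable name `RESPAN_API_KEY` is an assumption; use whichever name the SDK documentation specifies, and substitute the key you generated on the API keys page):

```shell
# Hypothetical variable name; replace the value with your own API key.
export RESPAN_API_KEY="your-api-key-here"
```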
Integration examples are available for ChromaDB, Milvus, LlamaIndex, and LangChain.
Install dependencies:
Once running, open the Traces page and you'll see each request as a trace with child spans.
Each span's input and output are captured automatically; no manual logging is needed.
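The shape the tracer sees can be sketched as one function per span, each feeding the next. The `traced` decorator below is a stand-in for the Respan SDK's actual instrumentation API (which we don't reproduce here); only the structure is the point:

```python
import functools

SPANS = []  # stand-in trace buffer; the real SDK ships spans to Respan

def traced(name):
    """Stand-in for the SDK's span decorator: records each call's input/output."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            out = fn(*args, **kwargs)
            SPANS.append({"name": name, "input": args, "output": out})
            return out
        return inner
    return wrap

@traced("llm_routing")
def route(query):
    # Real version: an LLM decides whether the query needs retrieval.
    return {"use_rag": "policy" in query.lower()}

@traced("rag_retrieval")
def retrieve(query):
    # Real version: a vector store (ChromaDB, Milvus, ...) returns chunks.
    return ["Refunds are issued within 30 days of purchase."]

@traced("llm_generation")
def generate(query, context):
    # Real version: an LLM answers using only the retrieved context.
    return f"Based on policy: {context[0]}"

query = "What is the refund policy?"
if route(query)["use_rag"]:
    answer = generate(query, retrieve(query))

print([s["name"] for s in SPANS])  # one trace, three sequential spans
```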
To build a dataset from your traces, go to the Spans page, filter by workflow name and task name, then export to a dataset.
Build a dataset
The simplest dataset is end-to-end: the initial user query as input and the final answer as output.
Use this for regression testing or quick sanity checks when you don’t need to diagnose which step failed.
Span-level datasets
To evaluate individual steps, create a separate dataset for each span:
Routing (llm_routing span)
Retrieval (rag_retrieval span)
Generation (llm_generation span)
Filter by workflow name and task name on the Spans page, then export each span type as its own dataset.
Evaluate your RAG pipeline
With your data in Respan, evaluation has three steps: create a dataset from your traces, set up evaluators for each span, then run experiments to score them.
Step 1 - Create a dataset
Go to the Spans page and filter by workflow name and task name to isolate the spans you want to evaluate.
For example, to evaluate your retrieval step:
- Workflow name: `rag_pipeline`
- Task name: `rag_retrieval`
Apply any additional filters (date range, metadata, specific users), then export to a dataset.
Repeat for each step you want to evaluate. You’ll end up with up to three datasets, one per span.
Step 2 - Set up evaluators
Create an evaluator for each dataset. Respan supports three evaluator types:
- LLM evaluator: an LLM judges the span output against a rubric you define. Use `{{input}}` and `{{output}}` to reference the span's data in your prompt.
- Human evaluator: your team reviews and scores outputs manually.
- Code evaluator: a Python function checks the output deterministically.
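A code evaluator can be as simple as a deterministic format check. A sketch for the routing span; the function name and return shape are assumptions, so adapt them to whatever contract Respan's code evaluators expect:

```python
import json

def evaluate(span_input: str, span_output: str) -> dict:
    """PASS if the routing span's output is valid JSON with a boolean use_rag field."""
    try:
        decision = json.loads(span_output)
    except json.JSONDecodeError:
        return {"result": "FAIL", "reason": "output is not valid JSON"}
    if isinstance(decision.get("use_rag"), bool):
        return {"result": "PASS", "reason": ""}
    return {"result": "FAIL", "reason": "missing boolean use_rag field"}

print(evaluate("Any query", '{"use_rag": true}'))  # result: PASS
print(evaluate("Any query", "maybe?"))             # result: FAIL
```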
Here are starter prompts for each RAG eval step:
Routing eval (llm_routing span)
Retrieval eval (rag_retrieval span)
Grounding eval (llm_generation span)
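As a starting point, a grounding rubric might look like the template below. The wording is a suggestion, not Respan's built-in prompt; `{{input}}` and `{{output}}` are the placeholders described in the evaluator list above:

```python
# Suggested rubric for an LLM evaluator on the llm_generation span.
GROUNDING_RUBRIC = """\
You are evaluating whether an answer is grounded in retrieved context.

Query and retrieved context:
{{input}}

Generated answer:
{{output}}

Return PASS if every factual claim in the answer is supported by the
retrieved context. Return FAIL if the answer introduces claims the context
does not support, and quote the unsupported claim.
"""

# The placeholders are substituted by the evaluator at run time.
print("{{input}}" in GROUNDING_RUBRIC and "{{output}}" in GROUNDING_RUBRIC)
```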
Step 3 - Run an experiment
Select a dataset, attach your evaluators, and run. Each row gets a PASS/FAIL or a score, and Respan shows the aggregate pass rate across the dataset.
You can run multiple experiments to compare different prompt versions, models, retrieval configurations, or any other parameter.
Evaluating routing
You want to know: is the LLM calling RAG when it should, and skipping it when it shouldn’t?
- Select your `llm_routing` dataset
- Attach the `rag_routing_accuracy` and `unnecessary_retrieval` evaluators
- Run the experiment
A low `rag_routing_accuracy` score means the LLM is missing queries that need external knowledge; your tool description or system prompt likely needs to be more explicit about when to trigger RAG. A low `unnecessary_retrieval` score means the LLM is over-retrieving, adding latency and noise to queries that didn't need it.
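If you hand-label a sample of routing rows, both scores reduce to simple rates. A sketch; the row format is an assumption, and the metric names mirror the evaluators above:

```python
# Each row: (needed_rag: ground-truth label, used_rag: the routing span's decision)
rows = [
    (True,  True),   # correctly retrieved
    (True,  False),  # missed retrieval -> hurts rag_routing_accuracy
    (False, False),  # correctly skipped
    (False, True),   # over-retrieval -> hurts unnecessary_retrieval
]

needed     = [used for need, used in rows if need]
not_needed = [used for need, used in rows if not need]

# Share of queries that needed external knowledge and actually got retrieval.
rag_routing_accuracy = sum(needed) / len(needed)
# Share of queries that didn't need retrieval and correctly skipped it.
unnecessary_retrieval = sum(not used for used in not_needed) / len(not_needed)

print(rag_routing_accuracy, unnecessary_retrieval)  # 0.5 0.5
```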
Evaluating retrieval
You want to know: is the retriever returning context that’s actually relevant to the query?
- Select your `rag_retrieval` dataset
- Attach the `context_relevance` and `context_completeness` evaluators
- Run the experiment
A low `context_relevance` score means your retriever is returning the wrong documents; look at your embedding model, chunking strategy, or similarity threshold. A low `context_completeness` score means the right documents are there, but the answer requires information spread across more chunks than you're retrieving; try increasing Top-K.
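For a quick offline sanity check before running the LLM judge, a crude lexical proxy for relevance is the share of query terms that appear in each retrieved chunk. A sketch, not a substitute for the `context_relevance` evaluator:

```python
def relevance(query: str, chunk: str) -> float:
    """Fraction of query terms that appear in the chunk (crude lexical proxy)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

chunks = [
    "refunds are issued within 30 days of purchase",
    "our office is closed on public holidays",
]
query = "when are refunds issued"
scores = [relevance(query, ch) for ch in chunks]
print(scores)  # [0.75, 0.0] -- second chunk shares no terms with the query
```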
Evaluating grounding
You want to know: is the generator staying within the bounds of what was retrieved, or hallucinating beyond it?
- Select your `llm_generation` dataset
- Attach the `groundedness` and `context_utilization` evaluators
- Run the experiment
A low `groundedness` score means the model is introducing claims not supported by the retrieved context; tighten your system prompt to constrain the model to the provided documents. A low `context_utilization` score means the model is ignoring the retrieved context entirely and answering from parametric knowledge; check that the context is being correctly passed into the prompt.
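A deterministic complement to the LLM judge is to flag answer sentences with no content-word overlap with the retrieved context. This heuristic misses paraphrase, which is exactly what the LLM-based `groundedness` evaluator handles, but it catches blatant additions cheaply:

```python
import re

def ungrounded_sentences(answer: str, context: str) -> list[str]:
    """Return answer sentences that share no words with the retrieved context."""
    ctx_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = set(re.findall(r"[a-z0-9]+", sent.lower()))
        if words and not (words & ctx_words):
            flagged.append(sent)
    return flagged

context = "Refunds are issued within 30 days of purchase."
answer = "Refunds are issued within 30 days. Shipping is always free."
print(ungrounded_sentences(answer, context))  # ['Shipping is always free.']
```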
Evaluating end-to-end
You want a quick regression check across your whole pipeline without diagnosing individual steps.
- Select your end-to-end dataset (user query → final answer)
- Attach a general quality evaluator or a custom rubric
- Run the experiment
Use this to catch regressions after prompt changes or model upgrades. When scores drop, use the step-level datasets to diagnose which part of the pipeline broke.