Trace and evaluate a RAG pipeline
Set up Respan
- Sign up: create an account at platform.respan.ai
- Create an API key: generate one on the API keys page
- Add credits or a provider key: add credits on the Credits page, or connect your own provider key on the Integrations page
Overview
RAG is not a single step; it's a three-step pipeline. A user query flows through a routing decision, a retrieval step, and a generation step before producing a final answer. Each step has its own failure modes, its own levers for improvement, and its own evaluation criteria.
Treating RAG as a black box and measuring only the final answer makes the system nearly impossible to improve. A bad response could mean the LLM made the wrong routing decision, the retriever returned irrelevant context, or the generator ignored what was retrieved. The fix is different in each case, and you can't tell them apart from the output alone.
How Respan sees your RAG pipeline
Everything in Respan is built on spans. A span captures a single unit of work: its input, output, latency, and metadata. When your RAG pipeline runs, Respan captures it as three sequential spans and groups them as a single trace.
This trace structure matters because it makes the intermediate states visible. You can see exactly what routing decision the LLM made, exactly what context came back from retrieval, and exactly what the generator received before producing its answer.
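To make the idea concrete, the three-span trace can be sketched as plain data. The field names below are illustrative, not the Respan SDK's actual span schema:

```python
# A trace is an ordered list of spans; each span records one unit of work
# with its input and output. Field names here are assumptions for illustration.
trace = {
    "trace_id": "trace-001",
    "spans": [
        {"name": "llm_routing",    "input": "What is the refund policy?", "output": {"use_rag": True}},
        {"name": "rag_retrieval",  "input": "What is the refund policy?", "output": ["Refunds are issued within 30 days."]},
        {"name": "llm_generation", "input": {"query": "What is the refund policy?", "context": ["Refunds are issued within 30 days."]}, "output": "Refunds are issued within 30 days."},
    ],
}

# Every intermediate state is inspectable: the routing decision, the
# retrieved context, and exactly what the generator received.
for span in trace["spans"]:
    print(span["name"])
```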
Evaluation maps directly to spans
Each span in the trace can have multiple evaluators attached to it: an LLM judge, a human reviewer, or a code function. The mechanism is the same regardless of which step you're evaluating.
Trace your RAG pipeline
Instrument your pipeline with the Respan SDK and run your queries; traces are captured automatically and can be sampled into a dataset.
Set your environment variables:
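For example (the variable name `RESPAN_API_KEY` is an assumption; use whichever name the SDK documentation specifies, and substitute the key you generated on the API keys page):

```shell
# Hypothetical variable name; replace the value with your own API key.
export RESPAN_API_KEY="your-api-key-here"
```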
Integration examples are available for ChromaDB, Milvus, LlamaIndex, and LangChain.
Install dependencies:
Once running, open the Traces page and you'll see each request as a trace with child spans.
Each span's input and output are captured automatically; no manual logging is needed.
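The shape the tracer sees can be sketched as one function per span, each feeding the next. The `traced` decorator below is a stand-in for the Respan SDK's actual instrumentation API (which we don't reproduce here); only the structure is the point:

```python
import functools

SPANS = []  # stand-in trace buffer; the real SDK ships spans to Respan

def traced(name):
    """Stand-in for the SDK's span decorator: records each call's input/output."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            out = fn(*args, **kwargs)
            SPANS.append({"name": name, "input": args, "output": out})
            return out
        return inner
    return wrap

@traced("llm_routing")
def route(query):
    # Real version: an LLM decides whether the query needs retrieval.
    return {"use_rag": "policy" in query.lower()}

@traced("rag_retrieval")
def retrieve(query):
    # Real version: a vector store (ChromaDB, Milvus, ...) returns chunks.
    return ["Refunds are issued within 30 days of purchase."]

@traced("llm_generation")
def generate(query, context):
    # Real version: an LLM answers using only the retrieved context.
    return f"Based on policy: {context[0]}"

query = "What is the refund policy?"
if route(query)["use_rag"]:
    answer = generate(query, retrieve(query))

print([s["name"] for s in SPANS])  # one trace, three sequential spans
```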
To build a dataset from your traces, go to the Spans page, filter by workflow name and task name, then export to a dataset.
Build a dataset
The simplest dataset is end-to-end: the initial user query as input and the final answer as output.
Use this for regression testing or quick sanity checks when you don’t need to diagnose which step failed.
Span-level datasets
To evaluate individual steps, create a separate dataset for each span:
Routing (llm_routing span)
Retrieval (rag_retrieval span)
Generation (llm_generation span)
Filter by workflow name and task name on the Spans page, then export each span type as its own dataset.
Evaluate your RAG pipeline
With your data in Respan, evaluation has three steps: create a dataset from your traces, set up evaluators for each span, then run experiments to score them.
Step 1 - Create a dataset
Go to the Spans page and filter by workflow name and task name to isolate the spans you want to evaluate.
For example, to evaluate your retrieval step:
- Workflow name: `rag_pipeline`
- Task name: `rag_retrieval`
Apply any additional filters (date range, metadata, specific users), then export to a dataset.
Repeat for each step you want to evaluate. You’ll end up with up to three datasets, one per span.
Step 2 - Set up evaluators
Create an evaluator for each dataset. Respan supports three evaluator types:
- LLM evaluator: an LLM judges the span output against a rubric you define. Use `{{input}}` and `{{output}}` to reference the span's data in your prompt.
- Human evaluator: your team reviews and scores outputs manually.
- Code evaluator: a Python function checks the output deterministically.
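A code evaluator can be as simple as a deterministic format check. A sketch for the routing span; the function name and return shape are assumptions, so adapt them to whatever contract Respan's code evaluators expect:

```python
import json

def evaluate(span_input: str, span_output: str) -> dict:
    """PASS if the routing span's output is valid JSON with a boolean use_rag field."""
    try:
        decision = json.loads(span_output)
    except json.JSONDecodeError:
        return {"result": "FAIL", "reason": "output is not valid JSON"}
    if isinstance(decision.get("use_rag"), bool):
        return {"result": "PASS", "reason": ""}
    return {"result": "FAIL", "reason": "missing boolean use_rag field"}

print(evaluate("Any query", '{"use_rag": true}'))  # result: PASS
print(evaluate("Any query", "maybe?"))             # result: FAIL
```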
Here are starter prompts for each RAG eval step:
Routing eval (llm_routing span)
Retrieval eval (rag_retrieval span)
Grounding eval (llm_generation span)
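As a starting point, a grounding rubric might look like the template below. The wording is a suggestion, not Respan's built-in prompt; `{{input}}` and `{{output}}` are the placeholders described in the evaluator list above:

```python
# Suggested rubric for an LLM evaluator on the llm_generation span.
GROUNDING_RUBRIC = """\
You are evaluating whether an answer is grounded in retrieved context.

Query and retrieved context:
{{input}}

Generated answer:
{{output}}

Return PASS if every factual claim in the answer is supported by the
retrieved context. Return FAIL if the answer introduces claims the context
does not support, and quote the unsupported claim.
"""

# The placeholders are substituted by the evaluator at run time.
print("{{input}}" in GROUNDING_RUBRIC and "{{output}}" in GROUNDING_RUBRIC)
```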
Step 3 - Run an experiment
Select a dataset, attach your evaluators, and run. Each row gets a PASS/FAIL or a score, and Respan shows the aggregate pass rate across the dataset.
You can run multiple experiments to compare different prompt versions, models, retrieval configurations, or any other parameter.
Evaluating routing
You want to know: is the LLM calling RAG when it should, and skipping it when it shouldn’t?
- Select your `llm_routing` dataset
- Attach the `rag_routing_accuracy` and `unnecessary_retrieval` evaluators
- Run the experiment
A low `rag_routing_accuracy` score means the LLM is missing queries that need external knowledge; your tool description or system prompt likely needs to be more explicit about when to trigger RAG. A low `unnecessary_retrieval` score means the LLM is over-retrieving, adding latency and noise to queries that didn't need it.
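If you hand-label a sample of routing rows, both scores reduce to simple rates. A sketch; the row format is an assumption, and the metric names mirror the evaluators above:

```python
# Each row: (needed_rag: ground-truth label, used_rag: the routing span's decision)
rows = [
    (True,  True),   # correctly retrieved
    (True,  False),  # missed retrieval -> hurts rag_routing_accuracy
    (False, False),  # correctly skipped
    (False, True),   # over-retrieval -> hurts unnecessary_retrieval
]

needed     = [used for need, used in rows if need]
not_needed = [used for need, used in rows if not need]

# Share of queries that needed external knowledge and actually got retrieval.
rag_routing_accuracy = sum(needed) / len(needed)
# Share of queries that didn't need retrieval and correctly skipped it.
unnecessary_retrieval = sum(not used for used in not_needed) / len(not_needed)

print(rag_routing_accuracy, unnecessary_retrieval)  # 0.5 0.5
```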
Evaluating retrieval
You want to know: is the retriever returning context that’s actually relevant to the query?
- Select your `rag_retrieval` dataset
- Attach the `context_relevance` and `context_completeness` evaluators
- Run the experiment
A low `context_relevance` score means your retriever is returning the wrong documents; look at your embedding model, chunking strategy, or similarity threshold. A low `context_completeness` score means the right documents are there, but the answer requires information spread across more chunks than you're retrieving; try increasing Top-K.
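For a quick offline sanity check before running the LLM judge, a crude lexical proxy for relevance is the share of query terms that appear in each retrieved chunk. A sketch, not a substitute for the `context_relevance` evaluator:

```python
def relevance(query: str, chunk: str) -> float:
    """Fraction of query terms that appear in the chunk (crude lexical proxy)."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

chunks = [
    "refunds are issued within 30 days of purchase",
    "our office is closed on public holidays",
]
query = "when are refunds issued"
scores = [relevance(query, ch) for ch in chunks]
print(scores)  # [0.75, 0.0] -- second chunk shares no terms with the query
```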
Evaluating grounding
You want to know: is the generator staying within the bounds of what was retrieved, or hallucinating beyond it?
- Select your `llm_generation` dataset
- Attach the `groundedness` and `context_utilization` evaluators
- Run the experiment
A low `groundedness` score means the model is introducing claims not supported by the retrieved context; tighten your system prompt to constrain the model to the provided documents. A low `context_utilization` score means the model is ignoring the retrieved context entirely and answering from parametric knowledge; check that the context is being correctly passed into the prompt.
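A deterministic complement to the LLM judge is to flag answer sentences with no content-word overlap with the retrieved context. This heuristic misses paraphrase, which is exactly what the LLM-based `groundedness` evaluator handles, but it catches blatant additions cheaply:

```python
import re

def ungrounded_sentences(answer: str, context: str) -> list[str]:
    """Return answer sentences that share no words with the retrieved context."""
    ctx_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = set(re.findall(r"[a-z0-9]+", sent.lower()))
        if words and not (words & ctx_words):
            flagged.append(sent)
    return flagged

context = "Refunds are issued within 30 days of purchase."
answer = "Refunds are issued within 30 days. Shipping is always free."
print(ungrounded_sentences(answer, context))  # ['Shipping is always free.']
```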
Evaluating end-to-end
You want a quick regression check across your whole pipeline without diagnosing individual steps.
- Select your end-to-end dataset (user query → final answer)
- Attach a general quality evaluator or a custom rubric
- Run the experiment
Use this to catch regressions after prompt changes or model upgrades. When scores drop, use the step-level datasets to diagnose which part of the pipeline broke.