Trace and evaluate a RAG pipeline

  1. Sign up - Create an account at platform.respan.ai
  2. Create an API key - Generate one on the API keys page
  3. Add credits or a provider key - Add credits on the Credits page or connect your own provider key on the Integrations page

Overview

RAG is not a single step; it’s a three-step pipeline. A user query flows through a routing decision, a retrieval step, and a generation step before producing a final answer. Each step has its own failure modes, its own levers for improvement, and its own evaluation criteria.

Treating RAG as a black box and measuring only the final answer makes the system nearly impossible to improve. A bad response could mean the LLM made the wrong routing decision, the retriever returned irrelevant context, or the generator ignored what was retrieved. The fix is different in each case, and you can’t tell the cases apart from the output alone.

How Respan sees your RAG pipeline

Everything in Respan is built on spans. A span captures a single unit of work: its input, output, latency, and metadata. When your RAG pipeline runs, Respan captures it as three sequential spans and groups them as a single trace:

trace: rag_pipeline
├── span: llm_routing input: user query output: tool call decision
├── span: rag_retrieval input: user query output: retrieved context
└── span: llm_generation input: query + context output: final answer

This trace structure matters because it makes the intermediate states visible. You can see exactly what routing decision the LLM made, exactly what context came back from retrieval, and exactly what the generator received before producing its answer.
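Conceptually, each captured span is a record of one unit of work. A simplified sketch of what a retrieval span might carry (field names here are illustrative, not Respan’s actual schema):

```python
# Illustrative only: a simplified span record, NOT Respan's real wire format.
span = {
    "trace": "rag_pipeline",          # the trace this span belongs to
    "name": "rag_retrieval",          # which step of the pipeline ran
    "input": "How does Respan handle tracing?",
    "output": ["Traces organize logs into hierarchical workflows..."],
    "latency_ms": 142,                # how long the step took
    "metadata": {"top_k": 3},         # step-specific details
}

print(span["name"], span["latency_ms"])
```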

Evaluation maps directly to spans

Each span in the trace can have multiple evaluators attached to it: an LLM judge, a human reviewer, or a code function. The mechanism is the same regardless of which step you’re evaluating.

| Span | Evaluator | Description |
| --- | --- | --- |
| llm_routing | rag_routing_accuracy | Did the LLM correctly decide to call the RAG tool for this query? |
| llm_routing | unnecessary_retrieval | Did the LLM call RAG when the answer didn’t require external knowledge? |
| rag_retrieval | context_relevance | Is the retrieved context relevant to the user query? |
| rag_retrieval | context_completeness | Does the retrieved context contain enough information to answer the query? |
| llm_generation | groundedness | Does the answer stay within the bounds of the retrieved context? |
| llm_generation | context_utilization | Did the LLM actually use the retrieved context, or ignore it? |

Trace your RAG pipeline

Instrument your pipeline with the Respan SDK and run your queries; traces are automatically captured and can be sampled into a dataset.

Set your environment variables:

export RESPAN_API_KEY="your_respan_api_key"
export OPENAI_API_KEY="your_openai_api_key"

Install dependencies:

pip install respan-tracing openai chromadb

from openai import OpenAI
from respan_tracing import RespanTelemetry
from respan_tracing.decorators import workflow, task
import chromadb

# Initialize
telemetry = RespanTelemetry()
client = OpenAI()
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")

# Seed some documents
collection.add(
    documents=[
        "Respan supports 250+ LLM models through a unified API gateway.",
        "Traces organize logs into hierarchical workflows with parent-child spans.",
        "Evaluators can be LLM-based, code-based, or human reviewers.",
        "Automations run evaluators on production logs in real-time.",
    ],
    ids=["doc1", "doc2", "doc3", "doc4"],
)


@task(name="llm_routing")
def route_query(query: str):
    """Decide whether to call the RAG tool for this query."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a routing agent. Decide if this query requires "
                    "looking up external knowledge. If it does, call the "
                    "retrieve_context tool. Otherwise, respond directly."
                ),
            },
            {"role": "user", "content": query},
        ],
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "retrieve_context",
                    "description": "Search the knowledge base for relevant information",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "The search query",
                            }
                        },
                        "required": ["query"],
                    },
                },
            }
        ],
    )
    return completion.choices[0].message


@task(name="rag_retrieval")
def retrieve_context(query: str, top_k: int = 3):
    """Retrieve relevant documents from ChromaDB."""
    results = collection.query(query_texts=[query], n_results=top_k)
    return results["documents"][0]


@task(name="llm_generation")
def generate_answer(query: str, context: list[str]):
    """Generate an answer using retrieved context."""
    context_str = "\n".join(f"- {doc}" for doc in context)

    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": f"Answer based on this context:\n{context_str}",
            },
            {"role": "user", "content": query},
        ],
    )
    return completion.choices[0].message.content


@workflow(name="rag_pipeline")
def rag_pipeline(query: str):
    """Full RAG pipeline: route → retrieve → generate."""
    routing = route_query(query)

    if routing.tool_calls:
        context = retrieve_context(query)
    else:
        context = []

    answer = generate_answer(query, context)
    return answer


# Run it
result = rag_pipeline("How does Respan handle tracing?")
print(result)

Once running, open the Traces page and you’ll see each request as a trace with child spans:

trace: rag_pipeline
├── span: llm_routing input: user query output: tool call decision
├── span: rag_retrieval input: user query output: retrieved context
└── span: llm_generation input: query + context output: final answer

Each span’s input and output are captured automatically; no manual logging is needed.

To build a dataset from your traces, go to the Spans page, filter by workflow name and task name, then export to a dataset.

Build a dataset

The simplest dataset is end-to-end: the initial user query as input and the final answer as output.

| input | output |
| --- | --- |
| user query | final answer |

Use this for regression testing or quick sanity checks when you don’t need to diagnose which step failed.

To evaluate individual steps, create a separate dataset for each span:

Routing (llm_routing span)

| input | output |
| --- | --- |
| user query | tool call decision (did it call RAG?) |

Retrieval (rag_retrieval span)

| input | output |
| --- | --- |
| user query | retrieved context |

Generation (llm_generation span)

| input | output |
| --- | --- |
| user query + retrieved context | final answer |

Filter by workflow name and task name on the Spans page, then export each span type as its own dataset.
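As a rough illustration, a row in the generation dataset pairs the query plus its retrieved context with the final answer. The field names below are examples only; the real export format comes from the Spans page:

```python
# Hypothetical dataset rows for the llm_generation span. Treat the field
# names as illustrative, not Respan's actual export schema.
generation_rows = [
    {
        "input": {
            "query": "How does Respan handle tracing?",
            "context": ["Traces organize logs into hierarchical workflows."],
        },
        "output": "Respan groups logs into traces made of parent-child spans.",
    },
]

print(len(generation_rows), "row(s)")
```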


Evaluate your RAG pipeline

With your data in Respan, evaluation has three steps: create a dataset from your traces, set up evaluators for each span, then run experiments to score them.

Step 1 - Create a dataset

Go to the Spans page and filter by workflow name and task name to isolate the spans you want to evaluate.

For example, to evaluate your retrieval step:

  • Workflow name: rag_pipeline
  • Task name: rag_retrieval

Apply any additional filters (date range, metadata, specific users), then export to a dataset.

Repeat for each step you want to evaluate. You’ll end up with up to three datasets, one per span.

Step 2 - Set up evaluators

Create an evaluator for each dataset. Respan supports three evaluator types:

  • LLM evaluator: an LLM judges the span output against a rubric you define. Use {{input}} and {{output}} to reference the span’s data in your prompt.
  • Human evaluator: your team reviews and scores outputs manually.
  • Code evaluator: a Python function checks the output deterministically.
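To make the code evaluator concrete, here is a minimal sketch of a deterministic check, assuming a plain function signature (the actual Respan code-evaluator interface may differ): it passes only if the answer shares enough vocabulary with the retrieved context, a crude groundedness proxy.

```python
def _words(text: str) -> set[str]:
    """Lowercase and strip trailing punctuation so word overlap is fair."""
    return {w.strip(".,!?").lower() for w in text.split()}

def context_overlap_evaluator(output: str, context: list[str]) -> str:
    """Hypothetical code evaluator: PASS if at least half of the answer's
    words also appear in the retrieved context (a crude groundedness proxy)."""
    context_words = _words(" ".join(context))
    answer_words = _words(output)
    overlap = len(answer_words & context_words) / max(len(answer_words), 1)
    return "PASS" if overlap >= 0.5 else "FAIL"

print(context_overlap_evaluator(
    "Traces organize logs into hierarchical workflows.",
    ["Traces organize logs into hierarchical workflows with parent-child spans."],
))  # → PASS
```

A real check would use something stronger (sentence-level entailment, citation matching), but the shape is the same: a pure function over the span’s input and output that returns a verdict.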

Here are starter prompts for each RAG eval step:

Routing (rag_routing_accuracy):

Given this user query:
{{input}}
The LLM responded with:
{{output}}
Did the LLM correctly decide whether to call the RAG tool?
A query that requires external or factual knowledge should trigger RAG.
A query that can be answered from general knowledge should not.
Return PASS if the routing decision was correct, FAIL if not.
Explain your reasoning.

Retrieval (context_relevance):

Given this user query:
{{input}}
The retriever returned this context:
{{output}}
Is the retrieved context relevant and sufficient to answer the query?
Return PASS if the context is relevant and complete, FAIL if it is
irrelevant, incomplete, or missing key information.
Explain your reasoning.

Grounding (groundedness):

Given this input:
{{input}}
The model produced this answer:
{{output}}
Does the answer stay within the bounds of the retrieved context?
Flag any claims that are not supported by or contradict the context.
Return PASS if the answer is fully grounded, FAIL if it introduces
information not present in the context.
Explain your reasoning.

Step 3 - Run an experiment

Select a dataset, attach your evaluators, and run. Each row gets a PASS/FAIL or a score, and Respan shows the aggregate pass rate across the dataset.
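The aggregate pass rate is simply the fraction of rows whose evaluator returned PASS:

```python
# One verdict per dataset row, as returned by a PASS/FAIL evaluator.
results = ["PASS", "PASS", "FAIL", "PASS"]
pass_rate = results.count("PASS") / len(results)
print(f"pass rate: {pass_rate:.0%}")  # → pass rate: 75%
```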

You can run multiple experiments to compare different prompt versions, models, retrieval configurations, or any other parameter.

Evaluating routing

You want to know: is the LLM calling RAG when it should, and skipping it when it shouldn’t?

  1. Select your llm_routing dataset
  2. Attach the rag_routing_accuracy and unnecessary_retrieval evaluators
  3. Run the experiment

A low rag_routing_accuracy score means the LLM is missing queries that need external knowledge; your tool description or system prompt likely needs to be more explicit about when to trigger RAG. A low unnecessary_retrieval score means the LLM is over-retrieving, adding latency and noise to queries that didn’t need it.

Evaluating retrieval

You want to know: is the retriever returning context that’s actually relevant to the query?

  1. Select your rag_retrieval dataset
  2. Attach the context_relevance and context_completeness evaluators
  3. Run the experiment

A low context_relevance score means your retriever is returning the wrong documents; look at your embedding model, chunking strategy, or similarity threshold. A low context_completeness score means the right documents are there, but the answer requires information spread across more chunks than you’re retrieving; try increasing top_k.
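To see the completeness failure mode in isolation, here is a toy sketch in plain Python (no Respan or ChromaDB; naive word overlap stands in for embedding search). A query whose answer spans two documents is incomplete at top_k=1 but complete at top_k=2:

```python
# Toy sketch: why a too-small top_k hurts context_completeness when the
# answer is spread across several chunks.
docs = {
    "doc1": "Respan traces are grouped into a single workflow of spans.",
    "doc2": "Each span records its input, output, and latency.",
    "doc3": "Credits can be added on the Credits page.",
}

def retrieve(query: str, top_k: int) -> list[str]:
    """Rank documents by naive word overlap (a stand-in for embedding search)."""
    q = set(query.lower().split())
    ranked = sorted(
        docs.values(),
        key=lambda d: -len(q & set(d.lower().rstrip(".").split())),
    )
    return ranked[:top_k]

# Answering this query needs facts from BOTH doc1 and doc2.
query = "what does a span record and how are spans grouped"
for top_k in (1, 2):
    ctx = retrieve(query, top_k)
    complete = any("group" in d for d in ctx) and any("records" in d for d in ctx)
    print(f"top_k={top_k}: complete={complete}")
```

At top_k=1 only the grouping document comes back, so the latency fact is missing; at top_k=2 both documents are retrieved and the answer can be complete.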

Evaluating grounding

You want to know: is the generator staying within the bounds of what was retrieved, or hallucinating beyond it?

  1. Select your llm_generation dataset
  2. Attach the groundedness and context_utilization evaluators
  3. Run the experiment

A low groundedness score means the model is introducing claims not supported by the retrieved context; tighten your system prompt to constrain the model to the provided documents. A low context_utilization score means the model is ignoring the retrieved context entirely and answering from parametric knowledge; check that the context is being passed into the prompt correctly.
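As one illustration of "tighten your system prompt", the wording below is a possible starting point (illustrative, not a prescribed prompt; tune it against your groundedness scores):

```python
# Illustrative stricter system prompt for the generation step. The exact
# wording is an assumption to iterate on, not a recommended constant.
GROUNDED_SYSTEM_PROMPT = (
    "Answer ONLY from the context below. If the context does not contain "
    "the answer, reply exactly: \"I don't know based on the provided "
    "documents.\" Never use outside knowledge.\n\nContext:\n{context}"
)

prompt = GROUNDED_SYSTEM_PROMPT.format(context="- Respan supports 250+ LLM models.")
print(prompt)
```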

Evaluating end-to-end

You want a quick regression check across your whole pipeline without diagnosing individual steps.

  1. Select your end-to-end dataset (user query → final answer)
  2. Attach a general quality evaluator or a custom rubric
  3. Run the experiment

Use this to catch regressions after prompt changes or model upgrades. When scores drop, use the step-level datasets to diagnose which part of the pipeline broke.

Next steps