Concepts | Respan Docs

What is evaluation?

Evaluation in Respan lets you measure and improve the quality of your LLM and agent outputs. The system supports two modes:

Offline evaluation: build datasets, run experiments against different prompt versions or models, and compare scores.
Online evaluation: run evaluators on live production traffic to score outputs and catch regressions in real time.

Scores

A score is the output of an evaluation. It is attached to a specific span and represents a quality assessment for that LLM request.

Score value types

Value type	Field	Data type	Example
`numerical`	`numerical_value`	number	`4.5`
`boolean`	`boolean_value`	boolean	`true`
`categorical`	`categorical_value`	array of strings	`["excellent", "coherent"]`
`comment`	`string_value`	string	`"Good response quality"`

Numerical scores are best for ratings and quality metrics. The range is defined by the grader’s min_score and max_score.

Boolean scores work for pass/fail evaluations like content safety checks.

Categorical scores handle multi-choice classifications from predefined choices.

Comment scores capture qualitative feedback as free-form text.

Key fields

log_id: links the score to its span
evaluator_id or evaluator_slug: identifies which evaluator produced the score
is_passed: whether the evaluation passed defined criteria

Each evaluator can produce only one score per span.
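To make the mapping between value types and fields concrete, here is a minimal sketch that assembles a score as a plain Python dict. The field names come from the table and key-fields list above. The `build_score` helper, the dict shape, and the example IDs are assumptions for illustration, not the Respan API.

```python
from typing import Any, Dict, Optional

# Value type -> field name, as listed in the score value types table.
VALUE_FIELDS = {
    "numerical": "numerical_value",
    "boolean": "boolean_value",
    "categorical": "categorical_value",
    "comment": "string_value",
}


def build_score(
    log_id: str,
    evaluator_slug: str,
    value_type: str,
    value: Any,
    is_passed: Optional[bool] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble one score record for one span."""
    if value_type not in VALUE_FIELDS:
        raise ValueError(f"unknown value type: {value_type!r}")
    score = {
        "log_id": log_id,                  # links the score to its span
        "evaluator_slug": evaluator_slug,  # which evaluator produced it
        VALUE_FIELDS[value_type]: value,   # exactly one value field is set
    }
    if is_passed is not None:
        score["is_passed"] = is_passed
    return score


# Two different evaluators scoring the same span (IDs are made up).
rating = build_score("log_123", "helpfulness", "numerical", 4.5, is_passed=True)
labels = build_score("log_123", "tone", "categorical", ["excellent", "coherent"])
```

Storing the value under a field chosen by its type keeps each record unambiguous: a reader can tell how to interpret a score from which field is present.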

Graders and evaluators

Respan splits scoring into two layers:

A grader is a single scoring unit. It answers one question about an output. A grader can be an LLM judge (a model scores against a definition you write), a code grader (a main(eval_inputs) Python function returns a score), or a human grader (a person reviews in the platform).
An evaluator is the workflow that runs one or more graders, optionally routes between them with conditions, combines their scores, and produces one final score per output.
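A code grader could look roughly like the sketch below. Only the `main(eval_inputs)` entry point comes from the description above. The shape of `eval_inputs` (a dict with `output` and `expected_output` keys) and the boolean return value are assumptions, so check the Evaluators reference for the actual contract.

```python
def main(eval_inputs):
    """Code grader sketch: case-insensitive exact match against the reference.

    Assumption: eval_inputs is a dict with "output" and "expected_output" keys.
    """
    output = str(eval_inputs.get("output", "")).strip().lower()
    expected = str(eval_inputs.get("expected_output", "")).strip().lower()
    # Fail when there is no reference answer to compare against.
    return expected != "" and output == expected
```

This grader answers a single question (does the output match the reference?), which is the scope the text assigns to a grader. Combining it with other checks is the evaluator's job.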

A simple evaluator is a single grader. A more advanced one chains graders with conditions, compute blocks (averages, weighted scores), and metrics. Evaluators are versioned: you can test, deploy, and roll back.
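As an illustration of what a weighted-score compute block does conceptually, the following sketch combines several grader scores into one final score. The function name, grader names, and weights are hypothetical, and Respan's own compute blocks may be configured differently.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-grader scores into one value using a weighted average."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Accuracy counts three times as much as tone in this example.
final = weighted_score(
    scores={"accuracy": 4.0, "tone": 2.0},
    weights={"accuracy": 3.0, "tone": 1.0},
)
# (4.0 * 3.0 + 2.0 * 1.0) / 4.0 = 3.5
```

A plain average is the special case in which every weight is equal.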

See Evaluators for grader setup, the block reference, and step-by-step walkthroughs.

Datasets

A dataset is a table of test cases for your experiments. Each row has three fields that matter for evaluation: input (the test case, required), expected_output (the reference answer, optional), and output (a response already produced for the input, optional).

Which fields you need depends on the experiment Task type. Prompt and Model generate fresh outputs from each row’s input, so they only need input. Dataset outputs scores the output already stored on each row, so each row also needs output.
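The field requirements per Task type can be sketched as a small check over dataset rows. The task type identifiers (`prompt`, `model`, `dataset_outputs`) and the `usable_rows` helper are hypothetical names chosen for this example. Only the field names come from the text.

```python
from typing import Dict, List

# Fields each task type needs on a row (identifiers are illustrative).
REQUIRED_FIELDS = {
    "prompt": ["input"],
    "model": ["input"],
    "dataset_outputs": ["input", "output"],
}


def usable_rows(rows: List[Dict[str, str]], task_type: str) -> List[Dict[str, str]]:
    """Return the rows that carry every field the task type needs."""
    required = REQUIRED_FIELDS[task_type]
    return [row for row in rows if all(row.get(field) for field in required)]


rows = [
    {"input": "Capital of France?", "expected_output": "Paris", "output": "Paris"},
    {"input": "Capital of Peru?", "expected_output": "Lima"},
]
```

With these rows, both are usable for a Prompt or Model task, but only the first can be scored by a Dataset outputs task because the second has no stored output.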

You can build a dataset in two ways:

From production data: sample spans from your traced data using filters. Go to Datasets, click Create dataset, and use Insert by sampling to pull spans matching your criteria.

From CSV: use Insert from CSV to upload a CSV file and map each column to a dataset field such as input, expected_output, or output.
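The column mapping can be pictured with a short local sketch that uses Python's standard `csv` module. It only illustrates the idea of mapping CSV columns to dataset fields. The column names (`question`, `gold_answer`, `model_answer`) and the `rows_from_csv` helper are made up, and this is not how the platform's import is implemented.

```python
import csv
import io

# CSV column -> dataset field (column names are hypothetical).
COLUMN_MAP = {
    "question": "input",
    "gold_answer": "expected_output",
    "model_answer": "output",
}


def rows_from_csv(text, column_map):
    """Turn CSV text into dataset rows, skipping empty optional cells."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {
            field: record[column]
            for column, field in column_map.items()
            if record.get(column)
        }
        if "input" not in row:
            raise ValueError("every row needs an input")
        rows.append(row)
    return rows


csv_text = (
    "question,gold_answer,model_answer\n"
    "Capital of France?,Paris,Paris\n"
    "Capital of Peru?,Lima,\n"
)
dataset_rows = rows_from_csv(csv_text, COLUMN_MAP)
```

The second row has an empty `model_answer` cell, so it ends up without an output field, which is allowed because output is optional.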

See Datasets for details.

Experiments

An experiment is an offline evaluation run. For Prompt and Model tasks, it generates outputs by running every row in your dataset through a prompt or model; for Dataset outputs tasks, it uses the output already stored on each row. It then runs the evaluator workflow over each output to produce scores and aggregates the results.

You can use experiments to:

Compare different prompt versions on the same dataset
Compare different models on the same task
Test configuration changes (temperature, system prompt, etc.)

Results show per-row scores and aggregate metrics so you can pick the best configuration before deploying.
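Conceptually, an experiment that generates fresh outputs is a loop over dataset rows followed by an aggregation step. The sketch below mimics that flow locally with stand-in functions. `run_experiment`, the lookup-table "configurations", and the exact-match scorer are all hypothetical and do not call any Respan API.

```python
from statistics import mean


def run_experiment(dataset, generate, evaluate):
    """Generate an output per row, score it, and aggregate the scores."""
    per_row = []
    for row in dataset:
        output = generate(row["input"])
        score = evaluate(output, row.get("expected_output"))
        per_row.append({"input": row["input"], "output": output, "score": score})
    # mean() raises StatisticsError on an empty dataset.
    return {"rows": per_row, "mean_score": mean(r["score"] for r in per_row)}


def exact_match(output, expected):
    """Stand-in evaluator: 1.0 on exact match, 0.0 otherwise."""
    return 1.0 if output == expected else 0.0


dataset = [
    {"input": "2+2", "expected_output": "4"},
    {"input": "3+3", "expected_output": "6"},
]

# Two stand-in configurations, modeled as lookup tables instead of real models.
config_a = {"2+2": "4", "3+3": "6"}.get
config_b = {"2+2": "4", "3+3": "7"}.get

result_a = run_experiment(dataset, config_a, exact_match)
result_b = run_experiment(dataset, config_b, exact_match)
```

Here `result_a["mean_score"]` is 1.0 and `result_b["mean_score"]` is 0.5, so configuration A would be the one to deploy. The per-row entries show which input configuration B got wrong.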

See Experiments for the full guide.

Online evals

Online evaluation runs evaluators on live production traffic. Instead of building a dataset and running experiments manually, you attach evaluators to incoming spans and score them automatically as they arrive.

Use online evals to:

Monitor output quality in real time
Catch regressions after prompt or model changes
Flag low-scoring outputs for human review
Get notified via webhooks when scores drop below thresholds
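The idea of alerting when scores drop below a threshold can be sketched with a rolling average over recent scores. The `ScoreMonitor` class, its window size, and the threshold are assumptions for illustration. In Respan, thresholds and webhook notifications are configured in the platform rather than in code like this.

```python
from collections import deque


class ScoreMonitor:
    """Track recent numerical scores and flag a drop in the rolling mean."""

    def __init__(self, threshold, window=50):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # keeps only the last `window` scores

    def record(self, score):
        """Add a score; return True when the rolling mean is below the threshold."""
        self.recent.append(score)
        return sum(self.recent) / len(self.recent) < self.threshold


monitor = ScoreMonitor(threshold=3.0, window=3)
alerts = [monitor.record(score) for score in [5.0, 2.0, 1.0, 1.0]]
# Rolling means: 5.0, 3.5, about 2.67, about 1.33
```

Averaging over a window rather than reacting to a single low score reduces noise: one bad output does not trigger an alert, but a sustained drop does.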

See Online evals for setup.

How it fits together

Online: production data goes directly to evaluators, which produce scores in real time. Scores feed into monitoring and notifications so quality issues surface immediately.
Offline: you sample production spans into datasets (or upload a CSV) and run experiments against different prompt versions or models. Evaluators score the outputs, and you compare the results to pick the best configuration.