Evaluators

Set up LLM, code, and human evaluators to score your AI outputs.

Before you begin:
  1. Sign up — Create an account at platform.respan.ai
  2. Create an API key — Generate one on the API keys page
  3. Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page

Add the Docs MCP to your AI coding tool to get help building with Respan. No API key needed.

{
  "mcpServers": {
    "respan-docs": {
      "url": "https://mcp.respan.ai/mcp/docs"
    }
  }
}

Evaluators score your LLM outputs — automatically with an LLM judge, programmatically with code, or manually with human review. Create them on Respan, then trigger them from code, the gateway, or experiments.

This is a beta feature.
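For example, triggering an evaluator from code could look roughly like the sketch below. The endpoint URL, payload field names, and headers are assumptions for illustration only, not a documented Respan API.

```python
import json
import urllib.request

# Hypothetical payload for triggering an evaluator run.
# The field names ("evaluator_id", "input", "output") are
# assumptions, not Respan's actual schema.
payload = {
    "evaluator_id": "my-evaluator",
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
}

def trigger_evaluation(api_key: str) -> dict:
    """Send the run to a hypothetical evaluation endpoint."""
    req = urllib.request.Request(
        "https://api.respan.ai/v1/evaluations",  # assumed URL
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same evaluator can also be attached to gateway traffic or experiment runs without any code, as described below.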

Set up a grader

Within an evaluator, graders are the individual metrics you score. Create a grader from the GRADERS section in the evaluator builder.

A grader starts with a name, an optional description, an output data type, a score range, and a passing score. The description is especially useful for human review because annotators can use it to understand what the metric measures and how it should be judged.

The output data type, score range, and passing score define the shared scoring contract for that grader. In most cases, these stay the same no matter how the grader is evaluated.
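As an illustration, the scoring contract can be thought of as a small record plus a pass check. The field names below are illustrative, not Respan's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GraderContract:
    """A sketch of a grader's shared scoring contract."""
    name: str
    output_type: str      # e.g. "number"
    score_min: float
    score_max: float
    passing_score: float

    def passes(self, score: float) -> bool:
        """A score passes when it meets the passing threshold."""
        if not (self.score_min <= score <= self.score_max):
            raise ValueError("score outside the grader's range")
        return score >= self.passing_score

# A grader scored 0-5 that passes at 3.5, however it is evaluated.
helpfulness = GraderContract("helpfulness", "number", 0, 5, 3.5)
print(helpfulness.passes(4.0))  # → True
```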

A single grader can include both an LLM evaluation config and a Code evaluation config at the same time. During an evaluation run, Respan loads the config that matches how the grader is assigned, so the same grader can be used in human, LLM, or code evaluation workflows.

To edit a grader, click the pencil icon on the active grader block.

LLM evaluation

LLM evaluation lets a grader score outputs automatically with a judge model. This config lives inside the grader and uses the grader’s data type, score range, and passing score.

1. Write the definition

Use the Definition field to describe what the grader should measure and how the model should score it.

The definition must include {{output}}. You can also reference these optional variables:

Variable             Description
{{input}}            The original input sent to the application
{{output}}           The generated output being graded
{{expected_output}}  The reference or expected output, when provided
{{metadata}}         Custom metadata attached to the run
{{metrics}}          System metrics captured during the run

Use the definition to explain both the grading criteria and how scores within the grader’s range should be interpreted.
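For example, a definition for a factual-accuracy grader on a 1–5 range might look like this (the wording is illustrative, not a recommended template):

```text
Rate the factual accuracy of {{output}} as an answer to {{input}}.
If {{expected_output}} is provided, treat it as the reference answer.
Score 1 when the output contradicts the reference, 3 when it is
partially correct, and 5 when it is fully consistent with the reference.
```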

2. Configure model settings

Click the pencil icon next to LLM evaluation to open the model configuration modal. This is where you choose the judge model and tune model settings such as temperature and other generation parameters.

3. Run a test

Click Test run to open a modal and try the grader against sample inputs before saving or using it in a real evaluation run. This helps you verify that the definition and model settings produce the scoring behavior you expect.

Create an evaluator

Evaluators are built visually from blocks. Drag blocks from the palette and snap them together like puzzle pieces to define the evaluation flow.

The evaluator builder includes these block types:

Markers

Markers define the entry and exit points of the flow.

  • Original input is the starting point of the evaluator.
  • Final result is the output marker that returns the final grading result.

Conditions

Conditions add branching logic to the evaluator.

Use If, If / Then / Else, and comparison operators to route the flow differently based on values in the graph.

Graders

Grader blocks use the graders configured in the grader section above. They let the evaluator run LLM, code, or human grading logic as part of the flow.

Compute

Compute blocks perform calculations on values in the graph.

Use them for operations such as averages and weighted averages.
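The weighted average a Compute block performs can be sketched as plain arithmetic. The block itself is configured visually; this code only mirrors the math:

```python
def weighted_average(scores, weights):
    """Combine grader scores with per-grader weights, as a
    weighted-average Compute block would."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# e.g. an accuracy grader weighted twice as heavily as a style grader
print(weighted_average([4.0, 3.0], [2.0, 1.0]))  # ≈ 3.67
```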

Metrics

Metrics blocks represent values from the original input log.

These blocks expose built-in metrics such as latency, cost, model, and token counts.

Constants

Constants blocks represent fixed values that you provide directly in the evaluator.

Use constants for values such as numbers, text, true, or false.

To save a valid evaluator, the last block in the flow must be either Final result or If / Then / Else.
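The rule above can be illustrated by writing one possible flow's logic out in code. This is only an analogy for how a branching flow resolves to a final result; the actual evaluator is assembled from blocks, and the thresholds below are made up:

```python
def evaluate_flow(relevance: float, accuracy: float) -> float:
    """If / Then / Else routing: when relevance is too low, the Then
    branch ends the flow with a constant; otherwise the Else branch
    ends with a Compute block averaging the two grader scores."""
    if relevance < 2.0:
        return 0.0                      # constant as the final result
    return (relevance + accuracy) / 2   # average as the final result

print(evaluate_flow(1.0, 5.0))  # → 0.0
print(evaluate_flow(4.0, 5.0))  # → 4.5
```

Either branch terminates the flow, which is why an If / Then / Else block is a valid final block alongside Final result.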

Test, deploy, and load versions

Once the evaluator flow is ready, you can test the whole evaluator, deploy it as a version, and load previous versions from history.

  • Test run runs the entire evaluator against sample data so you can validate the full flow before deployment.
  • Deploy publishes the current draft as a new evaluator version.
  • Versions opens the evaluator history.

From the version history, you can review the current draft, see previously deployed versions, and load an older version back into the editor when needed.