Evaluators

Set up LLM, code, and human evaluators to score your AI outputs.

Before you begin:
  1. Sign up — Create an account at platform.respan.ai
  2. Create an API key — Generate one on the API keys page
  3. Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page

Add the Docs MCP to your AI coding tool to get help building with Respan. No API key needed.

{
  "mcpServers": {
    "respan-docs": {
      "url": "https://mcp.respan.ai/mcp/docs"
    }
  }
}

Evaluators score your LLM outputs — automatically with an LLM judge, programmatically with code, or manually with human review. Create them on Respan, then trigger them from code, the gateway, or experiments.

This is a beta feature.
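For example, triggering an evaluator from code could look roughly like the sketch below. The endpoint URL, payload field names, and headers are assumptions for illustration only, not a documented Respan API.

```python
import json
import urllib.request

# Hypothetical payload for triggering an evaluator run.
# The field names ("evaluator_id", "input", "output") are
# assumptions, not Respan's actual schema.
payload = {
    "evaluator_id": "my-evaluator",
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
}

def trigger_evaluation(api_key: str) -> dict:
    """Send the run to a hypothetical evaluation endpoint."""
    req = urllib.request.Request(
        "https://api.respan.ai/v1/evaluations",  # assumed URL
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same evaluator can also be attached to gateway traffic or experiment runs without any code, as described below.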

Set up a grader

Within an evaluator, graders are the individual metrics you score. Create a grader from the GRADERS section in the evaluator builder.

A grader starts with a name, an optional description, an output data type, a score range, and a passing score. The description is especially useful for human review because annotators can use it to understand what the metric measures and how it should be judged.

The output data type, score range, and passing score define the shared scoring contract for that grader. In most cases, these stay the same no matter how the grader is evaluated.
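As an illustration, the scoring contract can be thought of as a small record plus a pass check. The field names below are illustrative, not Respan's actual schema.

```python
from dataclasses import dataclass

@dataclass
class GraderContract:
    """A sketch of a grader's shared scoring contract."""
    name: str
    output_type: str      # e.g. "number"
    score_min: float
    score_max: float
    passing_score: float

    def passes(self, score: float) -> bool:
        """A score passes when it meets the passing threshold."""
        if not (self.score_min <= score <= self.score_max):
            raise ValueError("score outside the grader's range")
        return score >= self.passing_score

# A grader scored 0-5 that passes at 3.5, however it is evaluated.
helpfulness = GraderContract("helpfulness", "number", 0, 5, 3.5)
print(helpfulness.passes(4.0))  # → True
```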

A single grader can include both an LLM evaluation config and a Code evaluation config at the same time. During an evaluation run, Respan loads the config that matches how the grader is assigned, so the same grader can be used in human, LLM, or code evaluation workflows.

To edit a grader, click the pencil icon on the active grader block.

LLM evaluation

LLM evaluation lets a grader score outputs automatically with a judge model. This config lives inside the grader and uses the grader’s data type, score range, and passing score.

1. Write the definition

Use the Definition field to describe what the grader should measure and how the model should score it.

The definition must include {{output}}. You can also reference these optional variables:

Variable             Description
{{input}}            The original input sent to the application
{{output}}           The generated output being graded
{{expected_output}}  The reference or expected output, when provided
{{metadata}}         Custom metadata attached to the run
{{metrics}}          System metrics captured during the run

Use the definition to explain both the grading criteria and how scores within the grader’s range should be interpreted.
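For example, a definition for a factual-accuracy grader on a 1–5 range might look like this (the wording is illustrative, not a recommended template):

```text
Rate the factual accuracy of {{output}} as an answer to {{input}}.
If {{expected_output}} is provided, treat it as the reference answer.
Score 1 when the output contradicts the reference, 3 when it is
partially correct, and 5 when it is fully consistent with the reference.
```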

2. Configure model settings

Click the pencil icon next to LLM evaluation to open the model configuration modal. This is where you choose the judge model and tune model settings such as temperature and other generation parameters.

3. Run a test

Click Test run to open a modal and try the grader against sample inputs before saving or using it in a real evaluation run. This helps you verify that the definition and model settings produce the scoring behavior you expect.

Create an evaluator

Evaluators are built visually from blocks. Drag blocks from the palette and snap them together like puzzle pieces to define the evaluation flow.

The evaluator builder includes these block types:

Markers

Markers define the entry and exit points of the flow.

  • Original input is the starting point of the evaluator.
  • Final result is the output marker that returns the final grading result.

Conditions

Conditions add branching logic to the evaluator.

Use If, If / Then / Else, and comparison operators to route the flow differently based on values in the graph.

Graders

Grader blocks use the graders configured in the grader section above. They let the evaluator run LLM, code, or human grading logic as part of the flow.

Compute

Compute blocks perform calculations on values in the graph.

Use them for operations such as averages and weighted averages.
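The weighted average a Compute block performs can be sketched as plain arithmetic. The block itself is configured visually; this code only mirrors the math:

```python
def weighted_average(scores, weights):
    """Combine grader scores with per-grader weights, as a
    weighted-average Compute block would."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal length")
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_weight

# e.g. an accuracy grader weighted twice as heavily as a style grader
print(weighted_average([4.0, 3.0], [2.0, 1.0]))  # ≈ 3.67
```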

Metrics

Metrics blocks represent values from the original input log.

These blocks expose built-in metrics such as latency, cost, model, and token counts.

Constants

Constants blocks represent fixed values that you provide directly in the evaluator.

Use constants for values such as numbers, text, true, or false.

To save a valid evaluator, the last block in the flow must be either Final result or If / Then / Else.
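The rule above can be illustrated by writing one possible flow's logic out in code. This is only an analogy for how a branching flow resolves to a final result; the actual evaluator is assembled from blocks, and the thresholds below are made up:

```python
def evaluate_flow(relevance: float, accuracy: float) -> float:
    """If / Then / Else routing: when relevance is too low, the Then
    branch ends the flow with a constant; otherwise the Else branch
    ends with a Compute block averaging the two grader scores."""
    if relevance < 2.0:
        return 0.0                      # constant as the final result
    return (relevance + accuracy) / 2   # average as the final result

print(evaluate_flow(1.0, 5.0))  # → 0.0
print(evaluate_flow(4.0, 5.0))  # → 4.5
```

Either branch terminates the flow, which is why an If / Then / Else block is a valid final block alongside Final result.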

Test, deploy, and load versions

Once the evaluator flow is ready, you can test the whole evaluator, deploy it as a version, and load previous versions from history.

  • Test run runs the entire evaluator against sample data so you can validate the full flow before deployment.
  • Deploy publishes the current draft as a new evaluator version.
  • Versions opens the evaluator history.

From the version history, you can review the current draft, see previously deployed versions, and load an older version back into the editor when needed.