Experiments

Run evaluations over a dataset and compare results.

What is an experiment?

An experiment runs evaluations over a dataset and produces scores. It generates an output for every row in your dataset by running it through a prompt or model, runs your selected evaluators over each output, and aggregates the results. Go to Experiments and click New experiment to get started.


Experiment types

When creating an experiment, you choose a task type that determines how outputs are generated. Pick the type that matches what you want to test.

Type: Evaluate a prompt version
What it does: Fills a prompt template with each dataset row’s variables, generates new outputs, then runs evaluators
Use when: You want to test or compare prompt versions

Type: Evaluate with a model
What it does: Sends each dataset row’s input directly to a model you configure, generates new outputs, then runs evaluators
Use when: You want to compare models, temperatures, or other generation settings

Type: Evaluate dataset outputs
What it does: Skips generation entirely, reuses the outputs already stored in your dataset, then runs evaluators
Use when: You already have outputs and just want to score them

Evaluate a prompt version

Use this type when you have a saved prompt template with variables like {{question}} and want to test how it performs across a dataset.

  1. Select your dataset, then pick Evaluate a prompt version as the task type
  2. Choose the prompt and version to test
  3. Select evaluators to score the outputs
  4. Click Create

Respan fills the template with each row’s variables, generates an output for every row, and runs the evaluators on the results.
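The fill step is simple string substitution; here is a minimal sketch in Python (the `fill_template` helper is illustrative, not Respan's actual implementation):

```python
import re

def fill_template(template: str, row: dict) -> str:
    """Replace {{variable}} placeholders with values from a dataset row."""
    def substitute(match):
        key = match.group(1).strip()
        if key not in row:
            raise KeyError(f"Dataset row is missing variable: {key}")
        return str(row[key])
    return re.sub(r"\{\{(.*?)\}\}", substitute, template)

template = "Answer concisely: {{question}}"
row = {"question": "What is the capital of France?"}
print(fill_template(template, row))
# → "Answer concisely: What is the capital of France?"
```

Each filled prompt is then sent to the model, producing one output per dataset row for the evaluators to score.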

To compare prompt versions, create multiple experiments with the same dataset and evaluators but different prompt versions.

Prompt experiment setup

Evaluate with a model

Use this when you want to compare models or generation settings. No prompt template needed.

  1. Select your dataset, then pick Evaluate with a model as the task type
  2. Configure the model and generation settings (temperature, max tokens, etc.)
  3. Select evaluators to score the outputs
  4. Click Create

Respan sends each row’s input directly to the model, generates outputs, and runs the evaluators.
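In code terms the flow looks roughly like the sketch below; `call_model` and the settings dict are stand-ins for illustration, not Respan's API:

```python
def call_model(prompt: str, model: str, temperature: float, max_tokens: int) -> str:
    """Stand-in for a real model call; a real implementation would hit an LLM API."""
    return f"[{model} @ T={temperature}] response to: {prompt}"

# Hypothetical generation settings configured on the experiment.
settings = {"model": "gpt-4o-mini", "temperature": 0.2, "max_tokens": 256}
dataset = [{"input": "Summarize: the quick brown fox..."}]

# Each row's input goes directly to the model; the outputs are then scored by evaluators.
outputs = [call_model(row["input"], **settings) for row in dataset]
```

Because only the model configuration varies between runs, any score difference between two such experiments can be attributed to the model or settings rather than the data.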

To compare models, create multiple experiments with the same dataset and evaluators but different model configurations.

Completion experiment setup

Evaluate dataset outputs

Use this when your dataset already contains outputs and you only want to score them without calling a model.

  1. Select your dataset, then pick Evaluate dataset outputs as the task type
  2. Select evaluators to score the existing outputs
  3. Click Create

No generation happens. Respan runs the evaluators directly on the outputs stored in your dataset.
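In effect this is just looping the evaluators over stored rows; a sketch with a toy exact-match evaluator (real evaluators can be arbitrary workflows, and the column names here are assumptions):

```python
def exact_match(output: str, expected: str) -> float:
    """Toy evaluator: 1.0 if the output matches the expected answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# Dataset rows that already contain outputs — no model call is needed.
dataset = [
    {"input": "2 + 2", "output": "4", "expected": "4"},
    {"input": "Capital of France", "output": "paris", "expected": "Paris"},
    {"input": "Largest planet", "output": "Saturn", "expected": "Jupiter"},
]

scores = [exact_match(row["output"], row["expected"]) for row in dataset]
mean_score = sum(scores) / len(scores)
print(scores, mean_score)  # → [1.0, 1.0, 0.0], mean ≈ 0.67
```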


Results and analytics

After an experiment finishes, inspect the generated outputs and evaluator scores per row.

Experiment results

The Analytics tab compares evaluator score distributions across experiments. The histogram groups results into score ranges so you can spot patterns and compare runs side by side.
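The range grouping is standard histogram bucketing; a sketch assuming scores in [0, 1] and a bucket width of 0.2 (the boundaries Respan actually uses may differ):

```python
from collections import Counter

def bucket(score: float, width: float = 0.2) -> str:
    """Map a score in [0, 1] to a labelled range, e.g. 0.73 -> '0.6-0.8'."""
    # Clamp so a perfect 1.0 lands in the top bucket instead of its own.
    lo = min(int(score / width), int(1 / width) - 1) * width
    return f"{lo:.1f}-{lo + width:.1f}"

scores = [0.95, 0.82, 0.78, 0.40, 0.91, 0.66, 1.0]
histogram = Counter(bucket(s) for s in scores)
for rng in sorted(histogram):
    print(rng, "#" * histogram[rng])
```

Running the same bucketing over two experiments' scores makes shifts in the distribution (e.g. mass moving from 0.4–0.6 into 0.8–1.0) easy to spot.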

Experiment analytics

Compare and iterate

To improve your outputs, run multiple experiments and compare the results:

  1. Run your first experiment with the current prompt or model
  2. Review the scores and identify where outputs fall short
  3. Update your prompt version, switch models, or adjust settings
  4. Run a new experiment with the same dataset and evaluators
  5. Use the Analytics tab to compare score distributions across experiments and pick the best configuration

Repeat until you are satisfied with the quality, then deploy the winning prompt or model to production.
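The compare-and-pick step amounts to ranking experiments by an aggregate score; a sketch with made-up score data (the experiment names and per-row scores are invented for illustration):

```python
# Hypothetical per-row scores from three experiments over the same dataset.
experiments = {
    "prompt-v1": [0.4, 0.6, 0.5, 0.7],
    "prompt-v2": [0.7, 0.8, 0.6, 0.9],
    "prompt-v2 + temp 0.2": [0.8, 0.9, 0.7, 0.9],
}

def mean(xs):
    return sum(xs) / len(xs)

# Rank configurations by mean evaluator score, best first.
ranked = sorted(experiments.items(), key=lambda kv: mean(kv[1]), reverse=True)
for name, scores in ranked:
    print(f"{name}: mean={mean(scores):.2f}")
best = ranked[0][0]
```

Because every experiment used the same dataset and evaluators, the mean scores are directly comparable, and the top-ranked configuration is the one to deploy.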