Launch Week Day 2: Evals

This is Day 2 of Respan Launch Week.

Evals are honestly one of the hardest parts of building AI applications. It's not that you can't run them. It's that the results are usually messy, hard to compare, and even harder to iterate on. So what we tried to do with Respan is make the whole process a lot more systematic.

The eval pipeline

[Figure: Eval pipeline]

When we think about an eval pipeline, a few pieces come together:

The thing you're iterating on. A prompt, a model, or a set of configs. Whatever you're trying to improve.
A dataset. The set of inputs you want to test against. Usually this comes from production data, sometimes a curated test set.
Evaluators. The scoring layer that tells you whether the output is actually good.
Experiments. A single run that ties it all together: your prompt + dataset + evaluators → results.
Comparison. Run two versions side by side and see exactly what changed.
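
To make the pieces concrete, here is a minimal sketch of how they could fit together in plain Python. This is not the Respan SDK: `Target`, `Evaluator`, `Experiment`, `shout`, and `is_uppercase` are hypothetical names chosen for illustration, and the target is a toy function standing in for a prompt or model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not Respan's API.
Target = Callable[[str], str]            # the thing you iterate on: input -> output
Evaluator = Callable[[str, str], float]  # (input, output) -> score between 0 and 1


@dataclass
class Experiment:
    """One run: target + dataset + evaluators -> one result row per input."""
    name: str
    target: Target
    dataset: List[str]
    evaluators: Dict[str, Evaluator]

    def run(self) -> List[dict]:
        rows = []
        for text in self.dataset:
            output = self.target(text)
            # Apply every evaluator to the same (input, output) pair.
            scores = {n: ev(text, output) for n, ev in self.evaluators.items()}
            rows.append({"input": text, "output": output, "scores": scores})
        return rows


def shout(text: str) -> str:
    """Toy target standing in for a prompt or model call."""
    return text.upper()


def is_uppercase(_input: str, output: str) -> float:
    """Toy code-based evaluator."""
    return 1.0 if output.isupper() else 0.0


experiment = Experiment("v1", shout, ["hello", "world"], {"uppercase": is_uppercase})
results = experiment.run()
```

Keeping the target and the evaluators as plain callables means a second version only has to swap one field, which is what makes the comparison step possible.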

Evaluators as workflows

This is where things usually get tricky. Traditionally you have to choose between an LLM judge, code-based checks, or human review. In reality, most teams need all three.

Instead of forcing you into one option, we made evaluators into workflows. You can route outputs through an LLM, add rule-based checks, and then only send certain cases to human review if they actually need it.

For example: if an LLM judge gives a low confidence score, you can automatically assign that case for a human to review. High-confidence passes skip the queue entirely.
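
The routing described above can be sketched in a few lines. Everything here is an assumption made for illustration rather than Respan's implementation: the `stub_llm_judge` function stands in for a real LLM call, the length rule is arbitrary, and the 0.7 confidence threshold is a made-up value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.7  # assumed cutoff; tune for your own judge


@dataclass
class Verdict:
    passed: Optional[bool]  # None while the case waits for a human
    confidence: float
    status: str


def rule_check(output: str) -> bool:
    """Code-based check: non-empty and at most 200 characters (arbitrary rule)."""
    return 0 < len(output.strip()) <= 200


def stub_llm_judge(output: str) -> Tuple[bool, float]:
    """Stand-in for an LLM judge; returns (passed, confidence)."""
    if "not sure" in output.lower():
        return True, 0.4
    return True, 0.95


def evaluate(output: str, human_queue: List[str]) -> Verdict:
    # 1. Cheap rule-based checks run first and can fail a case outright.
    if not rule_check(output):
        return Verdict(False, 1.0, "failed_rule")
    # 2. The LLM judge scores whatever passes the rules.
    passed, confidence = stub_llm_judge(output)
    # 3. Only low-confidence cases are sent to a human.
    if confidence < CONFIDENCE_THRESHOLD:
        human_queue.append(output)
        return Verdict(None, confidence, "pending_human_review")
    return Verdict(passed, confidence, "scored")


queue: List[str] = []
confident = evaluate("Paris is the capital of France.", queue)
uncertain = evaluate("I'm not sure, maybe Lyon?", queue)
empty = evaluate("", queue)
```

In this sketch only the uncertain answer lands in `queue`; the confident answer is scored directly and the empty one fails the rule check before the judge is ever called.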

[Figure: Evaluator workflow]

Run experiments and compare

An experiment takes your prompt or model, runs it against the dataset, applies your evaluators, and gives you a summary of how it performed. Each row shows the input, the output, and every evaluator score.

If you included human review in your evaluator workflow, those cases show up as pending until someone reviews them.
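
One way to picture the summary is as an aggregate that leaves pending rows out of the average until a reviewer scores them. The row shape and the `summarize` helper below are hypothetical, and averaging only the scored rows is an assumption about how a summary might treat pending cases, not a description of Respan's behavior.

```python
from typing import List, Optional


def summarize(rows: List[dict]) -> dict:
    """Aggregate one experiment; rows with score=None are pending human review."""
    scored = [r["score"] for r in rows if r["score"] is not None]
    pending = sum(1 for r in rows if r["score"] is None)
    mean: Optional[float] = sum(scored) / len(scored) if scored else None
    return {"rows": len(rows), "pending": pending, "mean_score": mean}


rows = [
    {"input": "q1", "output": "a1", "score": 1.0},
    {"input": "q2", "output": "a2", "score": 0.0},
    {"input": "q3", "output": "a3", "score": None},  # waiting on a reviewer
]
summary = summarize(rows)
```

Reporting the pending count next to the mean keeps it visible that the average covers only part of the dataset.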

Instead of looking at results in isolation, run another version with a new prompt or a different model and compare them side by side. You can see exactly where it improved, where it regressed, and by how much.
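
A side-by-side comparison boils down to joining two runs on the same inputs and looking at per-row score deltas. The sketch below assumes each run is a simple mapping from input ID to score; the `compare` function and the sample data are hypothetical.

```python
from typing import Dict


def compare(baseline: Dict[str, float], candidate: Dict[str, float]) -> dict:
    """Diff two runs over the inputs they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = {k: candidate[k] - baseline[k] for k in shared}
    return {
        "improved": [k for k in shared if deltas[k] > 0],
        "regressed": [k for k in shared if deltas[k] < 0],
        "mean_delta": sum(deltas.values()) / len(deltas) if deltas else 0.0,
    }


baseline = {"q1": 0.5, "q2": 1.0, "q3": 0.0}
candidate = {"q1": 1.0, "q2": 0.5, "q3": 0.0}
report = compare(baseline, candidate)
```

In this toy data the mean delta is zero even though one row improved and one regressed, which is why the row-level view matters as much as the aggregate.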

[Figure: Experiment comparison]

Iterate

From there it's a loop: test, compare, refine, repeat. The goal is to move away from guessing and have a clear, repeatable way to improve your AI system.

[Figure: The eval loop]

Evals are available now. Check out the docs to run your first experiment.

Not on Respan yet? Get started in under 5 minutes:

npx @respan/cli setup

To stay updated for the rest of Launch Week, follow us on X or join our Discord community!

You might also like

Launch Week Day 3: Respan Agent, CLI, and MCP

Launch Week Day 1: Monitors

The mess of OTel semantic conventions and why tracing CLI tools is still painful

Built for AI agents.
Break less.
Ship more.
