Measure and improve LLM output quality with automated judges, custom evaluators, and structured experiments - so you ship better models, not just different ones.
Minutes
to run your first eval
10+
built-in evaluator types
Any model
or prompt can be compared
CI/CD
integration via REST API
Teams that skip evals don't avoid quality problems - they just discover them later and with less information to act on.
✗ You don't know if a change made things better or worse
Without structured evals, every prompt change or model swap is a guess. You ship it, watch support tickets, and hope. There's no systematic way to measure quality before it reaches users.
✗ LLM quality is subjective - there's no consistent metric
Human intuition about 'good' output varies by reviewer and changes over time. Without defined criteria and an automated judge, quality measurement doesn't scale.
✗ Building an eval harness is a project in itself
Setting up testsets, running model calls, aggregating scores, and visualizing results takes weeks to build and maintain. Most teams skip it - and pay the cost in production.
✗ You only discover regressions after users complain
Without automated eval runs tied to deployments, a prompt change that degrades quality ships silently. You find out days later when support volume spikes.
✗ Comparing models requires ad hoc scripts and manual work
Running the same test prompts through GPT-4o and Claude and comparing results means writing one-off scripts, managing API keys, and diffing results by hand - every time.
A complete eval pipeline - testset management, evaluator library, experiment runner, and human annotation - in one place.
Respan evals work on testsets: collections of input/expected-output pairs. You point a set of evaluators at a testset and trigger a run. The evaluators score each example, results are aggregated into summary metrics, and changes are tracked over time. Experiments extend this by running the same testset through multiple configurations and comparing results side-by-side.
Build a testset
Sample production logs, import a CSV, or author examples manually. Tag examples by feature, use case, or difficulty.
Configure evaluators
Choose built-in judges (faithfulness, relevance, toxicity) or write a custom Python evaluator with your own scoring logic.
Run evals or experiments
Trigger a run against a testset. For experiments, configure multiple model/prompt variants - Respan runs them in parallel.
Compare and decide
Review score summaries, win rates, and side-by-side diffs. Use results to promote a change or block a deployment.
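The "custom Python evaluator" in step 2 can be as simple as a plain function that scores one example. The sketch below is illustrative only: the function signature (an example dict in, a score dict out) and the field names are assumptions, not Respan's documented evaluator contract.

```python
# Hypothetical custom evaluator: scores an output by how many expected
# keywords it contains. Signature and field names are assumptions.

def keyword_coverage(example: dict) -> dict:
    """Return a 0-1 score: fraction of expected keywords found in the output."""
    keywords = example.get("expected_keywords", [])
    output = example.get("output", "").lower()
    if not keywords:
        return {"score": 1.0, "reason": "no keywords to check"}
    hits = sum(1 for kw in keywords if kw.lower() in output)
    return {"score": hits / len(keywords), "reason": f"{hits}/{len(keywords)} keywords found"}

result = keyword_coverage({
    "output": "Reset your password from the account settings page.",
    "expected_keywords": ["password", "account settings"],
})
# result["score"] == 1.0 - both keywords appear in the output
```

Returning a reason string alongside the score keeps per-example results debuggable when you review a run.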
from respan import RespanClient

client = RespanClient(api_key="YOUR_RESPAN_KEY")

# Create a testset from production logs
testset = client.testsets.create_from_logs(
    name="Support Q&A - Feb 2025",
    filters={"feature": "support", "date_range": "last_7_days"},
    sample_size=200,
)

# Run an LLM-as-judge eval
eval_run = client.evals.run(
    testset_id=testset.id,
    evaluators=["faithfulness", "coherence"],
    judge_model="gpt-4o",
    criteria="The response should accurately answer the user's question based on the provided context.",
)

print(f"Faithfulness: {eval_run.scores['faithfulness']:.2f}")
print(f"Coherence: {eval_run.scores['coherence']:.2f}")

# Or run an experiment to compare two models
experiment = client.experiments.create(
    testset_id=testset.id,
    variants=[
        {"model": "gpt-4o", "prompt_id": "prod-v3"},
        {"model": "claude-3-5-sonnet", "prompt_id": "prod-v3"},
    ],
    evaluators=["faithfulness", "coherence"],
)

If you're optimizing prompts for an AI feature: run the new version through a testset, compare scores to the current production version, and only ship if quality improves.
If you're evaluating whether to switch from GPT-4o to a cheaper model: run both through the same testset, compare win rates and cost, and make the decision with data.
If you ship prompts or model configs frequently: plug eval runs into your CI/CD pipeline. If a change drops scores below threshold, the deployment is blocked automatically.
If you have enterprise customers with quality requirements: track faithfulness and relevance scores over time. Export results to prove compliance or identify where quality dips.
If you have domain experts who can judge output quality: route ambiguous examples to an annotation queue. Use labeled outputs to build ground-truth testsets and fine-tune evaluators.
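The CI/CD gate described above reduces to a threshold check over a run's summary scores. A minimal sketch, with illustrative names rather than a Respan API:

```python
# Hypothetical deployment gate: compare a run's scores against per-evaluator
# minimums and report which evaluators fell short.

def failing_evaluators(scores: dict, thresholds: dict) -> list:
    """Return the evaluators whose score is below the configured floor."""
    return [name for name, floor in thresholds.items() if scores.get(name, 0.0) < floor]

failing = failing_evaluators(
    scores={"faithfulness": 0.91, "coherence": 0.78},
    thresholds={"faithfulness": 0.85, "coherence": 0.80},
)
# failing == ["coherence"] -> block the deployment and report why
```

In a pipeline, a non-empty list would translate to a non-zero exit code, which is what most CI systems use to block a deploy step.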
Evaluate any model
Import from
CI/CD integration
The eval harness itself isn't the hard part - it's everything around it: testset versioning, evaluator management, result storage, trend tracking, experiment runners, human annotation queues, CI integration, and keeping all of it maintained as your models and prompts change. Respan ships all of that, so you spend time interpreting results instead of building infrastructure to collect them.