Measure and improve LLM output quality with automated judges, custom evaluators, and structured experiments - so you ship better models, not just different ones.
Minutes
to run your first eval
10+
built-in evaluator types
Any model
or prompt can be compared
CI/CD
integration via REST API
Teams that skip evals don't avoid quality problems - they just discover them later and with less information to act on.
✗ You don't know if a change made things better or worse
Without structured evals, every prompt change or model swap is a guess. You ship it, watch support tickets, and hope. There's no systematic way to measure quality before it reaches users.
✗ LLM quality is subjective - there's no consistent metric
Human intuition about 'good' output varies by reviewer and changes over time. Without defined criteria and an automated judge, quality measurement doesn't scale.
✗ Building an eval harness is a project in itself
Setting up testsets, running model calls, aggregating scores, and visualizing results takes weeks to build and maintain. Most teams skip it - and pay the cost in production.
✗ You only discover regressions after users complain
Without automated eval runs tied to deployments, a prompt change that degrades quality ships silently. You find out days later when support volume spikes.
✗ Comparing models requires ad hoc scripts and manual work
Running the same test prompts through GPT-4o and Claude and comparing results means writing one-off scripts, managing API keys, and diffing results by hand - every time.
A complete eval pipeline - testset management, evaluator library, experiment runner, and human annotation - in one place.
Respan evals work on testsets: collections of input/expected-output pairs. You point a set of evaluators at a testset and trigger a run. The evaluators score each example, results are aggregated into summary metrics, and changes are tracked over time. Experiments extend this by running the same testset through multiple configurations and comparing results side-by-side.
Build a testset
Sample production logs, import a CSV, or author examples manually. Tag examples by feature, use case, or difficulty.
Configure evaluators
Choose built-in judges (faithfulness, relevance, toxicity) or write a custom Python evaluator with your own scoring logic.
Run evals or experiments
Trigger a run against a testset. For experiments, configure multiple model/prompt variants - Respan runs them in parallel.
Compare and decide
Review score summaries, win rates, and side-by-side diffs. Use results to promote a change or block a deployment.
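The "custom Python evaluator" in step 2 can be as simple as a plain function that scores one example. The sketch below is illustrative only: the function signature (an example dict in, a score dict out) and the field names are assumptions, not Respan's documented evaluator contract.

```python
# Hypothetical custom evaluator: scores an output by how many expected
# keywords it contains. Signature and field names are assumptions.

def keyword_coverage(example: dict) -> dict:
    """Return a 0-1 score: fraction of expected keywords found in the output."""
    keywords = example.get("expected_keywords", [])
    output = example.get("output", "").lower()
    if not keywords:
        return {"score": 1.0, "reason": "no keywords to check"}
    hits = sum(1 for kw in keywords if kw.lower() in output)
    return {"score": hits / len(keywords), "reason": f"{hits}/{len(keywords)} keywords found"}

result = keyword_coverage({
    "output": "Reset your password from the account settings page.",
    "expected_keywords": ["password", "account settings"],
})
# result["score"] == 1.0 - both keywords appear in the output
```

Returning a reason string alongside the score keeps per-example results debuggable when you review a run.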
from respan import RespanClient

client = RespanClient(api_key="YOUR_RESPAN_KEY")

# Create a testset from production logs
testset = client.testsets.create_from_logs(
    name="Support Q&A - Feb 2025",
    filters={"feature": "support", "date_range": "last_7_days"},
    sample_size=200,
)

# Run an LLM-as-judge eval
eval_run = client.evals.run(
    testset_id=testset.id,
    evaluators=["faithfulness", "coherence"],
    judge_model="gpt-4o",
    criteria="The response should accurately answer the user's question based on the provided context.",
)

print(f"Faithfulness: {eval_run.scores['faithfulness']:.2f}")
print(f"Coherence: {eval_run.scores['coherence']:.2f}")

# Or run an experiment to compare two models
experiment = client.experiments.create(
    testset_id=testset.id,
    variants=[
        {"model": "gpt-4o", "prompt_id": "prod-v3"},
        {"model": "claude-3-5-sonnet", "prompt_id": "prod-v3"},
    ],
    evaluators=["faithfulness", "coherence"],
)

If you're optimizing prompts for an AI feature: run the new version through a testset, compare scores to the current production version, and only ship if quality improves.
If you're evaluating whether to switch from GPT-4o to a cheaper model: run both through the same testset, compare win rates and cost, and make the decision with data.
If you ship prompts or model configs frequently: plug eval runs into your CI/CD pipeline. If a change drops scores below threshold, the deployment is blocked automatically.
If you have enterprise customers with quality requirements: track faithfulness and relevance scores over time. Export results to prove compliance or identify where quality dips.
If you have domain experts who can judge output quality: route ambiguous examples to an annotation queue. Use labeled outputs to build ground-truth testsets and fine-tune evaluators.
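The CI/CD gate described above reduces to a threshold check over a run's summary scores. A minimal sketch, with illustrative names rather than a Respan API:

```python
# Hypothetical deployment gate: compare a run's scores against per-evaluator
# minimums and report which evaluators fell short.

def failing_evaluators(scores: dict, thresholds: dict) -> list:
    """Return the evaluators whose score is below the configured floor."""
    return [name for name, floor in thresholds.items() if scores.get(name, 0.0) < floor]

failing = failing_evaluators(
    scores={"faithfulness": 0.91, "coherence": 0.78},
    thresholds={"faithfulness": 0.85, "coherence": 0.80},
)
# failing == ["coherence"] -> block the deployment and report why
```

In a pipeline, a non-empty list would translate to a non-zero exit code, which is what most CI systems use to block a deploy step.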
Evaluate any model
Import from
CI/CD integration
The eval harness itself isn't the hard part - it's everything around it: testset versioning, evaluator management, result storage, trend tracking, experiment runners, human annotation queues, CI integration, and keeping all of it maintained as your models and prompts change. Respan ships all of that, so you spend time interpreting results instead of building infrastructure to collect them.