Build eval datasets from production traffic, run experiments, and measure quality — without waiting for engineering to export data. Real-world evals, not synthetic benchmarks.
Query production data directly. Custom Python evaluators. Export for fine-tuning.
Production data access, evaluation pipelines, experiment tracking, and dataset management in one place.
Query production data directly
Filter all LLM inputs and outputs by model, user, time range, quality score, or custom metadata. No engineering dependency.
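Conceptually, a query is just a set of filters over your logged records. A minimal sketch of that idea in plain Python; the field names (model, user_id, score, ts, metadata) are assumptions for illustration, not the platform's actual schema:

```python
from datetime import datetime

# Illustrative log records; the field names are assumptions for this sketch.
records = [
    {"model": "gpt-4o", "user_id": "u_123", "score": 0.92,
     "ts": datetime(2024, 5, 1, 10, 0), "metadata": {"feature": "summarize"}},
    {"model": "gpt-4o-mini", "user_id": "u_456", "score": 0.44,
     "ts": datetime(2024, 5, 2, 14, 30), "metadata": {"feature": "chat"}},
]

def query(records, model=None, min_score=None, since=None, feature=None):
    """Filter production logs the way the UI filters do: model, score, time, metadata."""
    hits = []
    for r in records:
        if model and r["model"] != model:
            continue
        if min_score is not None and r["score"] < min_score:
            continue
        if since and r["ts"] < since:
            continue
        if feature and r["metadata"].get("feature") != feature:
            continue
        hits.append(r)
    return hits

high_quality = query(records, min_score=0.8, since=datetime(2024, 4, 1))
```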
Build eval datasets from real traffic
Sample production conversations to create representative evaluation sets. Filter by quality, topic, or edge cases.
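A rough sketch of how such a sample could be drawn, assuming each conversation carries a topic field (an illustrative assumption); stratifying by topic keeps the eval set representative of production traffic:

```python
import random

# Hypothetical filtered production conversations; the record shape is an assumption.
conversations = [{"id": i, "topic": random.choice(["billing", "bug", "howto"]),
                  "score": random.random()} for i in range(500)]

def sample_eval_set(convs, n=100, seed=42):
    """Stratify by topic so the eval set mirrors the production topic mix."""
    rng = random.Random(seed)
    by_topic = {}
    for c in convs:
        by_topic.setdefault(c["topic"], []).append(c)
    per_topic = max(1, n // len(by_topic))
    sample = []
    for topic, items in by_topic.items():
        sample.extend(rng.sample(items, min(per_topic, len(items))))
    return sample

eval_set = sample_eval_set(conversations)
```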
Run LLM-as-judge evals
Define evaluation criteria in natural language. A stronger model scores your production outputs. Results tracked and graphable over time.
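At its core, the judge is a stronger model prompted with your criteria. A hedged sketch using the OpenAI SDK as the judge; the rubric text and 1-5 scale are illustrative, not a built-in rubric:

```python
from openai import OpenAI  # any strong model can act as judge; OpenAI shown for illustration

client = OpenAI()

CRITERIA = """Score the assistant reply from 1-5:
- Faithful to the user's question
- No unsupported claims
- Concise and actionable
Answer with a single integer."""

def judge(user_input: str, output: str) -> int:
    """Have a stronger model grade a production output against natural-language criteria."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": CRITERIA},
            {"role": "user", "content": f"Question:\n{user_input}\n\nReply:\n{output}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```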
Write custom Python evaluators
Arbitrary scoring logic: regex checks, embedding similarity, API calls, domain-specific rules. Run alongside built-in evaluators.
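A custom evaluator is simply a Python function that takes a record and returns named scores. A small illustrative example; the no_pii and cites_source rules are hypothetical, not built-ins:

```python
import re

def no_pii(output: str) -> float:
    """1.0 if no obvious email or phone number leaks into the reply, else 0.0."""
    pii = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|\+?\d[\d\s().-]{7,}\d")
    return 0.0 if pii.search(output) else 1.0

def cites_source(output: str) -> float:
    """Domain-specific rule: answers must reference a ticket ID like ABC-1234."""
    return 1.0 if re.search(r"\b[A-Z]{2,5}-\d{2,5}\b", output) else 0.0

def evaluate(record: dict) -> dict:
    """A custom evaluator is just a function: record in, named scores out."""
    out = record["output"]
    return {"no_pii": no_pii(out), "cites_source": cites_source(out)}

print(evaluate({"output": "Fixed in ticket OPS-212, no personal data shared."}))
```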
Compare models statistically
Same testset, multiple model configurations. Side-by-side score distributions with statistical significance testing.
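For intuition, the comparison boils down to a significance test over the two score distributions. A self-contained sketch with scipy and made-up judge scores; a rank test suits bounded 1-5 scores:

```python
from statistics import mean
from scipy.stats import mannwhitneyu  # non-parametric test for bounded judge scores

# Judge scores for the same testset under two model configurations (illustrative numbers).
scores_a = [4, 5, 3, 4, 4, 5, 2, 4, 5, 4]
scores_b = [3, 4, 3, 3, 4, 4, 2, 3, 4, 3]

stat, p_value = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
print(f"config A mean={mean(scores_a):.2f}, config B mean={mean(scores_b):.2f}, p={p_value:.3f}")
```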
Annotate and label outputs
Route production outputs to review queues. Collect human labels with custom schemas. Track inter-annotator agreement.
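Inter-annotator agreement is typically a chance-corrected statistic such as Cohen's kappa. A self-contained sketch with illustrative labels; not necessarily the exact metric the platform reports:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two reviewers labeling the same 8 outputs (illustrative labels).
a = ["good", "bad", "good", "good", "bad", "good", "bad", "good"]
b = ["good", "bad", "good", "bad", "bad", "good", "bad", "good"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```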
Export datasets for fine-tuning
Export labeled examples in JSON, JSONL, or CSV. Filter by quality score, label, or metadata to curate training data.
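A sketch of the kind of file such an export produces: filter on label and quality score, then write one chat-formatted JSON object per line. The example rows and the 0.8 threshold are assumptions:

```python
import json

# Labeled examples as they might come out of a review queue; the shape is an assumption.
examples = [
    {"input": "Reset my password", "output": "Go to Settings > Security...",
     "label": "good", "score": 0.95},
    {"input": "Cancel my plan", "output": "I can't help with that.",
     "label": "bad", "score": 0.31},
]

# Keep only high-quality, human-approved rows and write chat-style JSONL for fine-tuning.
with open("finetune.jsonl", "w") as f:
    for ex in examples:
        if ex["label"] == "good" and ex["score"] >= 0.8:
            row = {"messages": [{"role": "user", "content": ex["input"]},
                                {"role": "assistant", "content": ex["output"]}]}
            f.write(json.dumps(row) + "\n")
```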
Track quality over time
Dashboards for eval score trends, quality regressions, and metric comparisons across experiments.
Evaluation
Data & labeling
Research & optimization
From production data to measured quality improvement — without building infrastructure.
Access production data
Query all LLM inputs and outputs with filters for user, model, feature, time range, and custom metadata.
→ Real production data, queryable
Build eval datasets
Sample production traffic to create representative evaluation sets. Filter by quality, topic, or edge cases.
→ Curated evaluation datasets
Run evaluations
Apply LLM judges, Python evaluators, or rule-based checks. Compare scores across model/prompt variants.
→ Quality scores and experiment results
Annotate and label
Route ambiguous cases to human review. Collect structured labels. Build gold-standard training data.
→ Labeled datasets for fine-tuning
Export and iterate
Export datasets for fine-tuning. Track quality trends. Feed improvements back into production.
→ Measurable quality improvement
100% of production data queryable
Custom Python evaluators supported
1-click dataset export
Real-time experiment tracking
Data formats
Evaluate any model