Set up Respan
- Sign up — Create an account at platform.respan.ai
- Create an API key — Generate one on the API keys page
- Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page
Overview
GEPA (Generate, Evaluate, Promote, Analyze) is a self-improving workflow for prompt optimization. Instead of guessing which prompt changes will work, you run a structured loop that uses evaluation data to drive improvements.
The GEPA loop
1. Generate — Create prompt candidates
Start with your current prompt and create variants. Variants can target:
- Instructions: Rewrite the system message with different framing
- Examples: Add, remove, or change few-shot examples
- Constraints: Add guardrails, output format rules, or tone guidelines
- Models: Test the same prompt on different models
To create these variants in Respan:
- Go to Prompts
- Open your prompt
- Create new versions with your variations
For example, your version history might look like:
- v1 (baseline): Simple instructions
- v2: Added few-shot examples of good responses
- v3: Added explicit constraints (“never say I don’t know”)
- v4: Restructured as step-by-step reasoning
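If you prefer to draft candidates in code before saving them as prompt versions, a minimal sketch could look like the following. The dictionary shape, version names, and the `build_messages` helper are assumptions for this example, not the Respan Prompts API schema.

```python
# Sketch: candidate prompt variants as plain data, so each one can be saved
# as a prompt version. The dict shape is an assumption for this example,
# not the Respan Prompts API schema.
CANDIDATES = {
    "v1-baseline": {
        "system": "You are a support assistant. Answer the user's question.",
        "few_shot": [],
    },
    "v2-few-shot": {
        "system": "You are a support assistant. Answer the user's question.",
        "few_shot": [
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Go to Settings > Security > Reset password."},
        ],
    },
    "v3-constraints": {
        "system": (
            "You are a support assistant. Answer the user's question. "
            "Never say 'I don't know'; point the user to a relevant doc instead."
        ),
        "few_shot": [],
    },
}

def build_messages(candidate: dict, question: str) -> list[dict]:
    """Assemble the chat messages for one candidate and one test question."""
    return (
        [{"role": "system", "content": candidate["system"]}]
        + candidate["few_shot"]
        + [{"role": "user", "content": question}]
    )
```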
2. Evaluate — Test with data
Run all candidates against the same dataset using experiments:
- Go to Experiments > + New experiment
- Select your prompt and check all versions to compare
- Load your testset (curated from production logs or manually created)
- Run the experiment
- Run your evaluator(s) on all outputs
| Evaluator | What it measures | Score type |
|---|---|---|
| Task accuracy | Does the output correctly complete the task? | Numerical (1-5) |
| Instruction following | Does it follow all prompt constraints? | Boolean |
| Tone | Is the tone appropriate for the use case? | Numerical (1-5) |
Use at least 20-30 test cases per experiment; with fewer, your results may not be statistically meaningful.
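The Experiments page runs this loop for you, but the same logic in code looks roughly like the sketch below. It reuses `build_messages` from the Generate sketch; `call_model` is a placeholder for your provider call, and `score_task_accuracy` is a toy rule-based stand-in for a real evaluator (the `expected_phrase` field on test cases is an assumption of this example).

```python
# Sketch of the experiment loop: every version sees the same test cases and
# every output is scored the same way. build_messages comes from the Generate
# sketch above; call_model is a placeholder for your provider call, and
# score_task_accuracy is a toy stand-in for a real evaluator.
from statistics import mean

def call_model(messages: list[dict]) -> str:
    raise NotImplementedError  # swap in your model/provider call

def score_task_accuracy(case: dict, output: str) -> float:
    # Toy rule-based check: 5 if the expected phrase appears, else 1.
    # (The "expected_phrase" field is an assumption about your test cases.)
    return 5.0 if case["expected_phrase"].lower() in output.lower() else 1.0

def run_experiment(candidates: dict, test_set: list[dict]) -> dict[str, float]:
    """Return the average task-accuracy score per candidate version."""
    scores: dict[str, list[float]] = {name: [] for name in candidates}
    for case in test_set:  # the same cases for every version
        for name, candidate in candidates.items():
            output = call_model(build_messages(candidate, case["question"]))
            scores[name].append(score_task_accuracy(case, output))
    return {name: mean(vals) for name, vals in scores.items()}
```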
3. Promote — Deploy the winner
Compare experiment results across all versions:
- Average scores per evaluator per version
- Pass rate for boolean evaluators
- Cost and latency differences between versions/models
Once a version clearly wins:
- Set it as the active version on the Prompts page
- Your production code automatically picks up the new version (if using the Prompts API)
- Monitor with online evaluation to confirm the improvement holds in production
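If your production code loads prompts through the Prompts API, it can fetch whichever version is currently active at request time, so promotion requires no deploy. A minimal sketch follows; the base URL, endpoint path, and response fields are assumptions, so check the Respan API reference for the real shape.

```python
# Sketch: production code that always serves the currently active prompt
# version. The base URL, endpoint path, and response fields are assumptions;
# check the Respan API reference for the real Prompts API shape.
import os
import requests

RESPAN_API_KEY = os.environ["RESPAN_API_KEY"]
PROMPT_ID = "support-assistant"  # hypothetical prompt ID

def fetch_active_prompt(prompt_id: str = PROMPT_ID) -> dict:
    resp = requests.get(
        f"https://api.respan.ai/prompts/{prompt_id}",  # hypothetical endpoint
        headers={"Authorization": f"Bearer {RESPAN_API_KEY}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # assumed to contain the active version's messages
```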
4. Analyze — Learn from failures
Look at the test cases where the promoted version still scored poorly:
- Filter experiment results for low-scoring rows
- Read the inputs, outputs, and evaluator reasoning
- Identify patterns:
- Are there specific question types the prompt handles poorly?
- Are there edge cases not covered by the instructions?
- Is the model struggling with certain reasoning tasks?
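Whatever patterns you find feed the next Generate step. A quick way to surface them is to export the experiment rows and tally failures by the tags you attach to test cases; the `score` and `question_type` fields below are assumptions about how you label your data, not a fixed export schema.

```python
# Sketch: tally low-scoring rows from exported experiment results by the
# question_type tag on each test case. Both field names ("score",
# "question_type") are assumptions about how you label your data.
from collections import Counter

def failure_patterns(rows: list[dict], threshold: float = 3.0) -> Counter:
    """Count how often each question type appears among low-scoring cases."""
    failures = [r for r in rows if r["score"] < threshold]
    return Counter(r.get("question_type", "untagged") for r in failures)

# Example usage:
# print(failure_patterns(exported_rows).most_common(5))
# -> e.g. [("billing", 7), ("multi-step", 4), ...]
```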
Implementing GEPA with Respan
Here’s a practical implementation using Respan’s features:
Setup (one-time)
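The one-time setup (account, API key, prompt, evaluators, test set) happens in the Respan UI, per the checklist at the top of this page. On the code side, a minimal sketch might keep the API key in the environment and the test set in a versioned file; the file name and test case fields here are assumptions.

```python
# Sketch of the code-side setup: API key in the environment, test set in a
# versioned file. The file name and test case fields are assumptions.
import json
import os

RESPAN_API_KEY = os.environ["RESPAN_API_KEY"]  # created on the API keys page

def load_test_set(path: str = "testset.jsonl") -> list[dict]:
    """One JSON object per line, e.g. {"question": ..., "expected_phrase": ..., "question_type": ...}."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```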
Run the loop
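Tying the sketches above together, one GEPA iteration could be driven by a small script like this. It reuses `load_test_set` and `run_experiment` from earlier; promotion itself still happens on the Prompts page (or via the Prompts API), and the target score is whatever you defined as "good enough".

```python
# Sketch of one GEPA iteration, reusing load_test_set and run_experiment from
# the sketches above. Promotion itself happens on the Prompts page (or via
# the Prompts API); the target score is whatever you defined as "good enough".
def gepa_iteration(candidates: dict, target: float = 4.5) -> str:
    test_set = load_test_set()
    scores = run_experiment(candidates, test_set)  # Evaluate
    for name, avg in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {avg:.2f} / 5")
    winner = max(scores, key=scores.get)  # best candidate to Promote
    if scores[winner] >= target:
        print(f"{winner} meets the target; promote it and stop optimizing.")
    else:
        print(f"{winner} is best so far; analyze its failures and generate new variants.")
    return winner
```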
Automate the monitoring
After promoting a version, set up continuous monitoring:
- Online evaluation: Automatically score 10-20% of production traffic
- Alert on regression: Set up an automation that alerts when average scores drop below your threshold
- Collect new failures: Periodically export low-scoring production logs to add to your test dataset
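As a sketch of the same idea in code, a scheduled job could compare recent online-eval scores against your threshold. `fetch_recent_online_scores` is a placeholder for however you export those scores; in practice, an automation inside Respan with a score-threshold alert does the same job with less code.

```python
# Sketch of a scheduled regression check. fetch_recent_online_scores is a
# placeholder (e.g. an export of recent online-eval scores); inside Respan,
# an automation with a score-threshold alert does the same job with less code.
def fetch_recent_online_scores() -> list[float]:
    raise NotImplementedError  # swap in an export of recent online-eval scores

def check_regression(threshold: float = 4.0) -> None:
    scores = fetch_recent_online_scores()
    if not scores:
        return
    avg = sum(scores) / len(scores)
    if avg < threshold:
        print(f"ALERT: average online-eval score {avg:.2f} dropped below {threshold}")
```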
Example: 3 iterations of GEPA
| Iteration | Change | Avg score | Improvement |
|---|---|---|---|
| Baseline (v1) | Simple instructions | 3.2 / 5 | — |
| Iteration 1 (v2) | Added few-shot examples | 3.8 / 5 | +19% |
| Iteration 2 (v4) | Added edge case handling from failure analysis | 4.1 / 5 | +8% |
| Iteration 3 (v6) | Switched to step-by-step reasoning for complex questions | 4.4 / 5 | +7% |
Tips
- Keep a changelog: Document what changed in each version and why
- Don’t change too much at once: Isolate variables so you know what caused the improvement
- Grow your dataset: Add production failures after every iteration
- Set a target: Define what score counts as “good enough” so you know when to stop optimizing
- Automate what you can: Use online evals and automations to reduce manual review work