GEPA: self-improving prompt optimization
A Generate-Evaluate-Promote-Analyze loop for systematically improving prompts with data.
Set up Respan
- Sign up — Create an account at platform.respan.ai
- Create an API key — Generate one on the API keys page
- Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page
Overview
GEPA (Generate, Evaluate, Promote, Analyze) is a self-improving workflow for prompt optimization. Instead of guessing which prompt changes will work, you run a structured loop that uses evaluation data to drive improvements.
Each cycle is designed to produce a measurably better prompt. Over multiple iterations, you converge on an optimized prompt backed by evidence.
The GEPA loop
1. Generate — Create prompt candidates
Start with your current prompt and create variants. Each variant can target:
- Instructions: Rewrite the system message with different framing
- Examples: Add, remove, or change few-shot examples
- Constraints: Add guardrails, output format rules, or tone guidelines
- Models: Test the same prompt on different models
Create these as prompt versions in Respan:
- Go to Prompts
- Open your prompt
- Create new versions with your variations
Example: For a support chatbot prompt, you might create:
- v1 (baseline): Simple instructions
- v2: Added few-shot examples of good responses
- v3: Added explicit constraints (“never say I don’t know”)
- v4: Restructured as step-by-step reasoning
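The four example variants above can be sketched as plain data, each changing exactly one variable relative to the baseline. This is an illustrative sketch, not a Respan schema; the baseline text and the variant contents are made up for the example.

```python
def make_candidates(baseline: str) -> dict[str, str]:
    """Build the four example variants, each changing one variable."""
    few_shot = (
        "\n\nExample:\nQ: How do I reset my password?"
        "\nA: Go to Settings > Security and click 'Reset password'."
    )
    constraints = (
        "\n\nRules:\n- If you are unsure, escalate to a human agent."
        "\n- Keep answers under 100 words."
    )
    reasoning = (
        "\n\nThink step by step: restate the question, identify the"
        " product area, then answer."
    )
    return {
        "v1": baseline,             # baseline: simple instructions
        "v2": baseline + few_shot,  # added few-shot examples
        "v3": baseline + constraints,  # added explicit constraints
        "v4": baseline + reasoning,    # step-by-step reasoning
    }
```

Keeping one variable per variant is what makes the Evaluate step interpretable: if v2 wins, you know the few-shot examples were the cause.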
2. Evaluate — Test with data
Run all candidates against the same dataset using experiments:
- Go to Experiments > + New experiment
- Select your prompt and check all versions to compare
- Load your testset (curated from production logs or manually created)
- Run the experiment
- Run your evaluator(s) on all outputs
Set up at least one evaluator that matches your quality criteria (for example, correctness, tone, or format compliance) before running the experiment.
Use at least 20-30 test cases per experiment. Fewer than that and your results may not be statistically meaningful.
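Once an experiment has run, the per-version comparison boils down to aggregating evaluator scores. A minimal sketch of that aggregation, assuming each experiment row carries a version name and a score in [0, 1] (the row shape is illustrative, not Respan's export format):

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate experiment rows into per-version average score and
    sample size.

    Each row is assumed to look like {"version": str, "score": float}.
    """
    by_version: dict[str, list[float]] = {}
    for row in results:
        by_version.setdefault(row["version"], []).append(row["score"])
    return {
        v: {"avg": sum(scores) / len(scores), "n": len(scores)}
        for v, scores in by_version.items()
    }
```

Reporting `n` alongside the average makes it easy to spot versions evaluated on too few cases (below the 20-30 minimum) before trusting the comparison.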
3. Promote — Deploy the winner
Compare experiment results across all versions:
- Average scores per evaluator per version
- Pass rate for boolean evaluators
- Cost and latency differences between versions/models
If a candidate clearly outperforms the baseline:
- Set it as the active version in the Prompts page
- Your production code automatically picks up the new version (if using the Prompts API)
- Monitor with online evaluation to confirm the improvement holds in production
If no candidate is clearly better, move to the Analyze step.
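The promotion decision above can be made explicit with a minimum-lift rule: promote only when a candidate beats the baseline by a clear margin, otherwise fall through to Analyze. This is a sketch of that decision; the `min_lift` threshold is an illustrative choice, not a Respan setting.

```python
def pick_winner(avg_scores: dict, baseline: str, min_lift: float = 0.05):
    """Return the version to promote, or None if nothing clearly beats
    the baseline (in which case, move to the Analyze step).

    `avg_scores` maps version name -> average evaluator score.
    """
    best = max(avg_scores, key=avg_scores.get)
    if best != baseline and avg_scores[best] >= avg_scores[baseline] + min_lift:
        return best
    return None
```

A small lift threshold guards against promoting a version whose edge is within noise, which matters most when the testset is near the 20-30 case minimum.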
4. Analyze — Learn from failures
Look at the test cases where your best-performing version still scored poorly:
- Filter experiment results for low-scoring rows
- Read the inputs, outputs, and evaluator reasoning
- Identify patterns:
- Are there specific question types the prompt handles poorly?
- Are there edge cases not covered by the instructions?
- Is the model struggling with certain reasoning tasks?
These insights feed the next Generate step. Each pattern you identify becomes a targeted improvement in your next prompt candidate.
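Pattern-finding over low-scoring rows can be as simple as counting failures per category. The sketch below assumes you tag each test case yourself (question type, edge case, and so on); the row shape and tags are illustrative, not part of Respan's data model.

```python
def failure_patterns(rows: list[dict], threshold: float = 0.5) -> dict:
    """Count low-scoring rows per tag to surface failure patterns.

    Each row is assumed to look like {"score": float, "tags": [str, ...]},
    where tags come from your own labeling of test cases.
    """
    counts: dict[str, int] = {}
    for row in rows:
        if row["score"] < threshold:
            for tag in row.get("tags", []):
                counts[tag] = counts.get(tag, 0) + 1
    # Most frequent failure category first
    return dict(sorted(counts.items(), key=lambda kv: -kv[1]))
```

The top entries in the result are the candidates for targeted improvements in your next Generate step.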
Implementing GEPA with Respan
Here’s a practical implementation using Respan’s features:
Setup (one-time)
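For the one-time setup, keep the API key you created on the API keys page out of your source code. The environment variable name `RESPAN_API_KEY` and the Bearer header format below are assumptions for illustration; check Respan's API documentation for the actual authentication scheme.

```python
import os

# Assumed variable name, not documented Respan behavior.
API_KEY = os.environ.get("RESPAN_API_KEY", "")


def auth_headers() -> dict:
    # Bearer auth is a common convention; verify against Respan's docs.
    return {"Authorization": f"Bearer {API_KEY}"}
```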
Run the loop
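The four GEPA steps can be expressed as a driver loop. In this sketch the `generate`, `evaluate`, `promote`, and `analyze` callables stand in for the manual Respan steps described above (creating versions, running experiments, switching the active version, reviewing failures); they are not Respan API calls.

```python
def gepa_loop(baseline, testset, generate, evaluate, promote, analyze,
              max_iters: int = 3):
    """Hypothetical driver for the GEPA cycle.

    generate(current) -> candidate prompts (including current)
    evaluate(candidates, testset) -> {candidate: avg score}
    promote(scores, current) -> winning candidate, or None
    analyze(scores) -> review failures to seed the next Generate step
    """
    current = baseline
    for _ in range(max_iters):
        candidates = generate(current)          # 1. Generate
        scores = evaluate(candidates, testset)  # 2. Evaluate
        winner = promote(scores, current)       # 3. Promote
        if winner is None:
            analyze(scores)                     # 4. Analyze failures
        else:
            current = winner
    return current
```

In practice each callable may be a human-in-the-loop step in the Respan UI; the loop just makes the ordering and the promote-or-analyze branch explicit.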
Automate the monitoring
After promoting a version, set up continuous monitoring:
- Online evaluation: Automatically score 10-20% of production traffic
- Alert on regression: Set up an automation that alerts when average scores drop below your threshold
- Collect new failures: Periodically export low-scoring production logs to add to your test dataset
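The monitoring steps above reduce to two small decisions: whether to score a given production request, and whether the rolling average has regressed. A sketch, assuming you implement sampling and alerting yourself rather than through Respan's built-in automations (the thresholds are example values):

```python
import random


def should_score(sample_rate: float = 0.15) -> bool:
    """Sample roughly 10-20% of production traffic for online evaluation."""
    return random.random() < sample_rate


def regression_alert(recent_avg: float, target: float = 0.8) -> bool:
    """Fire an alert when the rolling average drops below your target.

    0.8 is an example; use the "good enough" score you defined in Tips.
    """
    return recent_avg < target
```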
Example: 3 iterations of GEPA
Each iteration is small, measurable, and evidence-based.
Tips
- Keep a changelog: Document what changed in each version and why
- Don’t change too much at once: Isolate variables so you know what caused the improvement
- Grow your dataset: Add production failures after every iteration
- Set a target: Define what score counts as “good enough” so you know when to stop optimizing
- Automate what you can: Use online evals and automations to reduce manual review work