Building production chatbots demands rigorous evaluation beyond simple response accuracy. Hallucinated responses erode user trust, conversation derailment frustrates customers, and unchecked token usage can silently drain budgets. This checklist gives chatbot developers a structured framework to evaluate every dimension of LLM-powered conversational experiences before and after deployment.
Compare chatbot responses against verified knowledge base entries for factual claims. Build a golden dataset of at least 200 question-answer pairs spanning your domain. Track exact-match and semantic similarity scores over time.
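A minimal scoring harness for such a golden dataset might look like the sketch below. It uses `difflib`'s lexical ratio as a cheap stand-in for semantic similarity (a production setup would use embedding cosine similarity instead); the dataset shape and helper names are illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude lexical similarity in [0, 1]; swap in embedding-based
    cosine similarity for real semantic matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score_against_golden(golden: list[dict], get_answer) -> dict:
    """golden: [{'question': ..., 'answer': ...}, ...]
    get_answer: callable mapping a question to the bot's response."""
    exact = 0
    sims = []
    for pair in golden:
        response = get_answer(pair["question"])
        sims.append(similarity(response, pair["answer"]))
        if response.strip().lower() == pair["answer"].strip().lower():
            exact += 1
    return {
        "exact_match": exact / len(golden),
        "mean_similarity": sum(sims) / len(sims),
    }
```

Running this on every release and charting both metrics makes regressions visible before users see them.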
Verify that the chatbot cites actual sources when making claims, not fabricated references. Test with questions that require specific documentation or policy references. Flag any response that invents URLs, document names, or statistics.
Ensure the chatbot gracefully declines questions outside its knowledge domain rather than confabulating answers. Test with 50+ adversarial out-of-domain prompts. Measure the refusal rate vs. hallucination rate for unknown topics.
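One simple way to split out-of-domain outcomes into refusal vs. hallucination is marker-based labeling, sketched below. The marker list is an assumption and should be tuned to your bot's actual refusal phrasing (an LLM-as-judge classifier is the more robust option):

```python
REFUSAL_MARKERS = (
    "i don't know", "i'm not able", "outside my",
    "i can't help", "i do not have information",
)

def classify_ood_response(text: str) -> str:
    """Label an out-of-domain response as 'refusal' or 'hallucination'
    based on illustrative refusal phrases."""
    lowered = text.lower()
    if any(m in lowered for m in REFUSAL_MARKERS):
        return "refusal"
    return "hallucination"

def ood_rates(responses: list[str]) -> dict:
    """Compute refusal vs. hallucination rates over adversarial
    out-of-domain responses."""
    labels = [classify_ood_response(r) for r in responses]
    n = len(labels)
    return {
        "refusal_rate": labels.count("refusal") / n,
        "hallucination_rate": labels.count("hallucination") / n,
    }
```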
Validate that the chatbot does not present outdated information as current, especially for time-sensitive domains like pricing or policies. Create test cases with known date-dependent answers. Monitor for stale data after knowledge base updates.
Audit responses involving numbers, percentages, and calculations for correctness. Chatbots frequently approximate or invent statistics. Build a dedicated test set of quantitative questions with known answers.
Test responses to questions that require combining information from multiple knowledge sources. These queries are especially prone to hallucination since the model may fabricate bridging facts. Evaluate both the final answer and intermediate reasoning steps.
Check whether the chatbot gives contradictory answers to paraphrased versions of the same question. Run semantic equivalence tests across at least 30 question clusters. Any contradiction signals unreliable grounding.
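A paraphrase-consistency check can be sketched as below: for each cluster of paraphrased questions, flag it if any pair of the bot's answers diverges beyond a similarity threshold. The lexical ratio and the 0.6 threshold are placeholder assumptions; production setups typically use an NLI model or embedding similarity for contradiction detection:

```python
from difflib import SequenceMatcher
from itertools import combinations

def contradictory_clusters(clusters: dict[str, list[str]],
                           threshold: float = 0.6) -> list[str]:
    """clusters maps a cluster id to the bot's answers across
    paraphrases of the same question. A cluster is flagged when any
    pair of answers falls below the similarity threshold."""
    flagged = []
    for cid, answers in clusters.items():
        for a, b in combinations(answers, 2):
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() < threshold:
                flagged.append(cid)
                break
    return flagged
```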
Verify the chatbot does not mix up entities with similar names, such as confusing product variants or customer tiers. Create targeted test cases with easily confusable entities from your domain. Measure entity-level precision.
Evaluate whether the chatbot's expressed confidence aligns with actual accuracy. Models that say 'I'm certain' while being wrong are more dangerous than those that hedge appropriately. Track calibration curves across topic categories.
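A calibration curve can be computed by bucketing stated confidence and comparing it against measured accuracy per bucket, roughly as below (the bucket count and record shape are illustrative):

```python
def calibration_buckets(records, n_buckets=5):
    """records: (stated_confidence in [0, 1], was_correct bool) pairs.
    Returns per-bucket mean confidence, accuracy, and count; a
    well-calibrated bot has accuracy ~= confidence in every bucket."""
    buckets = [[] for _ in range(n_buckets)]
    for conf, correct in records:
        idx = min(int(conf * n_buckets), n_buckets - 1)
        buckets[idx].append((conf, correct))
    out = []
    for b in buckets:
        if not b:
            out.append(None)  # no samples landed in this bucket
            continue
        out.append({
            "mean_confidence": sum(c for c, _ in b) / len(b),
            "accuracy": sum(1 for _, ok in b if ok) / len(b),
            "count": len(b),
        })
    return out
```

Large gaps between mean confidence and accuracy in the top bucket are the dangerous "confidently wrong" cases this item targets.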
Systematically map where the chatbot's knowledge ends by testing increasingly obscure or specific questions. Understanding the boundary helps you set guardrails and fallback escalation triggers. Document the boundary for each topic cluster.
Test whether the chatbot maintains context across 5, 10, and 20+ turn conversations without losing track of the user's original intent. Use scripted dialogues that reference earlier turns. Measure context decay rate as conversation length increases.
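A scripted-dialogue harness for this kind of test can be as small as the sketch below: each turn optionally carries a checker that probes whether the bot still remembers earlier context. The `bot(history, message)` interface is an assumption; adapt it to your client:

```python
def run_scripted_dialogue(bot, script):
    """bot: callable(history, user_msg) -> reply.
    script: list of (user_msg, checker) where checker(reply) -> bool,
    or None when the turn is setup only.
    Returns (turn_index, passed) for every checked turn."""
    history, results = [], []
    for i, (msg, checker) in enumerate(script, start=1):
        reply = bot(history, msg)
        history.extend([("user", msg), ("bot", reply)])
        if checker is not None:
            results.append((i, checker(reply)))
    return results
```

Running the same script padded to 5, 10, and 20+ turns and plotting pass rate against depth gives you the context decay curve directly.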
Evaluate how gracefully the chatbot handles abrupt topic switches within a conversation. Users frequently jump between unrelated questions. The chatbot should neither confuse contexts nor lose prior conversation state.
Assess whether the chatbot asks relevant clarifying questions when user intent is ambiguous rather than guessing incorrectly. Test with intentionally vague inputs. Good clarification questions should narrow the problem space efficiently.
Test the chatbot's ability to recover when it gives an incorrect response and the user corrects it. The bot should acknowledge the correction, update its understanding, and not repeat the mistake. Track recovery success rate.
Verify that the chatbot correctly identifies whether the user is asking a question, making a complaint, requesting an action, or providing feedback. Misclassified dialogue acts lead to irrelevant responses. Test across 8+ dialogue act categories.
Monitor for conversational loops where the chatbot repeats the same response or question despite user attempts to move forward. Implement automated loop detection in evaluation pipelines. This is a top driver of user frustration.
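A baseline loop detector only needs to spot consecutive repeated bot turns, as in this sketch (the repeat threshold of 3 is an assumption; near-duplicate detection via similarity would catch lightly reworded loops too):

```python
def has_loop(bot_turns: list[str], repeat_threshold: int = 3) -> bool:
    """Flag a conversation when the bot emits the same normalized
    response `repeat_threshold` or more times in a row."""
    run = 1
    for prev, cur in zip(bot_turns, bot_turns[1:]):
        if cur.strip().lower() == prev.strip().lower():
            run += 1
            if run >= repeat_threshold:
                return True
        else:
            run = 1
    return False
```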
Test whether the chatbot correctly resolves pronouns like 'it', 'that', and 'they' to the right entities from conversation history. Coreference errors cause subtle but confusing misunderstandings. Build test cases with complex reference chains.
Evaluate whether the chatbot adapts its tone when users express frustration, urgency, or satisfaction. A cheerful response to an angry customer is a critical failure. Test with sentiment-tagged conversation scripts.
If the chatbot summarizes prior conversation turns, verify the summaries are accurate and complete. Summarization errors compound over long conversations and lead to context drift. Compare summaries against ground truth at regular intervals.
Test whether the chatbot provides clear closure when a conversation concludes, including next steps or follow-up options. Abrupt endings leave users uncertain whether their issue was resolved. Evaluate end-of-conversation satisfaction.
Define measurable brand voice attributes (e.g., formality level, humor usage, empathy markers) and score responses against them. Use a rubric with 5+ dimensions rated on a consistent scale. Track drift over prompt changes and model updates.
Test whether the chatbot maintains its assigned persona when users attempt jailbreaking or role-reversal prompts. The bot should not adopt a different personality or break character. Run red-team exercises with at least 30 adversarial scenarios.
Verify that the chatbot maintains a consistent tone whether discussing billing disputes, product features, or casual conversation. Tone should vary appropriately by context but remain within brand guidelines. Sample from 10+ topic categories.
Ensure the chatbot matches the expected formality for your audience segment. A B2B enterprise chatbot using slang or a Gen Z consumer bot being overly formal both hurt engagement. Calibrate formality on a defined scale.
Test how the chatbot responds to user humor and sarcasm without misinterpreting intent. Also verify that any chatbot humor is appropriate and on-brand. Mishandled humor is a frequent source of PR incidents.
Evaluate responses for cultural appropriateness across your target demographics. Test with culturally specific references, holidays, and customs. A culturally insensitive response can cause significant brand damage.
Monitor whether response lengths are appropriate and consistent. Excessively verbose responses waste user time; overly terse responses feel unhelpful. Define target length ranges per query type and measure compliance.
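Length compliance is easy to automate once targets are defined per query type; the ranges below are purely illustrative placeholders:

```python
# Illustrative character ranges; define your own per query type.
LENGTH_TARGETS = {
    "faq": (40, 300),
    "troubleshooting": (150, 900),
    "smalltalk": (10, 150),
}

def length_compliance(samples):
    """samples: (query_type, response_text) pairs. Returns the share
    of responses inside the target range for each query type."""
    hits, totals = {}, {}
    for qtype, text in samples:
        lo, hi = LENGTH_TARGETS[qtype]
        totals[qtype] = totals.get(qtype, 0) + 1
        if lo <= len(text) <= hi:
            hits[qtype] = hits.get(qtype, 0) + 1
    return {q: hits.get(q, 0) / t for q, t in totals.items()}
```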
Assess whether the chatbot expresses appropriate empathy in support scenarios without sounding scripted or insincere. Use human evaluators to rate empathy quality on a Likert scale. Compare across different complaint categories.
If the chatbot operates in multiple languages, verify that the persona and tone are consistent across all supported languages. Personality often shifts during translation. Test identical scenarios in each language.
After every model update, system prompt change, or fine-tuning iteration, run a standardized personality test suite. Track personality metrics on a dashboard over time. Set automated alerts for significant personality drift.
Measure average input and output tokens per conversation across user segments and query types. Identify the top 10% most expensive conversations and analyze root causes. This baseline is essential for any optimization effort.
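Computing that baseline from conversation logs is straightforward; the record shape below is an assumption about your logging schema:

```python
def token_baseline(conversations):
    """conversations: dicts with 'id', 'input_tokens', 'output_tokens'.
    Returns the mean total tokens per conversation and the most
    expensive decile of conversation ids for root-cause review."""
    totals = sorted(
        ((c["id"], c["input_tokens"] + c["output_tokens"]) for c in conversations),
        key=lambda t: t[1],
        reverse=True,
    )
    n = len(totals)
    top_n = max(1, n // 10)  # top 10%, at least one conversation
    return {
        "mean_tokens": sum(t for _, t in totals) / n,
        "top_decile_ids": [cid for cid, _ in totals[:top_n]],
    }
```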
Review system prompts and few-shot examples for unnecessary verbosity that inflates every request's token count. Even small reductions in system prompt length compound across millions of conversations. Target 20%+ reduction without quality loss.
Measure time-to-first-token and total response time under realistic load conditions. Users expect chatbot responses within 1-2 seconds for simple queries. Benchmark against competitors and user satisfaction thresholds.
Evaluate hit rates for semantic caching of common queries. High-frequency questions like business hours or return policies should be cached to eliminate redundant LLM calls. Measure cost savings and latency improvement from caching.
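The cache sketch below normalizes queries lexically before lookup and tracks its own hit rate; it is a stand-in only, since a true semantic cache matches on embedding similarity rather than normalized strings:

```python
import re

class NormalizedCache:
    """Minimal stand-in for a semantic cache: lowercase and strip
    punctuation before lookup, and track hit/miss counts."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", query.lower()).strip()

    def get_or_call(self, query, llm_call):
        key = self._key(query)
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        answer = llm_call(query)  # only reached on a cache miss
        self.store[key] = answer
        return answer

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```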
If using multiple models, analyze whether simpler queries are being routed to expensive large models unnecessarily. Implement tiered routing where GPT-4-class models handle complex queries and smaller models handle FAQ-type questions. Track per-tier costs.
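A first-pass router can be a plain heuristic like the sketch below; the word-count and turn-count thresholds and the complexity markers are placeholders to calibrate on labeled traffic (many teams later replace this with a small classifier):

```python
def route_query(query: str, history_turns: int) -> str:
    """Toy routing heuristic: short, single-intent queries go to the
    cheap model; long, multi-turn, or reasoning-heavy queries go to
    the large model. All thresholds are illustrative."""
    complex_markers = ("why", "compare", "explain", "troubleshoot")
    lowered = query.lower()
    if (len(query.split()) > 25
            or history_turns > 6
            or any(m in lowered for m in complex_markers)):
        return "large-model"
    return "small-model"
```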
Analyze how much of the context window is used by conversation history vs. system prompt vs. retrieved context. Implement intelligent conversation summarization to reduce history token usage. Monitor context window waste.
For non-real-time workloads like conversation analysis or training data generation, evaluate batch API pricing and throughput. Batch processing can reduce costs by 50% for offline tasks. Measure queue latency impact.
Implement and test per-conversation and per-user token budgets to prevent runaway costs from adversarial or exceptionally long conversations. Verify that budget limits trigger graceful degradation, not abrupt cutoffs. Test edge cases.
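Graceful degradation usually means a soft threshold before the hard cap, so the bot can wind the conversation down instead of cutting it off mid-answer. A minimal budget tracker, with illustrative limits:

```python
class ConversationBudget:
    """Per-conversation token budget with a soft warning threshold
    ahead of the hard cap."""

    def __init__(self, hard_cap: int = 8000, soft_ratio: float = 0.8):
        self.hard_cap = hard_cap
        self.soft_cap = int(hard_cap * soft_ratio)
        self.used = 0

    def record(self, tokens: int) -> str:
        """Record a turn's token usage and return the budget state."""
        self.used += tokens
        if self.used >= self.hard_cap:
            return "cutoff"     # summarize and offer handoff, not a dead stop
        if self.used >= self.soft_cap:
            return "wind_down"  # shorter replies, suggest wrapping up
        return "ok"
```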
Calculate the fully loaded cost of each successful conversation resolution, including LLM tokens, embedding calls, and infrastructure. Compare against human agent cost per resolution. Track this metric weekly.
Simulate peak traffic scenarios (3-5x normal load) and measure latency degradation, error rates, and cost spikes. Identify the breaking point of your infrastructure. Plan auto-scaling thresholds based on results.
Verify that the chatbot does not store, repeat, or leak personally identifiable information shared by users. Test with synthetic PII inputs including SSNs, credit card numbers, and email addresses. Ensure PII is redacted from logs.
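A baseline log-redaction pass can be regex-based, as sketched below. These patterns are illustrative and deliberately coarse; production systems layer NER-based PII detection on top of them:

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact_pii(text: str) -> str:
    """Replace common PII shapes with labeled placeholders before
    the text ever reaches logs."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text
```

Feeding synthetic PII through the full pipeline and then grepping logs for the raw values verifies the redaction actually holds end-to-end.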
Run red-team tests to ensure the chatbot cannot be manipulated into generating harmful, offensive, or illegal content. Use established prompt injection and jailbreak datasets. Test at least 100 adversarial prompts per release cycle.
Verify chatbot responses comply with relevant regulations (GDPR, CCPA, HIPAA, industry-specific rules). Financial advice, medical information, and legal guidance require specific disclaimers. Audit for compliance gaps quarterly.
Test for demographic biases in chatbot responses across gender, race, age, and other protected categories. Use standardized bias benchmarks adapted to your domain. Document bias metrics and remediation efforts.
Test the chatbot's resilience to prompt injection attacks that attempt to override system instructions. Include indirect injection via user-supplied content like pasted text or URLs. Measure bypass success rate.
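One practical way to measure bypass success rate is to plant a canary string in the system prompt and count how often adversarial prompts extract it; the canary value below is a hypothetical example:

```python
# Hypothetical secret planted in the system prompt for testing.
CANARY = "SYSTEM-CANARY-7f3a"

def bypass_rate(responses: list[str]) -> float:
    """Share of adversarial responses that leaked the planted canary,
    i.e. the injection overrode the system instructions."""
    leaks = sum(1 for r in responses if CANARY in r)
    return leaks / len(responses)
```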
Test with empty inputs, extremely long messages, special characters, mixed languages, and malformed text. The chatbot should handle all edge cases gracefully without crashes or nonsensical responses. Build an automated edge case test suite.
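The skeleton of such a suite is small; the cases below are illustrative and the pass criterion (a non-empty string reply, no exception) is an assumption to adapt to your bot's contract:

```python
EDGE_CASES = [
    "",                          # empty input
    "a" * 20_000,                # extremely long message
    "'; DROP TABLE users;--",    # injection-flavored text
    "h\u00e9llo \u4e16\u754c",   # mixed scripts
    "\x00\x1f",                  # control characters
]

def run_edge_cases(bot):
    """Each case passes if the bot returns a non-empty string without
    raising. Extend EDGE_CASES with domain-specific malformed inputs."""
    results = []
    for case in EDGE_CASES:
        try:
            reply = bot(case)
            ok = isinstance(reply, str) and reply.strip() != ""
        except Exception:
            ok = False
        results.append((case[:20], ok))
    return results
```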
Verify that the chatbot correctly identifies when to escalate to a human agent based on sentiment, complexity, or explicit user request. Missed escalations cause customer churn; false escalations waste agent time. Test with 50+ escalation scenarios.
Audit that conversation data is stored, retained, and deleted according to your data retention policies. Verify that users can request conversation deletion and that it is executed completely. Test the deletion workflow end-to-end.
Ensure chatbot responses work well with screen readers, support plain language alternatives, and meet WCAG guidelines for any visual elements. Test with assistive technology users. Document accessibility compliance status.
Test whether conversations can be resumed after system outages or failovers. Users should not lose conversation context due to infrastructure issues. Simulate failure scenarios and measure conversation recovery success rate.
Respan helps chatbot teams run continuous evaluations across accuracy, coherence, personality, and safety dimensions. Connect your conversation logs and get automated quality scores with actionable insights — no manual annotation required.
Try Respan free