If you are building a customer service agent in 2026, the architecture is not a research question. The patterns from Sierra, Decagon, Intercom Fin, Forethought, Cresta, and Fini have stabilized. Your build either matches these patterns at deeper scale (which is hard) or goes beyond them in a specialized direction (which is rarer than teams think).
The category leaders trade at venture valuations that telegraph the size of the prize: Sierra at roughly $10B per Reuters reporting on its 2025 raise, and Decagon at roughly $4.5B per TechCrunch's coverage of its mid-2025 round. For most teams, the right answer is to buy a platform and read Customer Service Agent Architecture. For the teams that genuinely need to build, this post is the walkthrough. It covers the architectural pieces that distinguish working systems from prototypes, the failure modes that consume teams who skip those pieces, and the 90-day plan that produces something customers actually use.
Architecture overview
The simplified production architecture, end to end: intent classification, knowledge retrieval, response generation with grounding, action authorization, the verification pass, and the audit trail.
Each block matters. The hard parts cluster in three places: the knowledge graph (where retrieval quality lives), the action authorization layer (where regulatory and trust exposure concentrates), and the verification pass (the difference between trusted and abandoned).
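A minimal sketch of that flow in Python, with every stage stubbed. The function names are illustrative placeholders, not a specific SDK; the point is the ordering and the handoffs between blocks.

```python
# Minimal sketch of the end-to-end flow. Every stage is a stub standing in for a
# model call, an index, or a policy service; the names are illustrative.

def classify_intent(message: str) -> str:
    return "refund_request"                       # LLM: understand and route

def retrieve(intent: str, customer_id: str) -> list:
    return [{"id": "policy:returns-v3"}]          # knowledge graph + customer context

def generate(intent: str, candidates: list, customer_id: str) -> dict:
    return {"customer_facing_text": "...", "claims": [], "recommended_actions": []}

def authorize(action: dict, customer_id: str) -> bool:
    return False                                  # deterministic policy check, never the LLM

def verify(draft: dict, candidates: list) -> bool:
    return True                                   # citation, grounding, action confirmation

def handle(customer_id: str, message: str) -> dict:
    intent = classify_intent(message)
    candidates = retrieve(intent, customer_id)
    draft = generate(intent, candidates, customer_id)
    draft["recommended_actions"] = [
        a for a in draft["recommended_actions"] if authorize(a, customer_id)
    ]
    if not verify(draft, candidates):
        return {"escalate": True, "reason": "verification_failed"}
    # audit record written here in production, before anything reaches the customer
    return draft

print(handle("c-123", "Where is my refund?"))
```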
The knowledge graph
The single most important infrastructure piece. Without it, retrieval and generation produce inconsistent results, the team papers over the gaps with prompt engineering, and the system drifts toward hallucination.
What the knowledge graph contains
The schema below is the minimum viable shape. The fields that will save you in production are provenance on every policy, structured_attributes on policies (so the LLM does not have to interpret free text), and version plus effective_date so historical answers are reproducible.
knowledge_graph:
policies:
- policy_id: <uuid>
title: <text>
content: <text>
version: <id>
effective_date: <ISO>
jurisdiction: [<list>]
structured_attributes:
return_window_days: <int>
refund_eligibility: <enum>
exception_conditions: [<list>]
provenance:
source_url: <ref>
last_verified: <ISO>
approved_by: <person or role>
faqs:
- faq_id: <uuid>
question_canonical: <text>
question_variants: [<list>]
answer: <text>
linked_policies: [<list>]
confidence: high | medium | low
procedures:
- procedure_id: <uuid>
steps: [<list>]
preconditions: [<list>]
authorized_actions: [<list>]
escalation_path: <ref>
customer_data_schema:
accessible_fields: [<list>]
pii_fields: [<list with masking rules>]
action_authorization_levels: [<list>]

What matters in this schema. Three fields earn their keep. provenance.approved_by is the field that lets compliance defend a binding answer six months after it shipped, since "the LLM said so" does not survive a regulator. structured_attributes on policies is what stops the LLM from interpreting free-text policy at inference time, which is the precise failure mode that produced the Air Canada and Klarna stories. version plus effective_date lets you replay any past interaction against the policy state that was live at that moment, which is what discovery and chargeback defense actually require.
Schema field provenance at a glance
| Field | Why it earns its place | What breaks if you skip it |
|---|---|---|
| policies.provenance.source_url | Traceable origin for every binding answer | Regulators and customers cannot verify what the agent quoted |
| policies.structured_attributes | LLM consumes structured data, not free text | Hallucinated return windows, wrong refund amounts |
| policies.version and effective_date | Replay past interactions against past policy | Disputes resolve on today's policy instead of the policy at the time of contact |
| faqs.linked_policies | FAQs cannot drift from canonical policy | FAQ answers contradict policy after a policy update |
Why this matters
Klarna's pullback on full AI handling, reported by Bloomberg in mid-2025, and Air Canada's 2024 small claims loss over a chatbot that fabricated bereavement-fare refund eligibility share a single root cause. Both systems consumed free-text policy at inference time, the LLM interpreted, and the interpretation became the customer's binding answer.
The fix: structured policy data with explicit fields. The LLM consumes structured policy and renders human-readable explanations. It does not interpret raw policy documents at inference time.
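A minimal sketch of that split, assuming hypothetical field names that mirror structured_attributes above. Eligibility is decided by code over structured fields, and the customer-facing text is rendered from a template; nothing reads the raw policy document at answer time.

```python
from datetime import date, timedelta

# Hypothetical structured policy entry mirroring structured_attributes above.
RETURN_POLICY = {
    "policy_id": "returns-2026-01",
    "version": "3",
    "return_window_days": 30,
    "exception_conditions": ["final_sale", "gift_card"],
}

def render_return_answer(purchase_date: date, category: str, today: date) -> str:
    """Decide eligibility from structured fields, then render a templated answer."""
    if category in RETURN_POLICY["exception_conditions"]:
        return "This item is final sale, so it is not eligible for return."
    deadline = purchase_date + timedelta(days=RETURN_POLICY["return_window_days"])
    if today <= deadline:
        return (f"You can return this item until {deadline.isoformat()} "
                f"({RETURN_POLICY['return_window_days']} days from purchase).")
    return "This purchase is outside the return window, so I can't start a return here."

print(render_return_answer(date(2026, 1, 10), "apparel", today=date(2026, 1, 20)))
```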
Replay any answer in seconds, not days
Respan tracing captures every retrieval candidate, the cited knowledge graph entry, the policy version in effect, and the rendered customer-facing claim as one connected trace. When legal asks "what did the agent tell that customer on March 12," you replay the chain in the UI rather than excavating logs. Start at platform.respan.ai.
Knowledge retrieval
Hybrid retrieval that handles different query types:
Lexical retrieval for keyword-specific queries ("return policy for shoes," "shipping cost to Canada").
Semantic retrieval for descriptive queries ("how long does it take to get a refund," "can I exchange a product I bought as a gift").
Structured query for parametric questions ("when did I last order," "what's my account balance"). These resolve directly against customer data systems, not through retrieval.
Customer-context-aware retrieval. Retrieval is conditioned on which customer is asking. A premium tier customer's return window is different, a customer in California has different policy applicability, and a customer with prior account standing issues triggers different escalation paths.
The retrieval output is candidate context for the generation layer. Each candidate is structured with provenance.
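A minimal routing sketch over the three retrieval modes, with the underlying stores stubbed (in production these would be a keyword index, a vector store, and the customer data API). The names and patterns are illustrative, and the jurisdiction filter stands in for the fuller customer-context conditioning described above.

```python
import re

# Hypothetical hybrid-retrieval router; retriever internals are stubbed.

def bm25_search(query: str) -> list:
    return [{"id": "faq-12", "jurisdiction": ["US", "CA"], "text": "Return policy for shoes..."}]

def vector_search(query: str) -> list:
    return [{"id": "policy:refunds-v3", "jurisdiction": ["US"], "text": "Refund timelines..."}]

def customer_data_lookup(query: str, customer: dict) -> list:
    return [{"id": f"order-history:{customer['id']}", "text": "Last order: 2026-01-14"}]

PARAMETRIC_PATTERNS = [r"\bmy (order|balance|account)\b", r"\bwhen did i\b"]

def retrieve(query: str, customer: dict) -> list:
    q = query.lower()
    if any(re.search(p, q) for p in PARAMETRIC_PATTERNS):
        candidates = customer_data_lookup(q, customer)   # structured: customer systems, not retrieval
    elif len(q.split()) <= 5:
        candidates = bm25_search(q)                      # lexical: keyword-style queries
    else:
        candidates = vector_search(q)                    # semantic: descriptive queries
    # Customer-context filter: jurisdiction (and in practice tier, account standing)
    # changes which candidates are applicable.
    return [c for c in candidates
            if customer["jurisdiction"] in c.get("jurisdiction", [customer["jurisdiction"]])]

print(retrieve("when did I last order", {"id": "c-123", "jurisdiction": "US"}))
```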
Response generation with grounding
The generation layer is where hallucination prevention concentrates.
Strict RAG with citation requirements. Every claim in the response cites a specific knowledge graph entry. Ungrounded claims fail post-generation validation.
Structured output, not free prose. The response is constructed from templates filled with structured policy and customer data, not freely-generated text that happens to mention policy.
Confidence per claim. The generation layer reports confidence per individual claim, not per response overall. A response with three claims has three confidence values. Low-confidence claims trigger verification or escalation; a minimal gating sketch follows below.
Customer-aware tone. The response style adapts to the channel (chat vs voice has different pacing), customer history (frustrated customers get more empathetic openings), and query category (refund denials are handled with care).
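A minimal gating sketch for the per-claim confidence reporting above. The threshold is illustrative; the structural point is that one shaky claim can pull the whole response toward escalation even when the rest of the answer is solid.

```python
# Hypothetical per-claim confidence gate; the threshold is illustrative.
ESCALATE_BELOW = 0.6

def low_confidence_claims(claims: list) -> list:
    return [c for c in claims if c["confidence"] < ESCALATE_BELOW]

draft_claims = [
    {"claim_text": "Your plan renews on Feb 1.", "confidence": 0.97},
    {"claim_text": "The renewal price is $14/month.", "confidence": 0.55},
]
shaky = low_confidence_claims(draft_claims)
print("escalate" if shaky else "proceed to verification",
      [c["claim_text"] for c in shaky])
```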
A simplified response schema. The fields that earn their keep here are claims (one entry per atomic factual statement, each independently citable and verifiable) and recommended_actions (the proposed action plus its authorization status, kept distinct from the customer-facing text).
response:
channel: chat | email | voice | other
customer_facing_text: <text>
claims:
- claim_text: <text>
cited_source: <ref to knowledge graph>
confidence: <float>
verified: <boolean>
recommended_actions:
- action_type: <enum>
authorization_required: <enum>
execution_status: pending | executed | denied
escalation_recommendation:
should_escalate: <boolean>
escalation_reason: <text if applicable>
escalation_target: human_agent | specialist | manager
metadata:
intent_classified: <text>
knowledge_retrieved: [<list>]
generation_model_version: <id>

What matters in this schema. The claims array is the single most important structural decision in the response layer. Modeling claims as discrete, citable, individually-verifiable units (rather than a blob of customer_facing_text with a single response-level confidence) is what lets the verification pass actually do its job. The second non-obvious field is recommended_actions[].execution_status. Keeping the proposed action separate from the executed action is what prevents the agent from saying "I have issued your refund" before the refund has actually been issued, which is the cleanest source of trust failures in production. generation_model_version looks like metadata clutter until you ship a regression and need to bisect by model version.
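A minimal sketch of that execution_status separation, using hypothetical phrase lists and field names drawn from the schema above. The keyword matching is deliberately crude; a production system ties claims to action IDs rather than scanning phrasing, but the guard's shape is the same.

```python
# Hypothetical guard: the draft may only claim an action if its execution_status
# is "executed". Phrase lists and field names mirror the schema above.
ACTION_PHRASES = {
    "refund": ["issued your refund", "refund has been processed"],
    "account_change": ["updated your account", "changed your plan"],
}

def claims_unexecuted_action(customer_facing_text: str, recommended_actions: list) -> bool:
    text = customer_facing_text.lower()
    for action in recommended_actions:
        claimed = any(p in text for p in ACTION_PHRASES.get(action["action_type"], []))
        if claimed and action["execution_status"] != "executed":
            return True   # the draft promises something the system has not done
    return False

draft_text = "Good news -- I've issued your refund of $42."
actions = [{"action_type": "refund", "execution_status": "pending"}]
print(claims_unexecuted_action(draft_text, actions))   # True -> block before it reaches the customer
```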
Action authorization layer
The most legally consequential piece. The pattern: the LLM understands and routes, deterministic logic authorizes and executes.
The architecture
The schema below is two halves. action_request is what the LLM proposes. authorization_decision is what deterministic code returns. The two never collapse into one call.
action_request:
action_type: refund | return | exchange | account_change | discount | other
parameters: <map>
customer_id: <ref>
agent_context:
confidence: <float>
retrieved_authority: [<list>]
proposed_rationale: <text>
authorization_decision:
authorized: <boolean>
authorization_basis: <text>
bounded_amount: <decimal if applicable>
conditions: [<list>]
audit_record: <full lineage>

What matters in this schema. Two fields deserve attention. agent_context.confidence is captured for audit and tuning, but it is explicitly not an input to the authorization decision. The LLM's confidence does not override policy. authorization_basis is the human-readable rationale the deterministic check writes into the audit record. When a chargeback or a regulator asks why this refund was issued, authorization_basis is the sentence that answers, and it is written by code, not by the model.
Authorization through deterministic logic. The LLM proposes the action. A separate code path checks policy eligibility, customer authorization, and any thresholds. The LLM's confidence does not override policy.
Bounded authority limits. The agent can authorize refunds up to $X without escalation. Refunds above $X require human approval. Discount stacking is bounded by configured rules. Subscription cancellations within the policy window are handled directly; outside the window they require escalation. The bounds are configured per merchant or business unit. A minimal sketch of this check appears below.
Pattern detection on customer behavior. Customers exhibiting fraud patterns (rapid agent-style messaging, return rates above thresholds, claims that don't match purchase history) get flagged for human review. This is fraud detection adapted to support traffic.
Audit trail per action. Every action attempt produces a record: the LLM's proposal, the authorization check result, the executed outcome, the customer's reaction. Discovery and chargeback defense both depend on this.
The defense against the LLM-vs-LLM dynamic (customers running their own LLMs to argue with merchant LLMs) is structural. Whatever the customer's agent argues, the merchant's deterministic authorization logic checks against actual policy. The LLM communicates the answer; it does not produce it.
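A minimal sketch of the deterministic check, with illustrative bounds (real limits are configured per merchant or business unit). The LLM's confidence rides along in the request for audit but is never read by the decision.

```python
from decimal import Decimal

# Hypothetical deterministic authorization for refunds; bounds are illustrative.
REFUND_AUTO_LIMIT = Decimal("75.00")   # above this, escalate to a human
RETURN_WINDOW_DAYS = 30

def authorize_refund(request: dict) -> dict:
    amount = Decimal(str(request["parameters"]["amount"]))
    days_since_purchase = request["parameters"]["days_since_purchase"]
    # request["agent_context"]["confidence"] is logged for audit, never consulted here.
    if days_since_purchase > RETURN_WINDOW_DAYS:
        return {"authorized": False,
                "authorization_basis": f"Outside the {RETURN_WINDOW_DAYS}-day return window"}
    if amount > REFUND_AUTO_LIMIT:
        return {"authorized": False,
                "authorization_basis": f"${amount} exceeds the ${REFUND_AUTO_LIMIT} auto-approval bound"}
    return {"authorized": True,
            "bounded_amount": amount,
            "authorization_basis": f"In-policy refund within the ${REFUND_AUTO_LIMIT} bound"}

request = {
    "action_type": "refund",
    "parameters": {"amount": 42.00, "days_since_purchase": 12},
    "agent_context": {"confidence": 0.99},
}
print(authorize_refund(request))
```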
Block out-of-policy actions before they ship
The deterministic authorization layer is where Respan's evals and gateway earn their keep. CI-aware experiments block prompt or model changes that increase out-of-policy action proposals on a labeled gold set, and the gateway enforces per-action and per-customer spending caps before the call leaves your perimeter. Wire it up at platform.respan.ai.
Verification pass
Before any response or action reaches the customer, the verification pass runs these checks:
Citation validation. Every cited knowledge graph entry resolves and contains the cited content. Citations to nonexistent or modified entries fail.
Factual claim grounding. Every factual claim in the response traces to a knowledge graph entry or customer data field. Ungrounded claims are flagged.
Action authorization confirmation. Any cited action ("I've issued your refund") corresponds to an actually-executed action. Statements about actions that did not execute fail verification.
Pricing and availability currency. Cited prices match live prices, cited availability matches live inventory. Stale data in cited claims fails.
Customer-data accuracy. Statements about the customer's account, history, or preferences match actual customer state.
Failed verification triggers regeneration with refreshed context, graceful degradation to verified content, or escalation. Hallucinated content gets filtered before reaching the customer.
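A minimal sketch of those checks, with the knowledge graph and executed-action store as stand-in dicts. The lexical-overlap grounding check is deliberately crude (production systems use entailment or LLM-as-judge checks), but the shape of the pass is the same.

```python
# Hypothetical verification pass over a drafted response.
KNOWLEDGE_GRAPH = {
    "policy:returns-v3": "Items can be returned within 30 days of delivery.",
}
EXECUTED_ACTIONS = set()   # IDs of actions that actually ran

def verify(draft: dict) -> list:
    failures = []
    for claim in draft["claims"]:
        source_text = KNOWLEDGE_GRAPH.get(claim.get("cited_source"))
        if source_text is None:                                          # citation validation
            failures.append(f"citation does not resolve: {claim.get('cited_source')}")
        elif len(set(claim["claim_text"].lower().split())
                 & set(source_text.lower().split())) < 2:                # crude grounding check
            failures.append(f"claim not grounded in cited source: {claim['claim_text']}")
    for action in draft["recommended_actions"]:                          # action confirmation
        mentioned = action["action_type"] in draft["customer_facing_text"].lower()
        if mentioned and action.get("action_id") not in EXECUTED_ACTIONS:
            failures.append(f"text cites unexecuted action: {action['action_type']}")
    return failures

draft = {
    "customer_facing_text": "Your refund is on its way.",
    "claims": [{"claim_text": "Returns are accepted within 30 days of delivery.",
                "cited_source": "policy:returns-v2"}],
    "recommended_actions": [{"action_type": "refund", "action_id": "act-9"}],
}
print(verify(draft) or "verified")   # failures trigger regeneration, degradation, or escalation
```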
In observed cohorts of fintech and marketplace support traffic, between 5% and 10% of inbound interactions involve fee or charge disputes where a wrong answer is directly monetized as a chargeback or refund. Verification is what keeps that long tail from becoming a recurring loss line.
Audit trail
Every interaction produces a record. The fields that earn their keep are ai_processing (the full reasoning chain, not just the final response) and audit_metadata (model and prompt versions, knowledge graph version, retention rules).
interaction_record:
interaction_id: <uuid>
customer_id: <ref>
channel: chat | email | voice | other
start_timestamp: <ISO>
end_timestamp: <ISO>
conversation:
turns:
- speaker: customer | agent
content: <text>
timestamp: <ISO>
ai_processing:
intents_classified: [<list>]
knowledge_retrieved: [<list>]
actions_proposed: [<list>]
actions_authorized: [<list>]
escalations: [<list>]
confidence_distribution: <stats>
outcome:
resolution_status: resolved | escalated | abandoned | unresolved
customer_feedback: <if provided>
follow_up_within_7d: <boolean>
audit_metadata:
model_versions: <map>
prompt_versions: <map>
knowledge_graph_version: <id>
legal_hold: <boolean>
retention_expires_at: <ISO>

What matters in this schema. Three fields are non-obvious and load-bearing. ai_processing.confidence_distribution (not a single response-level confidence) is what lets you find the conversations where the model was uncertain on the claim that ended up wrong, which is where regression tests come from. audit_metadata.knowledge_graph_version is the field that lets you reproduce a past answer against the policy state at that moment, since the policy may have changed twice since the customer asked. audit_metadata.legal_hold is the field that overrides retention rules for any conversation tagged into an open dispute or regulatory matter, and forgetting it is the cheapest way to make a deletion policy delete evidence.
These records support:
- Regulatory examination (industry-specific compliance)
- Litigation defense (Air Canada-style chatbot statements as binding)
- Internal audit and continuous evaluation
- Customer dispute resolution
- Quality assurance and agent training
Storage scales with interaction volume. At roughly 5 to 50 KB of trace data per interaction, a platform handling 1 million interactions per day produces roughly 5 to 50 GB daily. Cold storage for older records, hot for recent, with retention rules per applicable industry.
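A minimal sketch of a retention sweep that respects legal_hold, using hypothetical field names from the audit record schema above. The non-obvious rule is the override: a record past its retention date still stays if it is tagged into an open matter.

```python
from datetime import datetime, timezone

def is_deletable(record: dict, now: datetime) -> bool:
    """A record is deletable only if it is past retention AND not under legal hold."""
    meta = record["audit_metadata"]
    if meta["legal_hold"]:
        return False   # the hold overrides retention, no matter how old the record is
    return datetime.fromisoformat(meta["retention_expires_at"]) <= now

record = {
    "interaction_id": "a1b2c3",
    "audit_metadata": {
        "legal_hold": True,
        "retention_expires_at": "2025-03-12T00:00:00+00:00",
    },
}
print(is_deletable(record, now=datetime(2026, 1, 5, tzinfo=timezone.utc)))   # False: expired but on hold
```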
Build order
The system depends on lower layers being correct before higher layers can be evaluated. Skipping the order produces a v1 that looks like it works in demos and silently breaks in production. The order:
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Knowledge graph: structured policies, FAQs, customer data schema with provenance per entry | KB coverage on a 100-query gold set; provenance traceable for every entry |
| 2 | Hybrid retrieval (lexical + semantic + structured) with customer-context awareness | Retrieval recall ≥ 85% on the gold set |
| 3 | Response generation with citation per claim and structured output | Citation grounding rate ≥ 95% on production sample |
| 4 | Action authorization layer with deterministic policy checks and bounded authority | In-policy actions authorized 100%, out-of-policy denied 100% |
| 5 | Verification pass (citation validation, fact grounding, action confirmation) | Hallucinated content blocked at the verifier ≥ 99% |
| 6 | Production sampling, eval pipeline, monitoring | Hallucination rate, resolution quality, escalation accuracy on dashboards |
Voice, multi-language, A/B infrastructure, and continuous KB improvement come after the core six are working. Wiring an LLM directly to support volume without the lower layers produces the failure modes the architecture is designed to prevent.
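A minimal sketch of a CI gate over the eval thresholds in the table above. The metric names and the results dict are stand-ins for whatever your eval pipeline emits; the shape is a deploy blocker, not a dashboard.

```python
# Gate thresholds mirror the build-order table; metric names are illustrative.
GATES = {
    "retrieval_recall": 0.85,
    "citation_grounding_rate": 0.95,
    "in_policy_authorized_rate": 1.00,
    "out_of_policy_denied_rate": 1.00,
    "verifier_block_rate": 0.99,
}

def failed_gates(results: dict) -> list:
    return [f"{metric}: {results.get(metric, 0.0):.3f} < {threshold:.2f}"
            for metric, threshold in GATES.items()
            if results.get(metric, 0.0) < threshold]

results = {                                   # stand-in for a real eval run on the gold set
    "retrieval_recall": 0.88,
    "citation_grounding_rate": 0.93,          # below gate: this deploy should not ship
    "in_policy_authorized_rate": 1.00,
    "out_of_policy_denied_rate": 1.00,
    "verifier_block_rate": 0.995,
}
failures = failed_gates(results)
if failures:
    raise SystemExit("blocked: " + "; ".join(failures))
print("all gates passed")
```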
Build vs buy
The honest framing for most teams. The architecture above is buildable, but six months and two FTEs of dedicated platform work is the floor, not the ceiling. Use this table to decide which side of the line you sit on.
| Dimension | Buy a platform (Sierra, Decagon, Intercom Fin) | Build on your own substrate |
|---|---|---|
| Time to first production traffic | 4 to 12 weeks | 6 to 12 months |
| Floor staffing | 1 PM, 1 CX lead | 2 to 4 engineers, 1 PM, 1 CX lead, ongoing |
| Knowledge graph ownership | Platform-managed, you curate content | You own schema, ingestion, versioning |
| Action authorization control | Configured policy, bounded by platform primitives | Deterministic logic you write, full control |
| Audit and compliance | Platform-default, may need extension | Built to your retention and discovery requirements |
| Cost shape | Per-interaction or seat-based, predictable | Mostly fixed engineering, variable infra |
| Right answer when | Standard support flows, 80% deflection target, fast time-to-value | Specialized vertical, regulatory specifics, or differentiation is the agent itself |
If two or more rows of the buy column describe your situation, buy. If two or more rows of the build column do, the rest of this guide is for you.
What separates serious builds from prototypes
After watching the customer service AI category through 2025 and into 2026:
The knowledge graph is the moat. Structured, versioned, attributed, current. Without it, everything downstream is unreliable.
Action authorization is deterministic. The LLM does not authorize through judgment. Deterministic logic does. The LLM communicates the result.
Verification is non-negotiable. Hallucinated facts and actions are existential trust failures. Verification runs on every response, not sampled.
Audit trail is queryable in seconds. Discovery and dispute response is fast and complete. The infrastructure was built before the inquiry.
Continuous evaluation is operational. Production sampling, adversarial testing, regression catches in CI. The eval set evolves, failures feed remediation.
Bounded authority preserves human judgment. Agents have specific authority for specific actions; everything beyond that goes to humans. Operational metrics measure quality, not deflection.
These are the practices that produce systems users return to. Without them, the system runs the Klarna trajectory.
Wire monitors before you wire traffic
Hallucination rate, citation validation failure rate, action authorization denial rate, and escalation rate per intent category should all be on a dashboard before your first percent of production traffic. Respan ships these monitors with Slack, email, PagerDuty, and webhook routing so the on-call hears about a citation regression before the customer does. Configure them at platform.respan.ai.
How Respan fits
Building a customer service agent that matches Sierra- and Decagon-grade reliability means treating tracing, evals, gateway, prompt management, and monitoring as first-class infrastructure. Respan is the substrate underneath the architecture above, from knowledge retrieval through the verification pass.
- Tracing: every customer interaction captured as one connected trace across intent classification, knowledge retrieval, response generation, action authorization, and verification. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a refund is wrongly issued or a citation fails to ground, you need to replay the full chain of retrieval candidates, cited policies, and authorization decisions in seconds, not dig through siloed logs.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated policy claims, fabricated refund eligibility, ungrounded citations, and out-of-policy action proposals before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Customer service traffic is bursty and price-sensitive, so semantic caching on repeat FAQ-style queries and fallback chains across providers protect both unit economics and uptime when a primary model degrades.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Intent classification prompts, response templates, escalation routing prompts, and verification prompts all belong in the registry so legal and CX leads can approve changes and you can roll back the moment a new variant starts confidently stating wrong fees.
- Monitors and alerts: hallucination rate, citation validation failure rate, action authorization denial rate, escalation rate, and resolution quality per intent category. Slack, email, PagerDuty, webhook. Alert the on-call the moment hallucination rate or citation failure rate crosses threshold, before the Air Canada-style binding statement reaches a customer.
A reasonable starter loop for customer service agent builders:
- Instrument every LLM call with Respan tracing including retrieval candidates, cited knowledge graph entries, action proposals, authorization decisions, and verification outcomes.
- Pull 200 to 500 production conversations into a dataset and label them for citation accuracy, action correctness, and resolution quality.
- Wire two or three evaluators that catch the failure modes you most fear (fabricated policy or refund eligibility, ungrounded citations, out-of-policy action authorizations).
- Put your intent classification, response generation, and verification prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so semantic caching on repeat queries cuts cost and fallback chains keep the agent online when a provider degrades.
Skip this loop and you ship the Klarna trajectory: confidently wrong policy answers, audit trails that cannot defend a chargeback, and a system customers learn not to trust.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Evaluating Customer Service LLMs: four-dimension eval framework
- Customer Service Agent Architecture: patterns from Sierra, Decagon, helpdesk-native
- How Customer Support Teams Build LLM Apps in 2026: pillar overview
