The first four months of 2026 produced more concrete change in insurance AI than any prior year.
In January 2026, the NAIC launched a multistate pilot of the AI Systems Evaluation Tool, running through September 2026 across California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. The Tool gives state examiners a structured framework for reviewing insurer AI governance during market conduct examinations. Adoption is anticipated at the NAIC's Fall 2026 National Meeting, which would move AI oversight from principle-based bulletin to operationalized examination.
On December 11, 2025, Executive Order 14365 created a federal-state preemption fight on AI regulation. The NAIC publicly opposed the EO within days. Carriers now operate under both regimes simultaneously. State departments continue issuing bulletins, demanding inventories, and running examinations. The federal government asserts authority that may or may not preempt those activities.
On March 9, 2026, a federal magistrate judge in the District of Minnesota ordered UnitedHealth to produce internal documents on whether nH Predict, its AI claims algorithm, was designed to override clinical judgment. The court granted broad discovery across six of seven document categories. The case alleges a 90% error rate measured against appeal reversals. Cigna and Humana face parallel claims involving PXDX and similar algorithms. The litigation environment for any AI system in insurance claims is now structured around discovery, not just merits.
On March 17, 2026, Cytora launched Autopilot, a major agentic AI capability for end-to-end underwriting and claims workflow automation. Sixfold raised $30 million in January 2026 to scale its AI Underwriter, deployed at Zurich North America (200+ underwriters, two hours saved per submission), Guardian, Generali Global Corporate & Commercial, and Skyward Specialty (35% reduction in quote response time). Cytora's strategic alliance with LexisNexis Risk Solutions, announced April 23, 2026, embeds rich risk data into LLM-driven underwriting workflows. The category has matured.
For engineering teams building insurance AI products, the implications are concrete. The principle-based regulatory environment that existed through 2024 and most of 2025 is being replaced by an examination-ready environment in 2026 and beyond. Models without documented validation, monitoring, and audit trails fail the 2026 examination cycle. Claims AI without bias testing and reversal feedback becomes a class action. Carriers that built the infrastructure as a byproduct of operations get through cleanly; carriers that scramble during the exam find gaps.
This post is the engineering view of the insurance AI stack in 2026. It covers the five architectural patterns serious products converge on, the dual-regime regulatory environment, and where the engineering loop typically breaks.
The market in one paragraph
By mid-2026, insurance AI has split into five recognizable shapes. Submission and underwriting platforms like Cytora, Sixfold, Artificial Labs Ava, and Federato handle intake, triage, enrichment, and pricing recommendations for commercial and specialty insurance. Claims processing systems like EvolutionIQ (disability), Five Sigma (P&C), and various carrier-internal systems handle FNOL through coverage determination. Document and data extraction tools like V7 Go, Eigen Technologies, and Reducto convert unstructured submissions into structured data with field-level provenance. Compliance and data quality systems like DQPro (45% of Lloyd's market) and various AIS Program platforms support governance and examination readiness. Customer service and distribution AI handles policyholder service, broker support, and FAQ. Each shape has different audiences, different regulatory exposure, and different competitive moats. Engineers building for insurance need to know which shape they are building.
Pattern 1: AIS Program as architectural foundation
The NAIC Model Bulletin's AI Systems Program is what survives examination. The Bulletin (adopted by 24+ states plus DC) describes it as principle-based; the AI Evaluation Tool pilot is what makes it operational.
What an AIS Program requires:
- Model inventory with sufficient metadata for an examiner to understand what each model does, who owns it, and where it sits in the business
- Pre-deployment validation files for every Tier 1 model, including model cards, validation methodology, bias testing, and approvals
- Continuous monitoring evidence with documented metrics, frequencies, and threshold actions
- Adverse outcome and complaint records capturing AI involvement in consumer-impacting decisions
- Third-party model oversight with vendor evaluation files, audit rights, and remediation pathways
The pattern that fails: an AIS Program assembled retrospectively when the examiner arrives. The pattern that works: each component generated as a byproduct of how the systems run.
The full breakdown of what each component requires and the dual-regime context (Executive Order 14365 plus state authority) is in The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot.
Pattern 2: AI as decision support, not decisioner
This is the architectural pattern that distinguishes claims AI that survives litigation from claims AI that produces it. The principle: the AI surfaces information and recommendations; the human adjuster makes determinations and records reasoning. The distinction is enforced by architecture, not procedure.
The patterns that work:
| Pattern | When |
|---|---|
| AI as decision support with mandatory human disposition | Coverage determination, utilization review, fraud flagging |
| AI as classifier, human as decisioner | Triage, intake, routing where AI output is operational |
| AI as input prep for deterministic rule engine | Personal lines with clear policy language |
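The first pattern in the table, mandatory human disposition, is ultimately a type-system property: there is no code path from AI recommendation to final determination that skips a named human. A minimal sketch under that assumption, with all names hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AIRecommendation:
    claim_id: str
    recommendation: str   # e.g. "approve", "deny", "refer"
    model_version: str
    rationale: str

@dataclass(frozen=True)
class Determination:
    claim_id: str
    disposition: str
    adjuster_id: str
    reasoning: str        # the adjuster's own recorded reasoning
    decided_at: datetime

def finalize(rec: AIRecommendation, disposition: str,
             adjuster_id: str, reasoning: str) -> Determination:
    """The only constructor path to a final determination.

    There is deliberately no function that promotes the AI recommendation
    to a Determination automatically; a named human and recorded reasoning
    are required arguments, not optional metadata.
    """
    if not adjuster_id or not reasoning.strip():
        raise ValueError("determination requires a named adjuster and recorded reasoning")
    return Determination(
        claim_id=rec.claim_id,
        disposition=disposition,
        adjuster_id=adjuster_id,
        reasoning=reasoning,
        decided_at=datetime.now(timezone.utc),
    )
```

Note that `disposition` is independent of `rec.recommendation`: the adjuster can and routinely should diverge, and the two fields together are what downstream agreement-rate monitoring consumes.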
The pattern that produced UnitedHealth's nH Predict litigation appears to have been an AI advisory tool deployed operationally in ways that made its recommendations determinative. The technical architecture allowed adjuster judgment; the operational metrics did not. Carriers that build the architecture and the metrics together protect adjuster judgment as the system of record.
The detailed architectural patterns and the specific operational practices that protect adjuster judgment are in Building Claims AI Without Becoming the Next nH Predict.
Pattern 3: Provenance and audit trail as infrastructure
Across underwriting and claims, the audit trail is what separates defensible from exposed. The Estate of Lokken v. UnitedHealth discovery order on March 9, 2026 specifically targeted documents about model design, training data, and operational practice. Carriers without queryable, indexed, retained audit trails face discovery costs in the millions and findings that drive settlements higher.
What the audit infrastructure requires:
- Field-level provenance from extracted facts back to source documents, with specific location references
- Decision lineage capturing model versions, prompt versions, retrieved context, and human review
- Immutable records with cryptographic tamper-evidence
- Cross-references between claims, policies, models, and human decisions
- Indexed queries that return common discovery requests in seconds rather than weeks
- Retention for the regulatory period (7-10 years for most insurance), with appropriate storage tiering
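The tamper-evidence requirement can be met with a hash chain over decision records, the same construction append-only logs use: altering any earlier record invalidates every later hash. A minimal sketch, not tied to any particular product's audit API:

```python
import hashlib
import json

def append_record(chain: list[dict], record: dict) -> list[dict]:
    """Append a decision record linked by hash to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else "genesis"
    body = {**record, "prev_hash": prev_hash}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    chain.append({**body, "hash": digest})
    return chain

def verify(chain: list[dict]) -> bool:
    """Recompute every link; False means the trail was altered."""
    prev_hash = "genesis"
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body.get("prev_hash") != prev_hash:
            return False
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if recomputed != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

In production this sits behind whatever storage tier holds the records; the point is that `verify` is cheap enough to run continuously, so tampering surfaces as an alert rather than a discovery finding.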
V7 Go's "visual grounding" approach (linking every extracted field back to its exact source location in the document) is one specific implementation. Cytora's "explainable agentic reasoning" with "every workflow step fully auditable" is another. The pattern is recognizable across serious products.
Pattern 4: Bias monitoring as continuous infrastructure
Continuous bias monitoring is required by NAIC Model Bulletin Section 4 and operationalized by the AI Evaluation Tool. It is required explicitly by Colorado SB 21-169 (now covering life, auto, and health), NY DFS Circular Letter 2024-7, and California SB 1120 (for health AI), and by the EU AI Act, which classifies insurance pricing as high-risk.
The continuous monitoring that mature systems implement:
- Selection rate per protected group in underwriting, computed weekly with disparate impact ratios
- Pricing parity controlled for legitimate risk factors
- Claim denial rates per group in claims processing
- Settlement amounts per group controlled for severity
- Time-to-resolution per group across all claims
- Appeal rates and reversal rates per group
The architectural prerequisite: demographic data isolated from inference paths. Models that physically cannot access demographic data have stronger legal defense and produce cleaner monitoring. Joining demographic data to outcomes for measurement happens at the audit and monitoring layer, not at inference time.
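The measurement pattern can be sketched in a few lines: outcomes join to demographic data only inside the monitor, never on the inference path. The four-fifths threshold below is the classic EEOC-style illustration, not a regulatory constant for insurance; real thresholds come from actuarial and legal review:

```python
def selection_rates(outcomes: dict[str, bool],
                    demographics: dict[str, str]) -> dict[str, float]:
    """Join outcomes to demographics at the monitoring layer only.

    outcomes:     applicant_id -> selected/approved (True/False)
    demographics: applicant_id -> protected group label
    The inference path never sees `demographics`; this join exists solely
    so the monitor can compute per-group selection rates.
    """
    by_group: dict[str, list[bool]] = {}
    for applicant_id, selected in outcomes.items():
        group = demographics.get(applicant_id)
        if group is not None:
            by_group.setdefault(group, []).append(selected)
    return {g: sum(v) / len(v) for g, v in by_group.items()}

def disparate_impact_alerts(rates: dict[str, float],
                            threshold: float = 0.8) -> list[str]:
    """Flag groups whose selection rate falls below `threshold` times
    the highest group's rate (four-fifths-rule style check)."""
    if not rates:
        return []
    top = max(rates.values())
    return [g for g, r in rates.items() if top > 0 and r / top < threshold]
```

Run weekly per the cadence above, the output of `disparate_impact_alerts` is what feeds the alerting channel, not a dashboard someone has to remember to open.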
The eval framework for underwriting LLMs specifically, including the four-dimension framework (risk selection accuracy, calibration, bias, audit grounding), is in Evaluating Underwriting LLMs.
Pattern 5: Vendor oversight as carrier-side discipline
At its March 23, 2026 session, the NAIC's Third-Party Data and Models Working Group sketched a vendor registry framework. If adopted later in 2026, the registry will require AI vendors to file information with regulators on a defined cadence. The registry creates regulator-side visibility; it does not transfer carrier accountability.
Section 4 of the Model Bulletin places the diligence obligation on the carrier regardless of whether the model is built in-house or licensed. The vendor evaluation file every carrier should maintain for Tier 1 third-party models includes:
- Vendor model card with architecture, training data sources, intended use, limitations
- Bias testing artifacts and methodology
- Validation evidence specific to the carrier's deployment context
- Contractual rights to inspect, audit, and demand updates
- Monitoring SLAs with the vendor
- Remediation pathway if the model produces biased or inaccurate outcomes
- Version history and change notification
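The checklist above lends itself to a mechanical completeness check, so gaps surface during procurement and renewals rather than during an examination. A minimal sketch; the artifact names are illustrative labels for the list items above, not a standard taxonomy:

```python
# Artifacts a Tier 1 vendor evaluation file is expected to contain.
REQUIRED_ARTIFACTS = [
    "model_card",             # architecture, training data sources, intended use, limitations
    "bias_testing",           # testing artifacts and methodology
    "deployment_validation",  # evidence specific to this carrier's context
    "audit_rights",           # contractual rights to inspect and demand updates
    "monitoring_sla",         # monitoring commitments from the vendor
    "remediation_pathway",    # what happens if outputs are biased or inaccurate
    "version_history",        # versions plus change-notification terms
]

def evaluation_file_gaps(on_file: set[str]) -> list[str]:
    """Return the artifacts still missing from a vendor evaluation file."""
    return [a for a in REQUIRED_ARTIFACTS if a not in on_file]
```

Wired into the procurement workflow, a nonempty gap list blocks renewal the same way a failing eval blocks a deploy.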
The pattern that has emerged in 2026: carriers renegotiate vendor contracts to include explicit audit rights, version pinning with change notification, and contractual obligations to participate in the carrier's bias monitoring. Vendors that resist these terms get dropped in favor of those that cooperate. The vendor registry, when it arrives, becomes a procurement input rather than a substitute.
For vendors selling into insurance, the implication is that producing registry-quality documentation now is a competitive advantage. Vendors with documented model cards, validation evidence, bias testing artifacts, and clear update cadences are positioned for the registry; carriers will prefer them in renewals.
Where the loop usually breaks
These failure patterns show up repeatedly across insurance AI products deployed in 2025 and 2026.
Inventory is incomplete. The carrier has many AI models in production, but no central registry. Different business units track their own. When the examiner arrives, the inventory is reconstructed under deadline pressure with inevitable gaps.
Validation evidence is thin. Models in production for years have no documented pre-deployment validation. Backfilling validation evidence after the fact is hard; sometimes the team that did the original work has left.
Continuous monitoring is dashboards without alerts. Metrics exist but no one watches them. Drift goes undetected for months. Bias issues surface in audits rather than internal alerts.
The AI is treated as the decisioner. Operational metrics measure adjusters or underwriters against AI agreement rather than against ground truth. AI deviation gets penalized; conformity gets rewarded. Over time, the AI is functionally the decisioner, with all the legal exposure that creates.
Audit trail is logs, not lineage. Logs of model calls exist but cannot be reassembled into coherent decision histories. Discovery requests take months instead of days. Litigation exposure expands with discovery cost.
Bias monitoring is annual. The annual audit is the only time bias is measured. Drift between audits goes undetected. The same disparate impact patterns reappear year after year.
Vendor oversight is contractual, not operational. The MSA mentions audit rights but the carrier never exercises them. When the vendor's model regresses, the carrier finds out from outcome metrics rather than vendor notification.
Demographic data is reachable from inference paths. No infrastructure-level isolation. Models could be using demographic data even if the team believes they are not. Disparate treatment exposure is direct.
Reversal feedback loops do not exist. When claims are appealed and reversed, the original AI recommendation is not analyzed against the corrected disposition. The model continues operating with persistent error patterns until external pressure forces remediation.
What to expect in the next twelve months
Trends to plan around:
The AI Evaluation Tool gets adopted. Fall 2026 NAIC National Meeting is the anticipated milestone. The pilot states' feedback shapes the version that becomes standard. Adoption in additional states follows over 12-18 months. By the end of 2027, most carriers face structured AI examinations.
Vendor registry framework matures. The NAIC's Third-Party Working Group continues drafting through 2026. First state implementations land in late 2026 or early 2027. Vendor procurement shifts in response.
EO 14365 litigation reaches appellate courts. The federal-state preemption question gets meaningful judicial attention. Carriers monitor and continue dual compliance regardless of outcome.
Health insurance AI litigation expands. Estate of Lokken v. UnitedHealth proceeds through discovery. New cases against other major payors emerge. The pattern of AI claims algorithms generating litigation exposure becomes the dominant insurance AI legal narrative.
Colorado leads, others follow. Colorado SB 21-169's expansion to auto and health, plus the broader Colorado AI Act, makes Colorado the strictest binding constraint for many carriers. Other states adopt similar frameworks. Building to the Colorado standard becomes the practical national strategy.
EU AI Act compliance window. Phased through 2026 with full effective date August 2026. Insurance pricing is high-risk; carriers operating in the EU need conformity assessment infrastructure.
Agentic underwriting becomes standard. Cytora Autopilot, Sixfold AI Underwriter, and competitors push the boundary toward end-to-end automated workflows. Carriers that adopt these efficiently scale; carriers that delay lose competitive position.
Adjuster augmentation, not replacement. Despite vendor messaging, the production deployments at major carriers preserve adjuster judgment. Tools that aid adjusters succeed; tools that replace them face a litigation environment that makes replacement unsustainable.
How to get started
If you are starting an insurance AI build today, the priority order:
1. AIS Program inventory and validation file infrastructure first. Without these, the 2026 examination cycle is painful. With them, everything else is incremental work.
2. Provenance and audit trail as foundational architecture. Field-level grounding from ingestion onward; immutable decision records; queryable indexes. Build this before adding features.
3. Demographic data isolation in infrastructure. Schema separation and access controls enforced at the infrastructure layer. Code review and CI checks for join violations.
4. Continuous bias monitoring with alerts. Selection rates, pricing parity, claim outcomes per protected group. Weekly cadence; alerts on threshold breaches.
5. Pick one pattern, ship it deep. Submission processing, claims processing, document extraction, or compliance copilot. Resist building all at once.
6. AI as decision support architecture, not just policy. Operational metrics measure decisions against ground truth, not against AI agreement. Adjuster judgment is protected as the system of record.
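The CI check for join violations mentioned in the demographic-isolation step can start as a crude static scan of inference-path source. A deliberately simple sketch (directory layout and column names are assumptions); it is a tripwire, not a substitute for schema-level isolation and access controls:

```python
import pathlib
import re

# Columns that must never be referenced from inference-path code.
# Illustrative set; the real list comes from the carrier's data governance policy.
DEMOGRAPHIC_COLUMNS = {"race", "ethnicity", "gender", "religion", "national_origin"}

def find_join_violations(inference_dir: str) -> list[tuple[str, int, str]]:
    """Scan inference-path Python files for references to demographic columns.

    Returns (file path, line number, offending line) triples. Crude by
    design: word-boundary matching catches column names in SQL strings,
    dataframe lookups, and config literals alike.
    """
    pattern = re.compile(
        r"\b(" + "|".join(sorted(DEMOGRAPHIC_COLUMNS)) + r")\b",
        re.IGNORECASE,
    )
    violations = []
    for path in pathlib.Path(inference_dir).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if pattern.search(line):
                violations.append((str(path), lineno, line.strip()))
    return violations
```

A nonempty result fails the CI job, which forces the conversation about why inference code is touching a demographic field at all.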
The detailed engineering depth lives in the spoke posts:
- The NAIC AI Evaluation Tool: Engineering for the 2026 Pilot
- Building Claims AI Without Becoming the Next nH Predict
- Evaluating Underwriting LLMs
- Building an AI Claims Processing Agent
How Respan fits
The five architectural patterns above (AIS Program foundation, decision support, provenance, bias monitoring, vendor oversight) all rest on the same observability and evaluation substrate, and Respan is built to be that substrate for insurance AI teams. From submission intake through claims disposition, every model call needs to be traced, evaluated, gated, versioned, and monitored in ways an NAIC examiner or a discovery request can interrogate.
- Tracing: every underwriting submission, FNOL intake, coverage determination, and document extraction captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When the Estate of Lokken-style discovery order arrives asking for model versions, prompts, retrieved context, and human review on a specific claim, indexed traces turn months of forensics into seconds of query.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on coverage misclassification, hallucinated policy language, fabricated citations to source documents, disparate selection rates per protected group, and reversal-on-appeal patterns before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Demographic data isolation, PII redaction, and per-tenant routing are enforced at the gateway boundary so models never see fields they should not, and Tier 1 model swaps happen without code changes during examination prep.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Underwriting triage prompts, FNOL intake prompts, coverage determination prompts, fraud-flag rationale prompts, and adjuster-facing summary prompts all belong in the registry so each version is reviewable, attributable, and rollback-able when an examiner asks who approved what and when.
- Monitors and alerts: selection rate per protected group, pricing parity, claim denial rates per group, settlement amounts, time-to-resolution, appeal and reversal rates, citation grounding rate, AI-vs-adjuster disposition agreement. Slack, email, PagerDuty, webhook. Drift gets caught between audits rather than at them, which is the difference between an internal ticket and a market conduct finding.
A reasonable starter loop for insurance AI builders:
- Instrument every LLM call with Respan tracing including submission spans, retrieval spans over policy and prior-claim context, model-version spans, and adjuster-disposition spans.
- Pull 200 to 500 production submissions, FNOLs, or coverage determinations into a dataset and label them for risk selection accuracy, citation grounding, refusal correctness, and disparate impact.
- Wire two or three evaluators that catch the failure modes you most fear (AI recommendations becoming determinative against adjuster judgment, hallucinated citations to nonexistent policy sections, persistent disparate impact across protected groups).
- Put your underwriting triage, FNOL intake, coverage determination, and adjuster summary prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so demographic fields stay isolated from inference paths, PII is redacted before leaving carrier infrastructure, and Tier 1 vendor model swaps happen behind a stable interface.
In a regime where the NAIC AI Evaluation Tool is operationalizing examinations and discovery orders are reshaping claims AI litigation, the carriers and vendors that win are the ones whose observability backbone produces examination-ready and discovery-ready artifacts as a byproduct of how the systems run.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
