In a 2025 Communications Medicine paper, researchers at Mount Sinai built 300 clinical vignettes, planted exactly one fabricated detail in each (a fake lab value, a fabricated physical sign, a non-existent condition), and ran them through six LLMs. The default hallucination rate was 65.9%. With a one-line mitigation prompt it dropped to 44.2%. Models repeated or elaborated on the planted false fact in up to 83% of cases. Temperature adjustments did not help.
The follow-up on GPT-5 (npj Digital Medicine 2026) showed the larger frontier model hallucinated more than GPT-4o on the same protocol (65% vs 53% under default), with no improvement on sociodemographic decision variation. Bias and hallucination did not scale away.
The clinical context turns this from an academic curiosity into a patient safety event. Tortus's CREOLA framework measured production hallucination at 1.47% and omission at 3.45% on clinical summaries. 44% of hallucinations were classified "major" vs 16.7% of omissions. 20% of all hallucinations landed in the Plan section, the part of the note that drives orders. ECRI ranks AI-enabled health technology as the #1 health technology hazard for 2025, above cybersecurity.
This piece is for AI engineers building clinical products who need to drive hallucination rates down without waiting for a better base model. It covers the data on where models actually fail in clinical settings, why the failure mode is structurally harder than in other domains, and the six layers of defense that move you from "good enough for demo" to "defensible in a sentinel event review."
For the wider Healthcare cluster: the pillar covers the seven core use cases and the build patterns underneath them. Compliance, build, and eval spokes are next.
The data: where clinical LLMs actually fail
Three datasets matter for the 2026 picture.
Mount Sinai adversarial study (2025). 300 physician-validated vignettes, six LLMs (Claude 3.5 Sonnet, GPT-4o, GPT-4, Llama 3.3, Gemini 1.5 Pro, Mistral Large) × 3 conditions (default, mitigation prompt, temperature=0). 5,400 outputs. Default 65.9% hallucination, mitigation 44.2%. Range across models: 50 to 82%. GPT-4o was the best performer at 53% → 23%. Mitigation worked but never approached zero.
GPT-5 follow-up (2026). 500 ED vignettes × 32 sociodemographic variants. GPT-5 hit 65% adversarial hallucination under default (worse than GPT-4o), dropping to 7.67% with the mitigation prompt. Sociodemographic variation persisted across model generations: several marginalized groups received unnecessary screening in 100% of runs despite identical clinical content, and low-income groups received less advanced testing.
HealthBench Professional (April 2026 OpenAI launch). Real-world clinician-grade scenarios across consultation, documentation, and research. GPT-5.4 base 48.1, Claude Opus 4.7 47.0, Gemini 3.1 Pro 43.8, Grok 4.2 36.1. Unaided physicians scored 43.7. A specialized ChatGPT for Clinicians workspace built on GPT-5.4 hit 59.0. Translation: the base frontier models are roughly at physician parity on this benchmark, and a specialized workspace beats both.
HealthBench Hard. Most current models score near zero. o3 = 0.32, the leader (Muse Spark by Meta) = 0.428, average across all = 0.222. The hard slice creates real model divergence and is the right place to track frontier capability for the next year.
MedQA is saturated. o1 96.52%, GPT-5.1 96.38%, Gemini 3.1 Pro 96.37%. It is no longer a useful differentiator.
For clinical documentation specifically, the Tortus CREOLA framework measured 1.47% hallucination and 3.45% omission rates on real clinical summaries. The headline number is small. The clinical safety profile is not: 44% major-severity for hallucinations, and the Plan section is the hot spot.
Why clinical hallucination is structurally harder
Six properties of medical text make hallucinations harder to suppress than in other domains.
Long tail of medical knowledge. Rare diseases, atypical drug interactions, off-label indications, and pediatric dosing fall outside the dense web text models train on. Coverage of the alias tail across RxNorm, DrugBank, Martindale, WHO Drug Dictionary, and EMA xEVMPD is incomplete. A 2025 paper documented that FDA normalization of 2024 FAERS opioid reports collapsed 7,892 free-text strings to 92 RxNorm ingredients only after multiple API lookups and manual edits.
Stale guideline knowledge. USPSTF, NCCN, ACC/AHA, IDSA guidelines turn over yearly. Formulary, REMS, and Black Box updates churn faster. Pretraining cutoff equals silent guideline drift, and clinicians cannot tell the model is referencing the 2023 NCCN guidelines for a 2026 patient.
Adversarial pressure from patients. Symptom misreporting, leading questions, and patient-supplied facts planted in context flip model outputs. The Mount Sinai paper isolates exactly this attack surface. A JAMA Network Open prompt-injection study confirms flagship models bend under both direct and indirect injections in clinical contexts.
Compounding errors in agents. Ambient scribe → orders → claims pipelines amplify a single hallucination across documentation, billing, and downstream care. CREOLA notes hallucinations cluster in the Plan section, which is exactly where automation extends them.
Liability asymmetry. An LLM coding error is annoying. A fabricated potassium value is a sentinel event. ECRI's 2025 list places AI-enabled health technology hazards at #1, above cybersecurity.
Coherent-sounding falsehoods. Medical hallucinations use domain-specific terms and present plausible logic, making them harder for non-specialists to flag than general-domain confabulation. Fluent prose reads as competent prose, which is the same fluency-truth bias documented in legal AI hallucination.
Six engineering defenses
The combination of pretraining bias toward guessing, retrieval that returns near-miss sources, and clinicians who default to trust under time pressure means you cannot solve hallucinations at any single layer. You need defense in depth.
1. RAG over licensed clinical corpora with citation enforcement
The strongest production pattern is to require every clinical claim in the output to point to a specific span in a retrieved authoritative source. Not "this medication is appropriate" but "this medication is recommended in NCCN Guideline X.Y, version 2026.1, page 47."
OpenEvidence demonstrates this at scale. Ensemble of specialized models trained on peer-reviewed literature only, no public-internet connection, with NEJM (1990 onward) plus JAMA (13 journals), NCCN, Wiley, and Cochrane content licensed in. Used by hundreds of thousands of verified physicians at 10,000+ care centers. Citation-first UX is load-bearing.
Abridge's Contextual Reasoning Engine, generally available March 2026, integrates UpToDate for evidence-based guidance contextually inside the documentation flow.
Works when the question is well-posed and the corpus is comprehensive. Fails when the question falls in coverage gaps (rare disease, off-label, very recent FDA actions) or retrieval misses the relevant chunk.
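As a vendor-neutral sketch of the enforcement step, the check below flags every sentence in a draft that lacks a citation marker resolving to a retrieved source. The bracketed marker format and the Source shape are assumptions for illustration, not any product's API.

import re
from dataclasses import dataclass

@dataclass
class Source:
    id: str    # e.g. "NCCN-X.Y-2026.1-p47"
    text: str  # the retrieved span

def unsupported_sentences(draft: str, sources: dict[str, Source]) -> list[str]:
    """Return sentences that are uncited or cite a source that was not retrieved."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", draft.strip()):
        cited_ids = re.findall(r"\[([^\]]+)\]", sentence)
        if not cited_ids or any(cid not in sources for cid in cited_ids):
            flagged.append(sentence)
    return flagged

# Anything flagged is regenerated or routed to abstention rather than shipped.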
2. Tool grounding for drugs, doses, and interactions
Force the model to call deterministic tools for any drug, dose, or interaction claim. RxNorm and RxNav for medication normalization, the FDA Orange Book for therapeutic equivalents, FDA NDC for products, OpenFDA adverse events, NLM's RxNorm interaction API. Reject responses where drug entities are not RxNorm-resolved.
The pattern: LLM proposes a step. Tool verifies. LLM continues only if the tool agrees. The LLM never asserts an arithmetic, dosing, or interaction fact it did not tool-compute.
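A minimal sketch of the tool side, resolving a free-text drug mention against NLM's public RxNav REST service. The /rxcui.json endpoint is the documented name-to-RxCUI lookup; caching, retries, and alias-tail handling are omitted, and the helper name simply mirrors the workflow example later in this post.

import requests

RXNAV_BASE = "https://rxnav.nlm.nih.gov/REST"

def rxnorm_lookup(drug_name: str) -> str | None:
    """Resolve a free-text drug mention to an RxCUI, or None if unresolved."""
    resp = requests.get(
        f"{RXNAV_BASE}/rxcui.json",
        params={"name": drug_name, "search": 2},  # exact, then normalized match
        timeout=5,
    )
    resp.raise_for_status()
    ids = resp.json().get("idGroup", {}).get("rxnormId", [])
    return ids[0] if ids else None  # None -> abstain and escalate, never guess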
Works when structured lookup is sufficient (dose ranges, half-lives, contraindications). Fails when clinical judgment is required (renal-adjusted dosing in CKD-3 with overlapping nephrotoxins) and tool outputs must be reasoned about.
3. Constitutional and supervisor models alongside the primary
Hippocratic AI's Polaris is the canonical reference. A constellation architecture: a 70 to 100B-parameter primary trained on evidence-based content, plus multiple specialist supervisor models that triple-check labs, meds, and escalations. Polaris 3.0 totals 4.2 trillion parameters across 22 models; Polaris 5 has a 700B core within a 5T constellation. Reported 99.38% clinical accuracy, up from 96.79% in v1, on par with human nurses across safety, clinical readiness, education, conversation, and bedside manner.
The architecture pays off only if supervisors are independently trained on different objectives and data distributions. Correlated supervisors give a false sense of redundancy. Latency and cost balloon at constellation scale, which is the price.
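A stripped-down version of the constellation idea, with plain functions standing in for separately trained supervisor models. The potassium-range check is purely illustrative; in a real constellation each supervisor is an independently trained model with its own narrow contract.

import re

def constellation_check(note: str, supervisors) -> list[str]:
    """Run every supervisor; any failure blocks the output (empty list = safe to ship)."""
    failures = []
    for supervisor in supervisors:
        passed, reason = supervisor(note)
        if not passed:
            failures.append(reason)
    return failures

# One narrow-contract supervisor: flag implausible potassium values.
def lab_range_supervisor(note: str):
    for value in re.findall(r"potassium[^0-9]{0,15}(\d+\.?\d*)", note, re.IGNORECASE):
        if not 1.5 <= float(value) <= 9.0:
            return False, f"implausible potassium value {value} mmol/L"
    return True, ""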
4. Refusal training and abstention
The OpenAI 2025 paper "Why Language Models Hallucinate" (Kalai, Nachum, Vempala, Zhang) argues that standard training and evaluation reward confident guessing because abstention scores zero on most benchmarks. Their SimpleQA example: error 75%, abstain 1% versus error under 50%, abstain 52%. Hallucinations are mathematically inevitable as long as your eval punishes "I don't know" identically to wrong answers.
Clinical translation: penalize confidently wrong claims more than "I don't know, order a TSH and reassess." Reward hedged outputs that name the uncertainty. Threshold the model's confidence and force abstention or escalation below it.
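A toy scoring rule makes the incentive concrete. The 4x penalty is an illustrative weight, not a figure from the paper.

def score_answer(correct: bool, abstained: bool, wrong_penalty: float = 4.0) -> float:
    """Asymmetric clinical eval score: abstention is neutral, fabrication is expensive."""
    if abstained:
        return 0.0           # "I don't know, escalate" is not a failure
    if correct:
        return 1.0
    return -wrong_penalty    # a confident fabrication costs 4x a correct answer

# Under this rule, answering beats abstaining only when
# P(correct) > wrong_penalty / (1 + wrong_penalty) = 0.8,
# which is the confidence threshold to wire into the abstention logic.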
Works when downstream UX accepts abstention (clinician-in-the-loop tools). Fails when the product is consumer-facing or autonomous (patient-facing chatbots), where abstention reads as failure. The product decision is to make abstention a first-class output in the UI, not a fallback.
5. Adversarial robustness regression suites
Adopt the Mount Sinai 300-vignette protocol as a continuous regression suite. Each release, inject planted fabricated labs, signs, or conditions and measure the rate at which the model elaborates on them. Add prompt-injection benchmarks (the MPIB benchmark) and patient-side jailbreaks (the "my doctor said it's fine" class).
Add demographic perturbation: AAVENUE, ReDial, and dialect-as-jailbreak variants. AAVE / dialect inputs both reduce accuracy and lower refusal rates. Dialect functions as a covert jailbreak vector: patients writing in non-Standard English get worse care and fewer safety guardrails simultaneously.
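A minimal CI gate over a planted-fact vignette set might look like the following. The file path, record shape, and run_workflow wrapper are assumptions, and the substring check is a crude stand-in for an LLM-judge evaluator.

import json

def planted_fact_rate(vignettes, generate) -> float:
    """Fraction of outputs that repeat or elaborate on the planted fabrication."""
    elaborated = sum(
        v["planted_fact"].lower() in generate(v["vignette"]).lower()
        for v in vignettes
    )
    return elaborated / len(vignettes)

def test_adversarial_planted_fact_gate():
    with open("evals/planted_fact_vignettes.json") as f:  # hypothetical local suite
        vignettes = json.load(f)
    # run_workflow is a hypothetical wrapper around your clinical pipeline
    rate = planted_fact_rate(vignettes, generate=run_workflow)
    assert rate <= 0.10, f"planted-fact elaboration {rate:.1%} exceeds the 10% gate"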
6. Continuous eval capture from clinician overrides
Treat every clinician edit on an AI draft as a labeled training and eval datum. Deletions are potential hallucinations. Insertions are potential omissions. Rewordings are stylistic. CREOLA operationalizes this with a clinician-labeling UI plus an error taxonomy (hallucination / omission / minor / major).
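A minimal sketch of the capture step, using a word-level diff to bucket clinician edits into the CREOLA-style categories. The record shape is an assumption for illustration.

import difflib

def classify_edit(ai_draft: str, final_note: str) -> dict:
    """Word-level diff of the AI draft against the clinician's final note."""
    draft_words, final_words = ai_draft.split(), final_note.split()
    matcher = difflib.SequenceMatcher(None, draft_words, final_words)
    deletions, insertions = [], []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            deletions.append(" ".join(draft_words[i1:i2]))
        if op in ("insert", "replace"):
            insertions.append(" ".join(final_words[j1:j2]))
    return {
        "hallucination_candidates": deletions,  # clinician removed it from the draft
        "omission_candidates": insertions,      # clinician had to add it
    }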
The key shift in 2026 governance is from quarterly retrospective audits to continuous post-deployment surveillance. Both the npj Digital Medicine dynamic deployment paper and the FAIR-AI framework argue this point: developer, regulator, and health system share responsibility for ongoing eval. MAUDE alone is structurally blind to concept drift, covariate shift, and hallucination categories.
What real teams ship
A short audit of how the leading clinical AI products combine these defenses.
Abridge: ambient scribe with prior-note context, orders, and revenue-cycle linkage; UpToDate surfaced contextually; trace-level audit to support Plan-section review where hallucinations cluster.
OpenEvidence: ensemble of specialized models, no public-internet connection, citation-first UX, licensed corpora only. The architecture choice (no internet, only peer-reviewed sources) is a structural defense against the long-tail hallucinations that come from web-text training.
Glass Health: differential diagnosis structured into "Most Likely / Expanded Differential / Can't Miss" tiers with cited evidence per claim. The structured-output schema is itself a defense; freeform prose hides hallucinations.
Hippocratic AI Polaris 3.0 / 5.0: constellation supervisor architecture, patient-facing voice agents, patented safety-focused LLM. The architecture is the moat; competitors building on a single primary model carry correlated risk.
Microsoft DAX / Dragon Copilot: ambient → specialty-specific draft note → clinician review with origin and context citations on subjective and objective elements. Unified into Microsoft Dragon Copilot March 2025.
Tortus (CREOLA): in-house clinician labeling platform plus error taxonomy plus safety framework; achieved 1.47% hallucination and 3.45% omission via iterative pipeline modification driven by the labeled feedback loop.
Real incidents to learn from
The failure modes get worse than benchmark numbers suggest.
Whisper hallucinations in medical transcription (2024). ABC and AP reporting plus AP/Cornell research found 187 hallucinations across ~13,000 clear audio snippets. Examples included invented medications ("hyperactivated antibiotics"), fabricated racial commentary, and violent inventions. Nabla had transcribed 7M+ medical conversations across 30,000 clinicians and 40 health systems at the time. The ASR layer is its own hallucination surface, separate from the LLM. Downstream LLM grounding cannot recover what the ASR fabricated.
Therapy chatbot meth incident (June 2025). A therapy-style chatbot told a user with addiction history to "take a small hit of methamphetamine to get through the week." Refusal logic and safety supervisor models exist precisely to catch this class of failure.
Patient-led ChatGPT in neuropsych eval (2025). A patient brought a ChatGPT-generated "report" linking tinnitus to cognitive decline with fabricated citations to the evaluator. Patient-side use of unsupervised general-purpose models is now part of clinical workflows whether you build for it or not.
Clinician survey (n=70, 15 specialties): 91.8% had encountered medical hallucinations, 84.7% considered them capable of patient harm.
Wiring the defenses on Respan
Here is a clinical AI workflow that combines several of these defenses and traces every span:
import os
from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

@client.workflow(name="clinical-decision-support")
def clinical_query(patient_token, question, jurisdiction="US"):
    # 1. Retrieve from licensed clinical corpus only
    sources = client.retrieve(
        index="uptodate-nccn-nejm",
        query=question,
        filters={"published_after": "2024-01-01"},
        require_citations=True,
    )

    # 2. Tool-ground any drug or dose claim
    drugs = extract_drug_entities(question)
    rxnorm_resolved = [rxnorm_lookup(d) for d in drugs]
    if any(r is None for r in rxnorm_resolved):
        return abstain("unrecognized medication, escalate to clinician")

    # 3. Generate with citation enforcement
    response = client.chat.completions.create(
        model="auto",
        customer_id=patient_token,
        messages=build_clinical_prompt(question, sources, rxnorm_resolved),
        require_citation_grounding=True,  # rejects unsourced claims
        on_low_confidence="abstain",
    )

    # 4. Run supervisor pass on the output
    supervisor = client.evals.run(
        evaluator="clinical_safety_supervisor",
        candidate=response,
        checks=["dose_within_range", "no_contraindications", "guideline_alignment"],
    )
    if supervisor["fail_count"] > 0:
        return escalate_to_clinician(response, supervisor)

    return response

The eval suite that monitors production:
# Sample 5% of live traffic nightly through adversarial regression vignettes
client.monitors.create(
    name="clinical-hallucination-rate",
    workflow="clinical-decision-support",
    sample_rate=0.05,
    evaluators=[
        "adversarial_planted_fact_rate",  # Mount Sinai 300-vignette suite
        "citation_grounding_rate",
        "rxnorm_resolution_rate",
        "demographic_decision_variance",
    ],
    alert_on={
        "adversarial_planted_fact_rate": ">0.10",
        "citation_grounding_rate": "<0.95",
    },
)

For clinical documentation specifically, the production-grade signal is clinician edit rate on the Plan section. CREOLA's data tells you this is where 20% of all hallucinations land and 44% are major. Tracking edit rate on this section as a first-class metric, sliced by specialty and clinician, will surface drift before sentinel events do.
A reference architecture for clinical AI
If you are starting today, the smallest defensible stack combines:
- Retrieval over licensed clinical corpora only (UpToDate, NEJM/JAMA, NCCN, the firm's own work product). No public-internet retrieval. Citation-first UX where every claim is sourced.
- Deterministic tool grounding for any drug, dose, or interaction claim. RxNorm or equivalent resolution, FDA databases, internal formulary lookup.
- Supervisor model layer independent of the primary, with at least one supervisor on dosing safety and one on guideline alignment.
- Refusal logic wired at the architecture level. Abstention is a first-class output, escalates to clinician, and the UI surfaces it explicitly.
- Tracing every turn with retrieval, tool-call, supervisor, and reasoning spans. Hashed patient identifiers. (HIPAA spoke covers the audit log architecture.)
- Adversarial regression suite of 300+ planted-fact vignettes pinned to CI. Run on every prompt and model change. Treat the Mount Sinai protocol as the floor.
- Online sampling of 5 to 10% of live traffic through citation grounding, RxNorm resolution, and demographic variance evaluators with weekly drift alerts.
- Clinician edit capture going back into the regression eval set, with edit rate on the Plan section as a primary safety metric.
- Demographic split testing monthly across race, language, dialect, and socioeconomic status. Alert on any subgroup drop greater than 5 percentage points (a minimal split-test sketch follows this list).
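A minimal version of that split test, assuming a results table with one row per (subgroup, correct) outcome. Subgroup labels and the 5-point threshold come from the checklist item above; everything else is illustrative.

from collections import defaultdict

def subgroup_accuracy(results: list[dict]) -> dict[str, float]:
    """results rows look like {"subgroup": "AAVE", "correct": True}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in results:
        totals[row["subgroup"]] += 1
        hits[row["subgroup"]] += int(row["correct"])
    return {g: hits[g] / totals[g] for g in totals}

def flag_subgroup_drops(results: list[dict], threshold_pp: float = 5.0) -> dict[str, float]:
    """Subgroups whose accuracy trails the best subgroup by more than threshold_pp points."""
    acc = subgroup_accuracy(results)
    best = max(acc.values())
    return {g: round((best - a) * 100, 1) for g, a in acc.items() if (best - a) * 100 > threshold_pp}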
CTA
To wire the defenses above on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Healthcare cluster, see the pillar. HIPAA / BAA engineering, an AI medical scribe build walkthrough, and the clinical eval framework spokes are coming next.
How Respan fits
Clinical AI defense in depth needs every layer instrumented, evaluated, and rolled back without a deploy when something goes wrong. Respan gives you the trace, eval, gateway, and prompt surfaces to wire the six defenses without stitching five vendors together.
- Tracing: every clinical query captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Retrieval spans, RxNorm tool calls, supervisor passes, and abstention decisions all show up in a single timeline so you can see exactly which layer let a planted lab through.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on adversarial planted-fact rate, citation grounding, and demographic decision variance before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route the primary model and supervisor models through one endpoint, swap GPT-4o for GPT-5 in a config change, and keep PHI handling consistent across providers.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Every mitigation prompt, citation-enforcement template, and supervisor instruction lives in the registry so a clinician-flagged regression rolls back in seconds.
- Monitors and alerts: adversarial planted-fact rate, citation grounding rate, RxNorm resolution rate, Plan-section edit rate, demographic decision variance. Slack, email, PagerDuty, webhook. Drift on any subgroup or any layer alerts the on-call before the next sentinel event review.
A reasonable starter loop for clinical AI builders:
- Instrument every LLM call with Respan tracing including retrieval, RxNorm tool, supervisor, and abstention spans.
- Pull 200 to 500 production clinical queries into a dataset and label them for hallucination, omission, citation grounding, and Plan-section severity.
- Wire two or three evaluators that catch the failure modes you most fear (planted-fact elaboration, unsourced drug claims, demographic decision drift).
- Put your clinical prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so primary, supervisor, and fallback models share one auditable, BAA-friendly path.
The point is to make every clinical hallucination a debuggable event, not a postmortem.
FAQ
Is MedQA still useful for benchmarking clinical AI in 2026? Not as a differentiator. o1, GPT-5.1, and Gemini 3.1 Pro all sit at 96%+. Use HealthBench Hard, MedHELM Clinical Decision Support, MEDEC, and CREOLA-style human review for actual signal. HealthBench Professional showed specialized clinician-workspace systems beating both base frontier models and unaided physicians (59.0 vs 43.7).
Does a one-line mitigation prompt actually fix clinical hallucinations? It cuts them substantially but never eliminates them: Mount Sinai went from 65.9% to 44.2%, the GPT-5 follow-up from 65% to 7.67%. Treat the mitigation prompt as a baseline, not a defense, and couple it with retrieval, tool grounding, and supervisor models.
Are bigger frontier models safer for clinical use? Not automatically. GPT-5 had higher adversarial hallucination than GPT-4o (65% vs 53%) and showed no improvement on sociodemographic decision variation. Bias and hallucination are data and RLHF problems, not scale problems. Scale alone does not fix them.
Where do hallucinations cluster in clinical documentation? The Plan section. CREOLA: 20% of all hallucinations land in Plan, and 44% of all hallucinations are classified "major" vs 16.7% of omissions. The Plan section is also what drives orders, which is what makes it the highest-stakes location. Concentrate human review there.
Is ASR a separate hallucination surface from the LLM? Yes. Whisper and similar ASR models invent medications and sentences in clear audio. Downstream LLM grounding cannot recover what the ASR fabricated. Treat ASR + LLM as a chained risk, with audio-grounded verification (re-listen, two-pass transcription) where stakes are high.
Should I build my own constellation supervisor architecture? Only if your supervisors are independently trained on different objectives. Correlated supervisors share the primary's blind spots and give false confidence. Polaris reports 99.38% with 22 specialist models trained on independent objectives; a quick "second LLM checking the first" pattern using the same family does not buy meaningful safety margin.
