In April 2026, Sullivan & Cromwell apologized to a federal bankruptcy judge for filing a brief that cited cases an AI tool had invented. They were not the first AmLaw firm to do this in 2026, and they will not be the last. The Charlotin database, which catalogs court filings affected by AI-generated fabrications, now lists more than 1,350 cases globally. The pace doubled in the first quarter of 2026.
Total sanctions for AI-fabricated citations in Q1 2026 alone reached approximately $145,000 across U.S. courts. Oregon issued a single $110,000 sanction, the largest on record. Nebraska handed down the first known license suspension tied directly to AI fabrications. Sixth Circuit attorneys paid $30,000 in sanctions in March 2026, with the court noting that signing a brief makes you liable for its contents regardless of which tool produced them.
If you build legal AI products, the urgency here is not abstract. Every one of these incidents traces back to an engineering choice: a retrieval pipeline that returned irrelevant cases, a generation step that invented citations to fill a gap, an absent verification layer, an audit trail nobody could reconstruct after the fact. The legal industry has decided that AI hallucinations are no longer the AI vendor's problem. They are the lawyer's problem, and the lawyer's problem rolls back uphill to whoever built the tool.
This post is for the engineers building legal AI today. It covers why hallucinations persist even with RAG and legal-specific tooling, what the 2024 Stanford RegLab evaluation actually showed, and the six engineering fixes that move you from "good enough for demo" to "defensible in court."
The hallucination floor is higher than vendors admit
The widespread assumption in 2023 was that retrieval-augmented generation would solve legal hallucinations. Ground the model in case law, the thinking went, and it stops making things up. LexisNexis went so far as to market Lexis+ AI as delivering "100% hallucination-free linked legal citations." Thomson Reuters made similar claims about Westlaw AI-Assisted Research.
Stanford's RegLab and HAI tested those claims in a 2024 preprint that has since become the benchmark reference. They evaluated Lexis+ AI, Westlaw AI-Assisted Research, and Thomson Reuters Practical Law AI on legal research queries. The headline result: even these legal-specific RAG systems hallucinated on more than 1 in 6 queries. They were better than general-purpose models like GPT-4, which the same study found fabricate citations on 30 to 45 percent of legal research responses depending on query specificity. But "better than GPT-4" is not the bar.
What the Stanford team also documented is more useful for engineers than the headline number. The failure modes split into two categories: outright fabrication (cite to a case that does not exist) and misalignment (cite to a real case that does not actually support the asserted proposition). The misalignment failure is much more dangerous in production because it is harder to catch in casual review. A real case name, a real reporter cite, a real court, a real year. The cite passes a Bluebook check and a quick Westlaw lookup. It just does not stand for what the brief claims it stands for.
The implication for builders: a citation existence check is necessary but not sufficient. You need a separate alignment check, and that alignment check has to verify the holding against the cited proposition. This is much harder than running a regex against a court database.
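To make the gap concrete, here is a minimal sketch of a format-level check, with a deliberately simplified reporter pattern and a hypothetical helper name. A fabricated cite sails through it, because fabricated cites are syntactically perfect:

```python
import re

# Deliberately simplified reporter pattern, for illustration only; real
# citation validation involves far more than format.
CITE_PATTERN = re.compile(
    r"[A-Z][\w.'& -]+ v\. [A-Z][\w.'& -]+, "
    r"\d+ (?:U\.S\.|F\.(?:2d|3d|4th)) \d+"
)

def looks_like_a_citation(text: str) -> bool:
    """Format check only: True for any well-formed cite, real or invented."""
    return CITE_PATTERN.search(text) is not None

# A made-up cite passes exactly as easily as a real one.
assert looks_like_a_citation("Brown v. Board of Education, 347 U.S. 483")
assert looks_like_a_citation("Smith v. Jones, 123 F.3d 456")  # fabricated
```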
Why this is structurally hard
Three properties of legal text make hallucinations especially hard to suppress.
First, the long tail. Legal questions cluster around well-trodden statutes and famous cases at the head of the distribution, but real practice is overwhelmingly novel fact patterns slotted into obscure precedent. The model has seen the famous cases hundreds of times in pretraining. It has seen the obscure ones once, or in a casebook footnote, or never. When the retrieval system returns a near-miss case rather than the on-point one, the model's instinct is to bridge the gap with synthesis, which is where fabrication enters.
Second, the reward structure. OpenAI's September 2025 paper on why models hallucinate pointed at a structural culprit: models trained on benchmarks where abstention scores zero learn that guessing is strategically dominant. A model that says "I don't know" 10 times scores worse on standard accuracy benchmarks than a model that guesses 10 times and gets 1 right by luck. This is true across domains, but legal is where the cost of guessing is asymmetric and large.
Third, fluency masks errors. Lawyers and judges read fluent prose as competent prose. The "fluency-truth effect" documented in 2026 cognitive science research applies here directly: when an AI-generated paragraph reads like a well-edited associate's brief, the reader extends trust to the citations in it. Casual review breaks down. The hallucinations that survive review are the ones that read most plausibly, which is also the kind of hallucination LLMs produce best.
Six engineering fixes
The combination of pretraining bias toward guessing, retrieval that returns near-miss precedent, and human reviewers who default to trust means you cannot solve hallucinations at any single layer. You need defense in depth. Below are the six layers that, together, get you from "1 in 6 hallucinations" to something a litigator can sign.
1. Character-level citation grounding
The strongest production pattern is to require every citation in the output to point to a specific span in a retrieved source document. Not just "this case supports this claim" but "this exact sentence in the output is grounded in this exact paragraph of this retrieved case." If you cannot map an output sentence back to a source span, you flag it.
GC AI calls this "character-level citation" in their marketing. Harvey's Vault and Workflow Agents implement a similar grounding model. Implementation typically uses a structured generation pass where the model is constrained to emit citation tokens that reference indices in the retrieved context, then a post-processing step that validates each citation maps to a real source location.
This stops one whole class of failures: pure fabrication of cites that never appeared in the retrieval set. It does not stop misalignment: the cite exists in your context window, but the model used it to support a claim the case does not actually stand for.
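A minimal sketch of that post-processing validation, assuming a hypothetical inline marker format the generator is constrained to emit (the actual token scheme depends on your structured-generation setup):

```python
import re
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    doc_id: int
    text: str

# Hypothetical inline marker format the generator is constrained to emit:
# "... the summary judgment standard [[src:2 chars:140-312]] ..."
MARKER = re.compile(r"\[\[src:(?P<doc>\d+) chars:(?P<start>\d+)-(?P<end>\d+)\]\]")

def validate_grounding(output: str, docs: list[RetrievedDoc]) -> list[str]:
    """Return problems found; every marker must point at a real source span."""
    by_id = {d.doc_id: d for d in docs}
    problems = []
    markers = list(MARKER.finditer(output))
    if not markers:
        problems.append("no citation markers in output")
    for m in markers:
        doc = by_id.get(int(m["doc"]))
        if doc is None:
            problems.append(f"{m.group(0)} cites a doc outside the retrieval set")
        elif int(m["start"]) >= int(m["end"]) or int(m["end"]) > len(doc.text):
            problems.append(f"{m.group(0)} points outside the source text")
    return problems
```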
2. Refusal training and abstention
The OpenAI hallucination paper made a specific recommendation: train models to abstain when confidence is low, and structure your eval so abstention is rewarded over confident wrong answers. For legal AI, this means your fine-tuning data needs explicit "I don't have a clear authority for this" responses, and your judge needs to score those responses positively.
In practice, most legal AI products do not do this. They optimize for completion rate because users perceive abstention as "the AI doesn't work." But abstention is exactly what you want when the user is about to file a brief. The reframing for product is: abstention is not failure, it is the AI signaling "do this part by hand." Make that signal visible to the lawyer instead of hiding it.
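One way to encode this in scoring, sketched below with a crude phrase match standing in for real abstention detection: give abstention partial credit, so guessing only dominates when the model is actually likely to be right.

```python
# Crude phrase match standing in for real abstention detection.
ABSTENTION_MARKERS = (
    "i don't have a clear authority",
    "no controlling authority found",
)

def score_response(answer: str, is_correct: bool) -> float:
    """Abstention-aware scoring. Under plain accuracy, a model guessing with a
    10 percent hit rate expects 0.10 per query while abstaining scores 0.0, so
    guessing dominates. With abstention worth 0.5, guessing only wins when the
    model is right more than half the time."""
    if any(marker in answer.lower() for marker in ABSTENTION_MARKERS):
        return 0.5
    return 1.0 if is_correct else 0.0
```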
3. Multi-stage retrieval
Single-shot retrieval gets you the top-k cases by embedding similarity to the query. That is a weak relevance signal for legal questions, where the query is often a fact pattern and the relevant case is matched by reasoning rather than surface similarity.
Strong production pipelines layer:
- Query expansion (rewrite the user's question into 3 to 5 retrieval queries that cover different angles)
- Hybrid retrieval (BM25 plus dense embedding, with reranking)
- Per-jurisdiction filtering (a 9th Circuit case is not controlling authority for an Eastern District of Texas filing, no matter how on-point the holding)
- Recency reranking (a 2024 case overruling a 1998 case should sort above the 1998 case)
- Final relevance check (LLM-as-judge filters retrieved cases to "actually on point" before they reach generation)
Each layer compounds. A single-shot pipeline at 70 percent recall can climb toward 95 percent across the full multi-stage stack. Hallucinations drop because the model has the right context to answer with rather than the near-miss context it has to bridge.
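For the hybrid step in particular, reciprocal rank fusion is a standard way to merge BM25 and dense rankings without reconciling their score scales. A sketch, with `bm25_search` and `dense_search` as assumed stand-ins for your own indexes:

```python
from collections import defaultdict
from typing import Callable

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge rankings (doc ids, best first) without reconciling score scales:
    each doc earns 1/(k + rank) per list it appears in."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

def hybrid_retrieve(query: str,
                    bm25_search: Callable[[str], list[str]],
                    dense_search: Callable[[str], list[str]],
                    top_k: int = 20) -> list[str]:
    # Lexical and semantic retrieval surface different relevant cases;
    # fusion keeps anything either signal ranks highly.
    return reciprocal_rank_fusion([bm25_search(query), dense_search(query)])[:top_k]
```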
4. Citation existence and alignment verification
After generation, run two automated checks.
Existence check. Every citation in the output is verified against a canonical legal database (Westlaw, Lexis, CourtListener, Caselaw Access Project, or your own indexed corpus). If the cite does not resolve, flag it.
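If you check against CourtListener, its citation-lookup endpoint makes this nearly free. The sketch below follows the request and response shape in CourtListener's public API docs at the time of writing; verify both against the current docs before depending on it:

```python
import requests

# Endpoint and response shape from CourtListener's public API docs at the
# time of writing; verify against current docs before relying on this.
COURTLISTENER_LOOKUP = "https://www.courtlistener.com/api/rest/v3/citation-lookup/"

def cites_that_resolve(text: str, api_token: str) -> dict[str, bool]:
    """Map each citation found in `text` to whether it resolves to a real case."""
    resp = requests.post(
        COURTLISTENER_LOOKUP,
        data={"text": text},
        headers={"Authorization": f"Token {api_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    # Each entry carries a per-citation status: 200 resolved, 404 not found.
    return {c["citation"]: c["status"] == 200 for c in resp.json()}
```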
Alignment check. For each citation that exists, run an LLM-as-judge that takes the cited case text and the asserted claim, and asks: "does this case actually stand for this proposition?" The judge prompt needs careful design to avoid the same fluency-truth effect that fooled the original generator. You want the judge skeptical, not charitable.
The alignment check is where most production legal AI systems are weakest. It is much easier to verify a cite exists than to verify it is on point. But the misalignment failure mode is the one that gets sanctioned, because the cite passes any review that does not pull and read the underlying case. Stanford RegLab's 2024 study explicitly framed misalignment as the harder and more common failure.
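A minimal version of that judge, sketched against the OpenAI client (point it at whatever gateway or model you route through); the prompt defaults to UNSUPPORTED so doubt counts against the citation:

```python
from openai import OpenAI

client = OpenAI()  # point base_url at your gateway if you route through one

JUDGE_PROMPT = """You are a skeptical legal citation checker. You will see a
proposition from a draft brief and the text of the case cited for it.
Answer SUPPORTED only if the case text actually stands for the proposition.
If the case is merely related, or supports a narrower or different point,
answer UNSUPPORTED. When in doubt, answer UNSUPPORTED.

Proposition: {claim}

Cited case text:
{case_text}

Answer with exactly one word: SUPPORTED or UNSUPPORTED."""

def alignment_judge(claim: str, case_text: str, model: str = "gpt-4o") -> bool:
    """True only if the judge finds the cited case supports the claim."""
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(claim=claim, case_text=case_text)}],
    )
    verdict = reply.choices[0].message.content.strip().upper()
    return verdict.startswith("SUPPORTED")
```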
5. Production tracing and audit log
Every legal AI request needs to produce an audit trail that a careful associate could reconstruct months later. Specifically:
- The exact user query
- Which retrieval queries were generated
- Which cases were retrieved (with retrieval scores)
- Which cases the model was given as context
- The model and prompt version
- The full generated output before any post-processing
- Which citations passed or failed verification, and why
- Any human edits applied after generation
This is what gives you defensibility under ABA Formal Opinion 512's candor and supervisory obligations. When a sanctions hearing asks "how did this fabricated cite end up in your brief," you should be able to answer with a trace, not with a shrug.
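As a typed structure, one request's record might look like the sketch below. Field names are illustrative, not a Respan schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ResearchAuditRecord:
    """One reconstructable record per research request. Illustrative fields,
    not a Respan schema."""
    matter_id: str
    timestamp: datetime
    user_query: str
    expanded_queries: list[str]
    retrieved: list[tuple[str, float]]   # (case_id, retrieval score)
    context_case_ids: list[str]          # what the model actually saw
    model: str
    prompt_version: str
    raw_output: str                      # before any post-processing
    citation_flags: dict[str, str]       # case_id -> "ok" | "fabricated" | "misaligned"
    human_edits: list[str] = field(default_factory=list)
```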
In Respan, this is exactly what tracing is built for. A workflow decorator captures every span of the agent loop, persists the inputs and outputs, and lets you query historical traces by matter, attorney, or filing date. A sample appears below in the engineering walkthrough.
6. Continuous eval set capture from production
The single highest-leverage thing legal AI engineers do not do enough of: every hallucination caught in production goes into the eval set. Every lawyer override goes into the eval set. Every "this cite is wrong" feedback signal goes into the eval set.
The reason this matters: model providers ship updates constantly. Stanford RegLab specifically called out that legal AI tool performance fluctuates across model updates in unpredictable ways. The only way you know whether the update broke your grounding is if you have a frozen eval set from your worst real-world failures and you run it on every model upgrade.
This is where the loop closes. Tracing captures the production failure. The failure goes into a dataset. The dataset runs on every model and prompt change. Regressions are caught before deployment. The hallucinations that did happen do not happen twice.
What the engineering loop looks like
Putting the six layers together, the production architecture for a defensible legal research agent looks roughly like this:
```python
# Pseudocode using Respan-style decorators. Verify SDK syntax against current docs.
import respan
from legal_db import westlaw_lookup

@respan.trace(matter_id_param="matter_id")
def legal_research_agent(query: str, jurisdiction: str, matter_id: str):
    # 1. Query expansion
    queries = expand_query(query, n=5)

    # 2. Multi-stage retrieval
    candidates = []
    for q in queries:
        candidates.extend(hybrid_retrieve(q, jurisdiction=jurisdiction))
    candidates = rerank(candidates, query)
    candidates = filter_by_jurisdiction(candidates, jurisdiction)
    relevant = llm_relevance_filter(candidates, query)

    # 3. Generation with character-level citation grounding
    response = generate_with_citations(query, relevant)

    # 4. Existence + alignment verification
    for cite in response.citations:
        if not westlaw_lookup(cite.id):
            cite.flag = "fabricated"
            continue
        # Each citation carries the proposition it was attached to
        if not alignment_judge(cite, cite.claim):
            cite.flag = "misaligned"
    return response

# Continuous eval pulls failures back into the dataset
@respan.eval(name="citation-grounding")
def citation_grounding_judge(trace):
    output = trace.output
    flagged = [c for c in output.citations if c.flag is not None]
    return {
        "score": 1.0 - (len(flagged) / max(len(output.citations), 1)),
        "fabricated": [c for c in flagged if c.flag == "fabricated"],
        "misaligned": [c for c in flagged if c.flag == "misaligned"],
    }

# Auto-capture production failures into the eval dataset
respan.datasets.append_from_traces(
    name="legal-research-prod-failures",
    filter={"tag": "lawyer_rejected_citation"},
    window="last_30d",
)

# Run the eval on every model or prompt change
results = respan.evaluate(
    target=legal_research_agent,
    dataset="legal-research-prod-failures",
    evals=[citation_grounding_judge],
)
```

The decorators capture the trace, the eval runs on production-derived datasets, and the grounding judge reports per-citation pass/fail. When a model update lands, you re-run this against the same dataset and see immediately whether grounding regressed.
The point is not that this exact code shape is correct for your stack. The point is that the loop has to exist somewhere. Without tracing, you cannot reconstruct what happened in a sanctions hearing. Without grounding eval, you do not know whether your latest prompt change shipped a regression. Without continuous capture, your eval set is frozen at the day you wrote it and is irrelevant six months later.
What to ship this week
If you are building legal AI today and you do not yet have all six layers, the priority order is roughly:
- Tracing first. If you cannot reconstruct a request after the fact, nothing else matters. This is also the easiest layer to ship.
- Citation existence check. A simple post-generation lookup against a legal database catches the most embarrassing failures (pure fabrications) and is a one-day project.
- Alignment judge. Harder, requires LLM-as-judge prompt engineering, but this is where the high-stakes failures get caught.
- Multi-stage retrieval. Higher implementation cost, biggest accuracy lift.
- Continuous eval capture. Easy once tracing is in place; this is mostly plumbing.
- Refusal training and abstention. Hardest, requires fine-tuning or careful prompt design plus product decisions about how to surface "I don't know" to users.
The 2026 sanctions wave is not going to slow down. Charlotin's database adds new cases weekly. Every firm that ends up there is a firm whose AI tools failed silently somewhere in this stack. Build the stack so yours does not.
How Respan fits
The six engineering fixes above (character-level grounding, refusal training, multi-stage retrieval, citation existence and alignment verification, production tracing, continuous eval capture) all live on top of the same substrate, and Respan is built to be that substrate for legal AI teams.
- Tracing: every legal research request captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a sanctions hearing asks how a fabricated cite ended up in a brief, you reconstruct the query, the expanded retrieval queries, the retrieved cases, the prompt version, and the verification flags from one trace instead of digging through scattered logs.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on citation fabrication, citation misalignment, and missed abstentions before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Switching between GPT-4 class models, Claude, and a legal fine-tune for the alignment judge becomes a config change rather than a refactor, and per-matter spend caps keep a runaway agent loop from burning a client's budget on a single brief.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The query expansion prompt, the citation grounding prompt, and the alignment judge prompt all belong in the registry so a partner-approved prompt version is the one shipping to production.
- Monitors and alerts: citation fabrication rate, alignment failure rate, abstention rate, jurisdiction filter hit rate, lawyer override rate. Slack, email, PagerDuty, webhook. A spike in fabrications after a model provider update fires before a single brief leaves the firm.
A reasonable starter loop for legal AI builders:
- Instrument every LLM call with Respan tracing including retrieval spans, generation spans, and verification spans.
- Pull 200 to 500 production legal research traces into a dataset and label them for fabrication, misalignment, and missed abstention.
- Wire two or three evaluators that catch the failure modes you most fear (pure citation fabrication, holding misalignment, and confident answers where the correct response is "I don't have a clear authority").
- Put your query expansion, citation grounding, and alignment judge prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model upgrades, fallbacks, and per-matter spend caps are config rather than code, and a regression on one model can fail over to another while you investigate.
With the Charlotin database adding cases weekly and Q1 2026 sanctions already at $145,000, the firms that stay out of the next quarter's report are the ones whose stack catches the hallucination before the brief is filed.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Building a Citation Grounding Eval for Legal AI: the eval methodology in depth
- ABA Formal Opinion 512 for Engineers: translating the bar's compliance rules into technical requirements
- Building an AI Contract Review Agent: full code walkthrough
- How Legal AI Teams Build LLM Apps in 2026: pillar overview
Try this with your stack. Download the Legal Citation Grounding Starter, a reference repository with a minimal eval set, an alignment judge prompt, and a tracing setup you can drop into an existing legal AI product. If you want to talk through your specific architecture, book a call with our team.
