Legal AI products live and die on citations. Every claim in a memo, brief, or contract needs to point to an authority that actually exists, that actually says what the claim says it says, and that comes from a jurisdiction that actually binds. Get any of these wrong and you ship a hallucination. Ship enough hallucinations and your firm pays sanctions or your tool gets pulled.
The 2024 Stanford RegLab study showed that even legal-specific RAG systems like Lexis+ AI and Westlaw AI-Assisted Research hallucinate on more than 1 in 6 queries. General-purpose LLMs fabricate citations on 30 to 45 percent of legal research queries. These numbers are not improving on their own with bigger models. The fix is engineering: a grounding eval that catches the failures before users do.
This post is a build guide. It covers how to construct a golden dataset for citation grounding, how to write an LLM-as-judge that actually catches misalignment (the hard failure mode), how to set up jurisdiction filtering, and how to wire continuous eval capture so your grounding scores stay current as models update.
What "grounding" actually means
The word gets used loosely. For a citation grounding eval to be useful, you need to break it into three independent checks.
Existence. Does the citation point to a real legal authority? A real case in a real court, decided on a real date, reported in a real reporter. A real statutory provision in the cited code. A real regulation in the cited CFR section. This is the easiest layer to verify automatically.
Alignment. Does the cited authority actually support the asserted proposition? A real case can be cited for a holding it does not have. The case exists, the cite passes a Bluebook check, and the proposition is wrong. This is the hardest layer to verify and the one that fails most often in production.
Jurisdiction. Is the cited authority binding (or at least persuasive) for the matter at hand? A 9th Circuit case does not bind the Southern District of New York on a federal question; at best it persuades. A Texas Supreme Court case is irrelevant to a California state law dispute. This is mid-difficulty: easy to check mechanically once you know the matter's jurisdiction, easy to miss if your retrieval pipeline does not track it.
A complete grounding eval runs all three and reports them separately. Aggregating into a single score hides which layer is failing, which matters because the fixes are different.
Building the golden dataset
The eval is only as good as the dataset behind it. A few principles for legal grounding datasets.
Source the cases yourself
Do not generate the dataset by asking GPT-4 to make up legal questions. The questions and answers will inherit the model's biases and miss the long-tail cases that hallucinations actually happen on. Pull real questions from real practice areas. Bar exam questions, casebook hypotheticals, BigLaw associate research memos, public ALI restatement commentary. Mix common questions with deliberately obscure ones.
Caselaw Access Project (Harvard) and CourtListener (Free Law Project) are the workhorse public sources. Both provide bulk APIs and indexed search. For paid sources, Westlaw and Lexis APIs cover what the public sources miss. Build your dataset across all of them.
Include the failure modes deliberately
A useful eval set is not a representative sample of legal questions. It is a stratified sample that overweights the failure modes you want to catch. The structure I recommend:
- Easy positives. Famous cases that any decent legal AI should get right. Marbury v. Madison for judicial review, Erie for choice of law in diversity. These are sanity checks.
- Long-tail positives. Real but obscure cases that test whether retrieval actually works. Pull from BigLaw practice memos in niche areas: ERISA fiduciary breach, bankruptcy preference exceptions, FCC spectrum auction rules.
- Adversarial misalignment. Questions where there exists a real case with a name that sounds on-point but holds the opposite. The model that pattern-matches case names will fail these. The model that actually reads the holdings will pass.
- Adversarial fabrication. Questions where you have engineered the prompt to push the model toward inventing a case (asking for a holding that does not exist anywhere in the corpus).
- Jurisdiction traps. Questions where the on-point case is from the wrong jurisdiction. The right answer is "no binding authority in this jurisdiction" or to cite the persuasive authority with appropriate qualifying language.
50 cases per stratum is enough to start. Scale to 200 to 500 once the eval pipeline is working. Beyond that you start hitting diminishing returns on dataset diversity.
Annotation requires a lawyer
Every entry in the golden dataset needs a ground-truth annotation: which citations are correct, which holdings each cite supports, which jurisdiction the question lives in. This annotation cannot be done by an engineer alone. Get a lawyer (junior associate, contract attorney, or law school RA) to spend the time. Their hourly rate is worth it. A wrong gold annotation propagates through every eval run after, and a contaminated golden set is worse than no golden set.
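To make the annotation concrete, here is one possible shape for a golden entry. The GoldenCase type and every field name are illustrative, not a required schema; the stratum values correspond to the five strata above.

```python
# One possible golden-dataset entry shape; the GoldenCase type and all
# field names are illustrative, not a required schema.
from dataclasses import dataclass

@dataclass
class GoldenCase:
    question: str                  # the research question as a lawyer would pose it
    stratum: str                   # "easy_positive" | "long_tail_positive" |
                                   # "adversarial_misalignment" |
                                   # "adversarial_fabrication" | "jurisdiction_trap"
    matter_jurisdiction: str       # e.g. "nysd" for S.D.N.Y.
    question_type: str             # "federal" or "state"
    gold_citations: list[tuple[str, str]]  # (citation, proposition it supports)
    annotator: str                 # the reviewing attorney, for the audit trail
```

The stratum field is what makes the per-stratum reporting later in this post possible, and the annotator field gives you someone to ask when an eval result and a gold label disagree.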
The Harvey BigLaw Bench is a useful reference here. Harvey publicly noted in 2025 that they scrapped their proprietary fine-tuned legal model after frontier models started outperforming it on their own internal benchmark. The benchmark mattered more than the model. That is the position you want to be in: your eval is the ground truth, and any model is a candidate.
The existence check
This is the simplest layer. Implementation is a lookup function against the canonical legal databases. A few practical notes.
Citations come in many formats. Roe v. Wade, 410 U.S. 113 (1973). 410 U.S. 113. Roe, 410 U.S. at 116. The model emits all of these. Your existence check has to normalize them all to a canonical key (volume, reporter, page) before lookup. The Free Law Project's eyecite library is the standard tool for this; integrate it into your post-processing.
Once normalized, the lookup is fast. Westlaw and Lexis both have API endpoints that take a citation and return either the case metadata or a 404. CourtListener has a public REST API for the same. Caselaw Access Project has a bulk export you can index locally if you want to avoid per-query API costs.
```python
# Existence check skeleton. legal_db.westlaw_lookup is a placeholder for
# whatever citation lookup backend you use (Westlaw, Lexis, CourtListener,
# or a locally indexed Caselaw Access Project export).
from eyecite import get_citations

from legal_db import westlaw_lookup

def check_existence(output_text: str) -> dict:
    citations = get_citations(output_text)
    results = {"valid": [], "fabricated": []}
    for cite in citations:
        # eyecite exposes volume/reporter/page via the citation's groups dict;
        # short-form cites ("Roe, 410 U.S. at 116", "Id.") should first be
        # resolved to their antecedents with eyecite's resolve_citations
        groups = cite.groups
        normalized = (groups.get("volume"), groups.get("reporter"), groups.get("page"))
        match = westlaw_lookup(normalized)
        if match:
            results["valid"].append((cite, match))
        else:
            results["fabricated"].append(cite)
    return results
```

Existence check failures are usually the easiest hallucinations to fix at the model layer (better retrieval, refusal training, character-level grounding). They are also the most embarrassing in production, because they look like the model just made up a case name. Catch them.
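The skeleton above assumes a Westlaw lookup. If you are starting from public data, CourtListener ships a citation-lookup endpoint built for exactly this check. A sketch follows; the endpoint path and response fields are assumptions to verify against the current CourtListener API documentation before you rely on them.

```python
# Existence check against CourtListener's citation-lookup API (public,
# rate-limited). Endpoint path and response shape are assumptions to
# verify against the current docs.
import requests

def check_existence_courtlistener(output_text: str, api_token: str) -> dict:
    resp = requests.post(
        "https://www.courtlistener.com/api/rest/v3/citation-lookup/",
        headers={"Authorization": f"Token {api_token}"},
        data={"text": output_text},
        timeout=30,
    )
    resp.raise_for_status()
    results = {"valid": [], "fabricated": []}
    for hit in resp.json():
        # a 200 status with matched clusters means the cite resolved to a real case
        if hit.get("status") == 200 and hit.get("clusters"):
            results["valid"].append(hit["citation"])
        else:
            results["fabricated"].append(hit["citation"])
    return results
```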
The alignment check
This is the layer where most production legal AI is weakest, and where the most damaging failures live. The cite is real. The Bluebook format is right. The jurisdiction matches. The holding does not say what the brief claims it says.
The alignment check requires reading the actual content of the cited authority and judging whether it supports the asserted proposition. There is no API for this. You need an LLM-as-judge.
Judge prompt design
A bad alignment judge is one that reads the proposition and the case excerpt charitably and looks for any way the case could support the claim. This is the same fluency-truth bias that produces hallucinations in the first place. You will get false negatives (judge passes a misaligned cite).
A good alignment judge is skeptical by default, requires direct textual support, and outputs structured rationale.
```
You are a senior litigator reviewing a junior associate's brief for misuse of authority.

CLAIM: {asserted_proposition}
CITATION: {citation}
CASE EXCERPT: {retrieved_case_text}

Your task: determine whether the case excerpt provides DIRECT, EXPLICIT support for the claim. The standard is strict. The case must contain language that a court would accept as supporting the proposition. Inferring, extending, or analogizing from the case is NOT direct support and counts as misuse.

Respond in JSON:
{
  "supports_directly": true | false,
  "rationale": "<2-3 sentences>",
  "supporting_text": "<exact quote from the case excerpt that supports the claim, or null>",
  "misuse_type": null | "overextension" | "wrong_holding" | "dictum_treated_as_holding" | "factually_distinguishable"
}
```

A few notes on this prompt. The "senior litigator reviewing a junior associate" framing biases the judge toward skepticism, which is what you want. Requiring an exact supporting quote forces grounding in the source text rather than the judge's prior knowledge. Categorizing the misuse type in the failure case gives you structured data to analyze later (are most failures overextension, or wrong-holding?).
Calibrate the judge against your golden dataset. Run it on the misalignment stratum, count how many it catches, count its false-positive rate against the easy-positive stratum. Iterate until you get past 90 percent recall on misalignment with under 5 percent false positive rate on real positives. Below those thresholds, your judge is not yet useful as a production eval.
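A minimal calibration harness under these assumptions: each probe is a (claim, citation, excerpt) triple with a ground-truth support label, and judge_fn wraps the prompt above and returns its parsed JSON.

```python
# Calibration sketch: recall on the misalignment stratum, false-positive
# rate on the easy positives. probes and judge_fn are assumptions: each
# probe is (claim, citation, excerpt, truly_supports), and judge_fn
# returns the parsed JSON from the judge prompt above.
def calibrate_judge(judge_fn, probes):
    misaligned = [p for p in probes if not p[3]]
    positives = [p for p in probes if p[3]]

    caught = sum(1 for claim, cite, text, _ in misaligned
                 if not judge_fn(claim, cite, text)["supports_directly"])
    flagged = sum(1 for claim, cite, text, _ in positives
                  if not judge_fn(claim, cite, text)["supports_directly"])

    recall = caught / len(misaligned)                # target: above 0.90
    false_positive_rate = flagged / len(positives)   # target: below 0.05
    return recall, false_positive_rate
```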
Judge model selection
Use a strong model for the judge. The misalignment failure mode is exactly the kind of subtle reasoning task where weaker models hallucinate. Claude Opus or GPT-5 class is appropriate; cheaper models will give you noisy alignment scores.
The cost is real. If you grade 1,000 generations and each has 4 citations, that is 4,000 judge calls. At Opus prices this is meaningful. The optimization: cache judge results on the (claim, citation) pair, since the same cite supporting the same claim never needs to be re-judged. In practice this drops cost dramatically once your eval set stabilizes.
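A sketch of that cache, keyed on a hash of the (claim, citation) pair; run_alignment_judge is a hypothetical stand-in for however you actually invoke the judge.

```python
# Judge-result cache keyed on the (claim, citation) pair. The excerpt is
# not part of the key: given a fixed corpus, the same cite retrieves the
# same text. run_alignment_judge is a hypothetical stand-in for your
# actual judge call.
import hashlib

_judge_cache: dict = {}

def cached_alignment_judge(claim: str, citation: str, case_excerpt: str) -> dict:
    key = hashlib.sha256(f"{claim}\x00{citation}".encode()).hexdigest()
    if key not in _judge_cache:
        _judge_cache[key] = run_alignment_judge(claim, citation, case_excerpt)
    return _judge_cache[key]
```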
The jurisdiction check
Mechanically simple but production teams often miss it. Every matter has a jurisdiction (or a set of jurisdictions for choice-of-law cases). Every citation has a court. The check is: is the cited court's decision binding or persuasive in the matter's jurisdiction?
Build a small lookup table that encodes the basic structure: federal hierarchy (Supreme Court binds all, Circuit binds districts within it), state hierarchy (state supreme court binds lower state courts within the state), cross-jurisdiction persuasion (other circuits are persuasive but not binding for the federal question). For the binding/persuasive distinction, the table needs to know the legal question type (federal versus state law).
```python
# Jurisdiction check sketch. Court identifiers here are illustrative
# (CourtListener-style IDs: "scotus", "ca9", "nysd"); adapt to whatever
# your citation metadata uses.
def check_jurisdiction(citation, matter_jurisdiction, question_type):
    cited_court = citation.court
    if question_type == "federal" and cited_court == "scotus":
        return "binding"  # the Supreme Court binds every court on federal questions
    if question_type == "federal" and cited_court.startswith("ca"):
        # circuit decision; assume the matter sits in a federal district court
        if same_circuit(cited_court, matter_jurisdiction):
            return "binding"
        return "persuasive"  # out-of-circuit authority persuades, never binds
    # ... continue for state hierarchies, cross-state, federal question in state court
```

The output of this check is one of three labels per cite: binding, persuasive, or irrelevant. The eval scores irrelevant cites as failures only if the brief presents them as binding without qualifying language. Persuasive cites without "this court has not addressed the question, but the 9th Circuit has held" framing also count as failures, though softer ones.
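The same_circuit helper above carries the real content: the mapping from each circuit to the district courts it binds. A sketch of that table, with two circuits filled in as examples and CourtListener-style court IDs assumed throughout; complete the rest from the federal court structure.

```python
# Circuit-to-district membership, two circuits shown as examples.
# Court IDs follow CourtListener conventions ("nysd" = S.D.N.Y.) -- an
# assumption; swap in your own identifiers.
CIRCUIT_DISTRICTS = {
    "ca2": {"nysd", "nyed", "nynd", "nywd", "ctd", "vtd"},
    "ca9": {"cand", "cacd", "caed", "casd", "ord", "wawd", "waed",
            "azd", "nvd", "idd", "mtd", "akd", "hid"},
    # ... remaining circuits
}

def same_circuit(cited_court: str, matter_jurisdiction: str) -> bool:
    return matter_jurisdiction in CIRCUIT_DISTRICTS.get(cited_court, set())
```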
Wiring it together
The full eval runs all three checks on every output and reports the pass rate per stratum.
```python
# Citation grounding eval, three-tier
from collections import Counter

import respan

@respan.eval(name="citation-existence")
def existence_eval(trace):
    result = check_existence(trace.output.text)
    return {
        "score": len(result["valid"]) / max(len(result["valid"]) + len(result["fabricated"]), 1),
        "fabricated_count": len(result["fabricated"]),
    }

@respan.eval(name="citation-alignment")
def alignment_eval(trace):
    output = trace.output
    if not output.citations:
        return {"score": None, "skipped": True}  # nothing to judge
    # alignment_judge wraps the judge prompt from the previous section
    judges = [alignment_judge(c, output.claim) for c in output.citations]
    supported = [j for j in judges if j["supports_directly"]]
    return {
        "score": len(supported) / len(judges),
        "misuse_breakdown": Counter(
            j["misuse_type"] for j in judges if not j["supports_directly"]
        ),
    }

@respan.eval(name="citation-jurisdiction")
def jurisdiction_eval(trace):
    output = trace.output
    matter = trace.metadata["matter_jurisdiction"]
    question_type = trace.metadata["question_type"]
    correct = 0
    for cite in output.citations:
        binding_status = check_jurisdiction(cite, matter, question_type)
        if binding_status != "irrelevant":
            correct += 1
        # crude qualifying-language check; tighten for production
        elif "this court has not addressed" in trace.output.text:
            correct += 1
    return {"score": correct / max(len(output.citations), 1)}

# Run all three on the golden dataset, broken out by stratum
results = respan.evaluate(
    target=legal_research_agent,
    dataset="legal-grounding-golden-v1",
    evals=[existence_eval, alignment_eval, jurisdiction_eval],
    group_by=["stratum"],
)
```

The group_by=["stratum"] is the important piece. You want to see the existence pass rate on adversarial fabrication separately from the easy positives. A model that scores 95 percent on existence overall but 60 percent on the adversarial fabrication stratum has a real problem hidden by the aggregate.
Continuous eval capture
The eval set is not static. The single highest-leverage move is to wire production failures back into the dataset.
When a lawyer rejects a citation in production (clicks "this is wrong" or edits the cite out of the draft), capture that interaction. The original query, the retrieved context, the generated output, the rejected cite, the lawyer's edit if any. All of it goes into a "production failures" dataset that you re-run on every model upgrade and prompt change.
```python
# Capture rejected citations as eval cases
respan.datasets.append_from_traces(
    name="legal-grounding-prod-failures",
    filter={"feedback_signal": "citation_rejected"},
    window="last_30d",
    annotations_required=["correct_citation"],  # routed to attorney annotators
)

# Run on every model release
@respan.experiment(trigger="on_model_change")
def grounding_regression_check(model):
    return respan.evaluate(
        target=legal_research_agent.with_model(model),
        dataset="legal-grounding-prod-failures",
        evals=[existence_eval, alignment_eval, jurisdiction_eval],
    )
```

This is the loop that catches the regression Stanford RegLab specifically warned about: legal AI tool performance fluctuates across model updates in unpredictable ways. The frozen golden dataset catches the regressions you can predict. The continuously captured production failures catch the regressions you cannot predict, because they encode the actual ways your users break the system.
What to ship and in what order
A staged rollout that gets you to a defensible eval in roughly four weeks of engineering time:
- Week 1. Existence check against your retrieval corpus or a public legal database. Wire it into your tracing layer so every production output gets scored automatically.
- Week 2. Build the golden dataset. 50 cases per stratum, lawyer-annotated. This is the highest-impact week and the one most often skipped.
- Week 3. Implement the alignment judge. Calibrate against the golden set until recall on misalignment is above 90 percent.
- Week 4. Jurisdiction check, continuous capture, and CI integration. Eval runs on every prompt or model change before deploy.
The week 2 golden dataset is what separates teams that have a real eval from teams that have a vibe check. Skip weeks 1, 3, or 4 and you have a partial eval. Skip week 2 and you have nothing useful.
How Respan fits
A citation grounding eval is only as good as the substrate that runs it. Respan is the tracing, eval, gateway, and prompt layer underneath, so existence, alignment, and jurisdiction checks all run against the same captured production traffic that your lawyers actually flagged.
- Tracing: every legal research generation captured as one connected trace, including retrieval, reranking, the LLM-as-judge call, and the final cited output. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a cite gets rejected in production, you can replay the entire chain that produced it and see whether existence, alignment, or jurisdiction was the failing layer.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on fabricated citations, misaligned holdings, and wrong-jurisdiction cites before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. The alignment judge runs on Claude Opus or GPT-5 class models through the gateway, with caching on the (claim, citation) pair so repeat judgments do not blow up cost.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The skeptical senior-litigator judge prompt, the retrieval rewriter, and the refusal-with-qualifying-language template all belong in the registry so changes are tracked and reversible.
- Monitors and alerts: existence pass rate per stratum, alignment recall on the misalignment stratum, fabricated-cite count per 1,000 outputs, jurisdiction-irrelevant rate, lawyer rejection rate. Slack, email, PagerDuty, webhook. Alert when adversarial fabrication recall drops below 90 percent after a model update.
A reasonable starter loop for legal AI builders:
- Instrument every LLM call with Respan tracing including retrieval spans, judge spans, and the final structured citation list.
- Pull 200 to 500 production legal research outputs into a dataset and label them for existence, alignment, and jurisdiction correctness.
- Wire two or three evaluators that catch the failure modes you most fear (fabricated citations, overextension of real holdings, wrong-jurisdiction cites presented as binding).
- Put your alignment judge prompt, retrieval rewriter, and refusal template behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so judge calls cache on (claim, citation) pairs and frontier models stay swappable as the next Opus or GPT class lands.
Without this loop, your firm ships a hallucinated cite into a real brief, pays sanctions, and pulls the tool.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Why Legal AI Still Hallucinates Citations: the failure modes this eval catches
- Building an AI Contract Review Agent: applying grounding eval to contract review
- ABA Formal Opinion 512 for Engineers: why audit trails matter for evals
- How Legal AI Teams Build LLM Apps in 2026: pillar overview
Get started. Download the Legal Citation Grounding Starter, which includes a 100-case sample golden dataset, the alignment judge prompt above with calibration data, and a Respan eval pipeline you can adapt to your retrieval corpus. To talk through eval architecture for your specific legal AI product, book a call.
