Spellbook is used by 4,000+ legal teams across 80 countries. Ironclad's Jurist agent ships to AmLaw 100 firms. Harvey's Workflow Agents process more than 400,000 agentic queries per day. GC AI, Definely, LegalOn, Ivo, LEGALFLY, and Luminance Eve all compete in the same in-house contract review category.
The market is large, the architecture has stabilized, and the engineering challenges are well-documented. If you are building a contract review agent in 2026, you do not need to invent the architecture. You need to build it correctly: clause segmentation that does not lose context, playbook enforcement that produces consistent redlines, citation grounding that ties every suggestion to a source, and an eval pipeline that catches regressions before they reach lawyers.
This post is a complete walkthrough. It assumes you are starting from scratch (or replacing a prototype that is showing its limits in production) and want a defensible architecture you can ship to in-house counsel without embarrassing yourself or them. Code samples use the Respan SDK; the patterns translate to other observability stacks.
What "contract review" actually means
Before architecture, scope. "AI contract review" covers at least four different workflows that share infrastructure but have different evaluation targets.
Pre-execution review is what Spellbook and GC AI optimize for: the lawyer (or business stakeholder) receives a counterparty's draft and needs to redline it against firm or company standards. Output is a marked-up document with suggested edits, rationale, and risk flags. This is the highest-volume use case and the most common target for new builds.
Post-execution analysis is what Kira Systems and Luminance pioneered: ingest a portfolio of executed contracts, extract structured data (parties, obligations, dates, governing law, indemnification scope), and surface patterns or risks. Output is structured fields and a queryable database.
Drafting from scratch is what Spellbook Associate and Harvey's Drafting Agent target: given a deal type and key terms, generate a first draft. Output is a complete document.
Cross-document consistency is the newer category that GC AI's Projects and Harvey's Vault address: review a counterparty draft against existing executed contracts (existing DPAs, MSAs, side letters) to catch conflicts. Output is conflict identification with specific clause references.
This post focuses on pre-execution review because it is the highest-volume entry point and most of the patterns generalize to the others. I will note the differences at relevant points.
Architecture overview
The minimum viable architecture for pre-execution contract review:
[Counterparty Contract (.docx)]
|
v
[Document parsing + clause segmentation]
|
v
[Playbook retrieval (RAG over firm/company standards)]
|
v
[Per-clause analysis agent]
| - compare against playbook
| - assess risk
| - generate redline + rationale
v
[Citation grounding (every redline cites playbook source)]
|
v
[Multi-clause consistency check (cross-clause conflicts)]
|
v
[Output: marked-up document + risk summary]
|
v
[Lawyer review + accept/reject signals]
|
v
[Continuous eval capture]
Each block has subtleties that matter in production. The next sections walk through each.
Document parsing and clause segmentation
The single biggest source of subtle bugs in contract review systems is bad clause segmentation. If your segmenter splits a clause across two chunks, the downstream analyzer sees half the meaning and produces wrong redlines. If it merges two clauses into one chunk, you lose granularity in the output. Both failure modes are common and both are hard to catch in eval unless you specifically test for them.
The standard approach has three layers.
Format-aware parsing. Use a document parser that preserves structure: headings, numbered lists, bullets, definition sections, and defined terms. python-docx handles .docx; for PDFs you need OCR plus layout-aware parsing (Unstructured, Reducto, or LlamaParse). Do not flatten the document to plaintext before segmenting; the structural cues are how clauses are identified.
Heuristic segmentation by structural markers. Most contracts have explicit clause boundaries: numbered section headers (1.1, 1.2, 1.3), bold captions ("Indemnification", "Limitation of Liability"), or all-caps titles. A regex pass over the parsed document gets you 80 percent of the way to correct segmentation.
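A minimal sketch of that heuristic pass, assuming the parser yields paragraph objects with a text field and a bold flag (both assumptions about your parser's output); the regexes are illustrative starting points, not a complete boundary detector:

import re
from dataclasses import dataclass

# Typical boundary markers: numbered headings ("7.", "7.1", "7.1.2") and
# all-caps or bold captions ("INDEMNIFICATION", "Limitation of Liability").
SECTION_NUMBER = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+\S")
ALL_CAPS_CAPTION = re.compile(r"^[A-Z][A-Z &/\-]{3,}$")

@dataclass
class Clause:
    heading: str
    text: str = ""

def heuristic_segment(paragraphs) -> list[Clause]:
    """Split parsed paragraphs into clauses at structural boundaries."""
    clauses: list[Clause] = []
    for para in paragraphs:  # assumes .text and .is_bold on each paragraph
        is_boundary = bool(
            SECTION_NUMBER.match(para.text)
            or ALL_CAPS_CAPTION.match(para.text.strip())
            or para.is_bold
        )
        if is_boundary or not clauses:
            clauses.append(Clause(heading=para.text.strip()))
        else:
            clauses[-1].text += ("\n" if clauses[-1].text else "") + para.text
    return clauses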
LLM-assisted refinement. The hard cases are nested clauses (a sub-provision that is functionally its own clause), embedded definitions (a sentence in a recitals section that defines a term used throughout), and continuations (a clause whose meaning depends on the prior section's preamble). For these, a small LLM pass that takes the heuristic segmentation and asks "did this segmenter cut a clause in half" catches most of the misses.
# Clause segmentation pipeline
import respan

@respan.trace
def segment_contract(contract_path: str) -> list[Clause]:
    parsed = parse_docx(contract_path)
    initial_clauses = heuristic_segment(parsed)
    refined = llm_refine_segmentation(initial_clauses, parsed)
    return refined

Trace the segmentation step. When you find a downstream redline that looks wrong, the first thing to check is whether the segmenter correctly identified the clause it was analyzing. Tracing makes this a five-second answer instead of a 30-minute reproduction.
Playbook ingestion
The playbook is the firm's or company's standard. It can be a structured document (a clause library with preferred and acceptable language for each clause type), a set of executed contracts treated as exemplars, or both. Modern systems use both.
Structured clause library. A YAML or database structure that, for each clause type, includes:
- Preferred language
- Acceptable variations
- Unacceptable patterns
- Risk level (high/medium/low) for deviations
- Negotiation guidance ("we can drop this for deals under $X")
This is what your lawyers maintain. It needs an editor UI that is not the engineer's IDE.
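For a concrete sense of the shape, here is a hypothetical entry rendered as a Python dict; every field name and the sample language are illustrative, not a schema your lawyers must adopt:

# Hypothetical clause-library entry (field names and language are illustrative).
LIMITATION_OF_LIABILITY = {
    "clause_type": "limitation_of_liability",
    "preferred_language": "Neither party's aggregate liability shall exceed the fees paid "
                          "in the twelve (12) months preceding the claim.",
    "acceptable_variations": [
        "Cap raised to 2x trailing twelve-month fees",
    ],
    "unacceptable_patterns": [
        "Uncapped liability for either party",
        "Carve-outs broad enough to swallow the cap",
    ],
    "risk_level": "high",
    "negotiation_guidance": "A 1.5x cap is acceptable for deals under the approval threshold.",
}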
Exemplar corpus. A collection of past executed contracts of the same deal type. Spellbook calls this Compare to Market and uses a database of 200,000+ real-world agreements. For a new product, you can seed with executed contracts from your design partners. The exemplar corpus answers the question "what does this clause usually look like in deals like this," which is different from "what is our preferred language."
The playbook is retrieved per-clause: for each clause in the counterparty draft, retrieve the matching playbook entry plus 3 to 5 exemplars of the same clause type. This is RAG with a specific structure: the retrieval key is the clause type (indemnification, governing law, IP assignment, etc.), not arbitrary similarity.
Clause-type classification is a separate step. A small fine-tuned classifier or a few-shot LLM prompt assigns each segmented clause to one of your taxonomy categories. Get the taxonomy right early; rebuilding it later means re-annotating the exemplar corpus.
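A sketch of the few-shot variant, reusing the gateway call pattern from later in this post; the taxonomy list and prompt wording are assumptions to adapt to your own categories:

CLAUSE_TAXONOMY = [
    "indemnification", "limitation_of_liability", "governing_law",
    "ip_assignment", "confidentiality", "term_and_termination", "other",
]

@respan.trace
def classify_clause_type(clause: Clause) -> str:
    # Ask a cheap model for exactly one label from the taxonomy.
    prompt = (
        "Classify this contract clause as exactly one of: "
        + ", ".join(CLAUSE_TAXONOMY)
        + ".\n\nClause:\n"
        + clause.text
        + "\n\nRespond with the label only."
    )
    response = gateway.complete(
        model="anthropic/claude-haiku",  # classification does not need a frontier model
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.content.strip().lower()
    return label if label in CLAUSE_TAXONOMY else "other"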
@respan.trace
def retrieve_playbook(clause: Clause, playbook_id: str) -> PlaybookContext:
    # playbook_id selects which playbook (practice group, deal type) to consult
    clause_type = classify_clause_type(clause)
    standard = playbook_db.get(playbook_id, clause_type)
    exemplars = exemplar_db.search(clause_type, k=5)
    return PlaybookContext(
        clause_type=clause_type,
        standard=standard,
        exemplars=exemplars,
    )

Per-clause analysis
For each clause in the counterparty draft, the analyzer takes the clause text and the playbook context and produces:
- A risk assessment (low/medium/high)
- A list of issues identified
- Suggested redline text
- Rationale for each suggestion, citing the playbook source
This is where prompt design earns its keep. A few patterns that work:
Structured output, not prose. Use a JSON schema or an Anthropic tool-call format that forces the model to output specific fields. Free-form prose is harder to parse, harder to evaluate, and tends to produce more hallucinated rationale.
Cite the playbook explicitly. In the prompt, instruct the model that every issue and every suggestion must reference a specific section of the playbook. Reject outputs where this is missing.
Show the standard, the exemplars, and the counterparty clause side by side. Do not concatenate them in the prompt; structure them with explicit headers. The model produces more consistent comparisons when the comparison structure is given.
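A sketch of what that structured prompt can look like; the header names, ordering, and attribute access (standard.preferred_language, ex.deal_name) are assumptions about your playbook objects:

def build_analysis_prompt(clause_text: str, standard, exemplars) -> str:
    # Fixed headers keep the comparison structure identical across every clause.
    exemplar_block = "\n\n".join(
        f"EXEMPLAR {i + 1} ({ex.deal_name}):\n{ex.text}"
        for i, ex in enumerate(exemplars)
    )
    return (
        "=== OUR STANDARD (playbook) ===\n"
        f"{standard.preferred_language}\n\n"
        "=== MARKET EXEMPLARS ===\n"
        f"{exemplar_block}\n\n"
        "=== COUNTERPARTY CLAUSE ===\n"
        f"{clause_text}\n\n"
        "Compare the counterparty clause against our standard and the exemplars. "
        "Identify issues, assign a risk level, propose redline text, and give a rationale. "
        "Every issue and redline must cite a specific playbook sentence."
    )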
@respan.trace
def analyze_clause(clause: Clause, context: PlaybookContext) -> ClauseAnalysis:
    prompt = build_analysis_prompt(
        clause_text=clause.text,
        standard=context.standard,
        exemplars=context.exemplars,
    )
    response = gateway.complete(
        model=select_model_for_clause_type(context.clause_type),
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_schema", "schema": ClauseAnalysis.schema()},
    )
    return ClauseAnalysis.parse(response.content)

The select_model_for_clause_type dispatch is where cost optimization lives. NDA review against a standard playbook does not need Opus; Haiku-class models perform indistinguishably on low-ambiguity clauses. Complex M&A indemnification clauses with multi-party indemnitee structures need the strongest model you have. Routing by clause complexity drops your average cost per contract by a factor of 5 to 10 without measurable quality loss.
Citation grounding for redlines
Every issue and every redline suggestion needs to cite a specific source. This is how the lawyer trusts the suggestion: not "the AI thinks this clause is risky" but "this clause is risky because it deviates from playbook section 4.2 in the following way."
The grounding has to be character-level, not at the document level. "Playbook section 4.2" is not enough; the lawyer needs to see exactly which sentence of section 4.2 the model is referencing. This is the same character-level citation pattern that legal research agents use, applied here to playbook references.
Implementation:
- The playbook is indexed with character offsets per sentence
- The analyzer prompt is structured so the model emits citation tokens that reference offsets
- A post-processing step validates each citation actually maps to a real offset in the playbook
- Citations that fail validation flag the clause for human review (do not silently drop them; surface them so the lawyer knows the model wanted to cite something but could not)
The same pattern applies to citing exemplars. "Five out of seven recent deals included this carveout" is a useful claim, but only if you can show the lawyer which seven deals.
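A minimal sketch of the validation step described above, assuming each citation carries a section_id plus character offsets into the indexed playbook; the field names are illustrative:

@respan.trace
def ground_citations(analysis: ClauseAnalysis) -> None:
    """Verify each playbook citation maps to real text; flag failures for human review."""
    for citation in analysis.citations:
        section = playbook_index.get(citation.section_id)
        in_bounds = (
            section is not None
            and 0 <= citation.start < citation.end <= len(section.text)
        )
        citation.verified = in_bounds
        if not in_bounds:
            # Never drop a failed citation silently; surface it to the lawyer.
            analysis.needs_human_review = True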
Multi-clause consistency check
A clause-by-clause review misses cross-clause conflicts. The indemnification clause references "Confidential Information" but the definitions section limits "Confidential Information" to written disclosures only; the indemnification scope is therefore narrower than the lawyer expected. The governing law is California but the dispute resolution is JAMS arbitration in New York. The IP assignment carves out background IP but the deliverables clause assumes all created IP belongs to the counterparty.
These conflicts are common, they are exactly the kind of error a junior associate also misses, and they are precisely where AI review can add value above and beyond clause-level checks.
The check is a second pass over the analyzed contract:
@respan.trace
def consistency_check(analyses: list[ClauseAnalysis]) -> list[Conflict]:
    # Build a structured representation of the contract's commitments
    structured = build_obligation_graph(analyses)
    # Run conflict detection rules + LLM-assisted check for novel conflicts
    rule_based = check_known_conflict_patterns(structured)
    llm_based = llm_consistency_check(structured)
    return merge_conflicts(rule_based, llm_based)

Rule-based conflict patterns catch the well-known cases: governing-law/dispute-resolution mismatches, definition scope mismatches, payment-term inconsistencies. The LLM pass catches novel conflicts that have not been encoded as rules. Both should run: the rule-based check is cheap and catches 60 percent of conflicts; the LLM pass is more expensive and catches the rest.
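One of those well-known rules, sketched under the assumption that the obligation graph exposes governing law and arbitration seat as plain fields; the Conflict fields are illustrative:

def check_governing_law_vs_dispute_resolution(structured) -> list[Conflict]:
    """Flag contracts whose governing law and arbitration seat name different jurisdictions."""
    conflicts = []
    governing_law = structured.get("governing_law")        # e.g. "California"
    arbitration_seat = structured.get("arbitration_seat")  # e.g. "New York"
    if governing_law and arbitration_seat and governing_law != arbitration_seat:
        conflicts.append(Conflict(
            kind="governing_law_vs_dispute_resolution",
            detail=f"Governing law is {governing_law} but arbitration is seated in {arbitration_seat}.",
            clause_refs=["governing_law", "dispute_resolution"],
        ))
    return conflicts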
Eval setup
This is where most contract review systems are weakest. They have a demo that looks good and a roadmap that says "improve quality." That is not an eval.
A proper eval suite for contract review has three layers.
Clause-level rubric eval. For each clause type in your taxonomy, a rubric specifying what a correct review looks like. Did the model identify the playbook deviation? Did it correctly classify the risk level? Did the redline text preserve the deal structure while addressing the concern?
This is annotated by lawyers on a golden dataset of 200 to 500 contracts, and it is the workhorse of your CI. Every prompt change, every model upgrade, this eval runs.
End-to-end eval. A smaller set (50 to 100 contracts) where you measure outputs against full lawyer-produced reviews. The metric is alignment: of the issues a senior lawyer flagged, what percentage did the AI flag? Of the issues the AI flagged, what percentage did the lawyer agree were real issues?
This catches the failure modes the clause-level rubric misses: false positives, missed cross-clause conflicts, redlines that are technically correct but commercially unreasonable.
Production capture eval. Every lawyer's accept/reject signal in production goes into an evolving dataset. The interesting cases are the rejected suggestions: why did the lawyer reject this redline? Was it a hallucination, a context miss, an over-aggressive risk flag, or a legitimate disagreement? Annotated rejection reasons feed back into your prompt and playbook iteration.
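The capture itself can be a thin helper that ties the lawyer's decision back to the originating trace; respan.datasets.append below is a hypothetical API named for illustration, not a documented call:

def record_lawyer_feedback(trace_id: str, suggestion_id: str, accepted: bool, reason: str | None = None):
    # Accept/reject plus an optional annotated rejection reason, keyed to the trace.
    respan.datasets.append(  # hypothetical helper, shown for illustration only
        "contract-review-production",
        {
            "trace_id": trace_id,
            "suggestion_id": suggestion_id,
            "accepted": accepted,
            # e.g. "hallucination", "context_miss", "over_aggressive", "disagreement"
            "rejection_reason": reason,
        },
    )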
# Eval suite for contract review
@respan.eval(name="clause-rubric")
def clause_rubric_eval(trace, gold):
    analysis = trace.output
    return {
        "issues_recall": recall(analysis.issues, gold.issues),
        "issues_precision": precision(analysis.issues, gold.issues),
        "risk_level_correct": analysis.risk_level == gold.risk_level,
        "redline_quality": llm_judge_redline(analysis.redline, gold.redline),
    }

@respan.eval(name="grounding")
def grounding_eval(trace):
    citations = extract_playbook_citations(trace.output)
    fabricated = [c for c in citations if not playbook_lookup(c)]
    return {"valid_rate": 1 - len(fabricated) / max(len(citations), 1)}

# Run on every prompt or model change
respan.evaluate(
    target=contract_review_agent,
    dataset="contract-review-golden-v3",
    evals=[clause_rubric_eval, grounding_eval],
    group_by=["clause_type", "deal_type"],
)

The group_by is essential: aggregate metrics hide the truth. You want to see indemnification clause performance separately from governing law clause performance, because the failure modes and the fixes are different.
Production observability
In production, every contract review request produces a trace. The trace captures:
- Original document and parsed structure
- Each segmented clause with its boundaries
- Each clause's classification and retrieved playbook context
- The full prompt sent to the model (for each clause)
- The raw model response
- The parsed analysis (issues, risk, redline, rationale, citations)
- Citation grounding verification status
- Cross-clause conflict check results
- Final output presented to the lawyer
- Lawyer accept/reject decisions per suggestion
This is necessary for three reasons.
Debugging. When a lawyer flags a bad output, you can reconstruct exactly what happened: was the segmentation wrong, the classification wrong, the playbook retrieval wrong, the analysis prompt confused, or did the model just hallucinate?
Audit trail. ABA Formal Opinion 512 supervisory and candor obligations require lawyers to be able to explain how AI was used. Your tracing layer is the data backbone for that explanation.
Continuous eval. Production traces feed back into the eval set. The interesting traces (high disagreement between AI and lawyer, high-stakes deals, novel clause patterns) get annotated and become permanent eval cases.
In Respan:
import respan

@respan.workflow(
    name="contract-review",
    matter_id_param="matter_id",
    attorney_param="attorney_id",
)
def contract_review_agent(contract_path, playbook_id, matter_id, attorney_id):
    clauses = segment_contract(contract_path)
    analyses = []
    for clause in clauses:
        context = retrieve_playbook(clause, playbook_id)
        analysis = analyze_clause(clause, context)
        ground_citations(analysis)
        analyses.append(analysis)
    conflicts = consistency_check(analyses)
    return ContractReview(analyses=analyses, conflicts=conflicts)

The workflow decorator gives you the trace tree, the matter and attorney attribution, and the timing breakdown. You can query: which contracts took longest to analyze, which clause types had the most retries, which lawyers disagreed with the AI most often.
Cost and latency
A naive implementation of this architecture costs $5 to $10 per contract on Opus-class models, with latencies of 60 to 120 seconds for medium-complexity agreements. Both numbers can be reduced significantly with the right routing.
Per-clause model selection. As mentioned earlier: route NDAs and standard MSAs to Haiku, route M&A and credit agreements to Opus. A clause complexity classifier (small fine-tuned model or a heuristic based on length, defined-term density, and cross-reference count) decides per-clause. Drops average cost 5 to 10x.
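A heuristic version of that complexity score; the features mirror the ones listed above, and every threshold is an assumption to tune against your own annotated data:

import re

def clause_complexity(clause: Clause) -> str:
    """Rough complexity from length, defined-term density, and cross-reference count."""
    words = len(clause.text.split())
    defined_terms = len(re.findall(r'"[A-Z][A-Za-z ]+"', clause.text))
    cross_refs = len(re.findall(r"\bSection \d+(?:\.\d+)*", clause.text))
    score = words / 200 + defined_terms / 10 + cross_refs / 3
    if score > 2.0:
        return "high"    # route to an Opus-class model
    if score > 0.8:
        return "medium"  # route to a Sonnet-class model
    return "low"         # route to a Haiku-class model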
Batch parallelization. Per-clause analyses are independent and can run in parallel. The 60-second sequential review becomes 8 seconds parallel for a 30-clause contract. Watch for rate limits with your model provider.
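A sketch of the fan-out, assuming async variants of the retrieval and analysis calls exist (retrieve_playbook_async and analyze_clause_async are hypothetical); the semaphore keeps concurrency under provider rate limits:

import asyncio

async def analyze_all_clauses(clauses: list[Clause], playbook_id: str) -> list[ClauseAnalysis]:
    semaphore = asyncio.Semaphore(10)  # stay under the provider's rate limit

    async def analyze_one(clause: Clause) -> ClauseAnalysis:
        async with semaphore:
            context = await retrieve_playbook_async(clause, playbook_id)
            return await analyze_clause_async(clause, context)

    return await asyncio.gather(*(analyze_one(c) for c in clauses))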
Playbook caching. Playbook retrieval results are highly cacheable: the same playbook section is referenced thousands of times. A simple LRU cache on the retrieval layer cuts retrieval cost dramatically.
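Because the cache key is the playbook plus clause type rather than the clause text, a functools LRU cache on the lookup is enough; the function split is an assumption about how you factor retrieval:

from functools import lru_cache

@lru_cache(maxsize=2048)
def get_playbook_standard(playbook_id: str, clause_type: str):
    # Cached per (playbook, clause type); include a version in playbook_id so
    # playbook edits naturally miss the cache.
    return playbook_db.get(playbook_id, clause_type)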
Streaming for perceived latency. Even if total wall-clock is 30 seconds, streaming results clause-by-clause makes the UX feel responsive. The lawyer can start reviewing the first clauses while later clauses are still processing.
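Streaming falls out of the same fan-out: yield each clause's analysis as its task finishes instead of waiting for the full batch, a sketch under the same async assumptions as above:

import asyncio

async def stream_clause_analyses(clauses: list[Clause], playbook_id: str):
    async def analyze_one(clause: Clause) -> ClauseAnalysis:
        context = await retrieve_playbook_async(clause, playbook_id)
        return await analyze_clause_async(clause, context)

    tasks = [asyncio.create_task(analyze_one(c)) for c in clauses]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # the UI can render each clause as soon as it is ready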
# Routing pattern in the gateway
from respan.gateway import Gateway

gateway = Gateway(
    routing_rules=[
        ("nda", "anthropic/claude-haiku"),
        ("standard_msa", "anthropic/claude-haiku"),
        ("m_and_a_indemnification", "anthropic/claude-opus"),
        ("complex_credit_agreement", "anthropic/claude-opus"),
        ("default", "anthropic/claude-sonnet"),
    ],
    fallback_chain=["openai/gpt-5", "google/gemini-pro"],  # if primary unavailable
    zdr_only=True,
)

The fallback chain matters in production. When Anthropic has an incident, your contract review tool should keep working (after notifying the user that fallback is active). Hard-coding to a single provider is a single point of failure that sits uneasily with ABA 512 supervisory obligations.
What to build first
A four-week MVP plan that produces a defensible product:
Week 1. Document parsing and clause segmentation. Build the heuristic segmenter, validate on 20 contracts, measure error rate. Tracing layer in place from day one.
Week 2. Playbook ingestion. Build the structured clause library editor (yes, with a UI; engineers should not be the only ones who can update it). Seed the exemplar corpus with 50 to 100 design partner contracts.
Week 3. Per-clause analysis with citation grounding. End-to-end pipeline producing redlines for one clause type (indemnification is a good first target because the failure modes are visible).
Week 4. Eval suite. 100-contract golden dataset, lawyer-annotated. CI that runs the eval on every prompt change. Cross-clause consistency check is a stretch for week 4; if it does not fit, defer to week 5.
The clause-by-clause analysis with grounding is the heart. Cross-clause consistency is a differentiator that comes after. Eval has to be in place by the end of month one or you will ship regressions you do not catch.
What separates the good from the demo
After watching this category for two years, the differentiation between products that lawyers actually use and products that get demoed and abandoned reduces to a few things:
- Does the tool stay in the lawyer's existing surface (Microsoft Word for Spellbook, Outlook for some, the matter management system for others), or does it force a context switch?
- Are the redline suggestions specific enough to accept directly, or do they require rewriting?
- When a suggestion is wrong, can the lawyer see why the AI thought it was right (the trace)?
- Does the tool scale across the firm's playbook diversity (different practice groups, different deal types, different counterparty patterns), or does it only work for one shape of contract?
- Does it give the firm the audit trail it needs for ABA 512 compliance?
The architecture above supports all of these. Without them, your product becomes one of the dozens that demo well, pass security review, and get dropped in week three of the pilot.
How Respan fits
Contract review agents combine clause segmentation, playbook RAG, per-clause analysis, citation grounding, and cross-clause consistency checks into one pipeline. Respan is the substrate underneath all of it: tracing, evals, gateway routing, prompt registry, and monitors that match the way contract review actually breaks in production.
- Tracing: every clause analysis captured as one connected trace, from document parse through segmentation, playbook retrieval, per-clause prompt, citation grounding, and cross-clause check. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a lawyer flags a bad redline, you can reconstruct in seconds whether the segmenter cut the clause in half, the classifier picked the wrong clause type, or the model hallucinated a playbook citation.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on missed indemnification carveouts, fabricated playbook citations, wrong risk-level classifications, and bad redline text before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route NDAs and standard MSAs to Haiku, route M&A indemnification and complex credit agreements to Opus, and fail over to GPT-5 or Gemini when Anthropic has an incident so contract review keeps shipping.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The per-clause analysis prompt, the clause-type classifier few-shot prompt, the segmentation refinement prompt, and the LLM consistency check prompt all belong in the registry so lawyers and prompt engineers can iterate without a deploy.
- Monitors and alerts: citation grounding validation rate, lawyer accept/reject ratio per clause type, average cost per contract, p95 review latency, fallback chain activation count. Slack, email, PagerDuty, webhook. A drop in grounding rate or a spike in rejections on indemnification clauses pages whoever owns the playbook before the next pilot call.
A reasonable starter loop for contract review builders:
- Instrument every LLM call with Respan tracing including segmentation, clause classification, playbook retrieval, per-clause analysis, and citation validation spans.
- Pull 200 to 500 production clause analyses into a dataset and label them for issue recall, redline quality, risk-level correctness, and citation validity.
- Wire two or three evaluators that catch the failure modes you most fear (fabricated playbook citations, missed cross-clause conflicts between governing law and dispute resolution, over-aggressive risk flags on standard market language).
- Put your per-clause analysis and clause-type classifier prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so cheap clauses run on Haiku, complex M&A clauses run on Opus, and incidents at any one provider do not take your tool offline during a pilot.
Without this loop, you ship a demo that looks good in security review and gets dropped in week three because the redlines drift, the citations fabricate, and no one can explain to the partner why the AI flagged what it flagged.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- Why Legal AI Still Hallucinates Citations: the failure mode this architecture defends against
- Building a Citation Grounding Eval for Legal AI: eval methodology in depth
- ABA Formal Opinion 512 for Engineers: the compliance layer this architecture has to support
- How Legal AI Teams Build LLM Apps in 2026: pillar overview
Get the starter. Download the Contract Review Reference Architecture, a complete reference repository with clause segmentation, playbook RAG, citation grounding, and the full eval suite described above. To talk through your specific architecture decisions, book a call.
