Real estate is unforgiving on factual accuracy. An LLM hallucinating "2,400 square feet" when the MLS says 2,100 does not just embarrass the listing agent. It exposes the brokerage to misrepresentation claims and can void a transaction. A copilot that recommends comps from the wrong submarket misprices the seller's expectations and costs them weeks. A search experience that hallucinates "open concept" on a closed-floor-plan listing erodes trust the moment the buyer walks through the door.
This piece is for engineers building real estate AI products who need to drive property-fact hallucination rates down. It covers the failure modes that show up in production, why they are structurally harder than general-domain hallucinations, and the six engineering fixes that close the gap.
For the wider Real Estate cluster, see the pillar, the Fair Housing compliance spoke, the agent copilot build walkthrough, and the eval spoke.
The failure modes that show up in production
Five property-fact hallucination patterns recur across real estate AI products.
Square footage drift. The model emits a square footage number that does not match the MLS, the assessor record, or the floor plan. The number sounds plausible (it falls in the realistic range for the property type), so casual review misses it.
Year-built and renovation claims. The model claims "renovated in 2022" when the MLS only shows "remodeled kitchen" with no year. Or claims "built in 1985" when the assessor shows 1987. These small errors compound into liability when buyers rely on them.
Feature hallucination. "Open concept floor plan" on a closed layout. "Luxury finishes" on a rental-grade kitchen. "Stainless steel appliances" when the photos show white. The model fills in feature claims based on training data priors rather than the actual listing data.
Comp drift. The model selects comps from the wrong submarket, the wrong property type, or the wrong time window. A 2,400 sqft single-family in a particular zip code should not be compared to a 2,400 sqft townhome two zip codes over, but a naive retrieval pipeline will return both as similar.
AVM explanation drift. The structured AVM model produces a defensible estimate, but the LLM explanation overstates the confidence or extrapolates beyond what the comps support. "This home is worth $X because of recent improvements" when no improvements are documented.
Why this is structurally hard
Five reasons real estate hallucinations are harder to suppress than general-domain ones.
The data live in many sources. MLS, county assessor, tax records, flood maps, school boundaries, HOA filings, permit records. Each source has a different schema, freshness, and reliability. A naive RAG pipeline that pulls from one source misses the ground truth in another.
Listing data is incomplete. MLS records do not capture every property attribute. The agent describes "open kitchen" in the remarks; the structured fields just say "kitchen." Generative AI fills the gap by inference, which is exactly where hallucinations enter.
The training data is biased toward marketing copy. Most LLM training data on real estate is Zillow descriptions, Redfin marketing, and listing descriptions written to sell. The model has learned that "luxury finishes" and "open concept" are common phrases, regardless of whether they apply to the specific property.
Comp similarity is not embedding similarity. Two homes can have similar text descriptions and embedding distance and still be poor comps because of submarket boundaries, school districts, lot characteristics, or condition differences that are not captured in the listing text.
Time-series matters. A comp from six months ago in a fast-moving market is stale. An AVM trained on pre-rate-hike data is wrong post-rate-hike. The model's training cutoff and the user's current market are not the same.
Six engineering fixes
The five constraints above mean you cannot solve real estate hallucinations at any single layer. The fixes below are defense in depth.
1. MLS-grounded retrieval with citation enforcement
Every property fact in your output must trace to a specific MLS field, county record, or other authoritative source. Free-form generation without this grounding produces fluent fabrications.
Implementation:
- Structured retrieval first. For any property-specific question, retrieve the canonical structured record (MLS, assessor) before any LLM call.
- Citation-required generation. Prompt the model to emit citations for every fact, e.g. {"fact": "2,100 sqft", "source": "mls.field.living_area"}.
- Post-generation validation. Validate that every cited source resolves to a real field with the cited value. Drop or flag claims that fail.
```python
import os

from respan import Respan

client = Respan(api_key=os.environ["RESPAN_API_KEY"])

@client.workflow(name="property-description")
def generate_description(listing_id):
    mls = mls_client.get(listing_id)
    assessor = assessor_client.get(mls.parcel_id)
    response = client.chat.completions.create(
        model="auto",
        messages=build_grounded_prompt(mls, assessor),
        response_format={"type": "json_schema", "schema": GroundedDescription.schema()},
    )
    # Validate that every cited fact resolves to the cited source value
    for claim in response.claims:
        source = resolve_source(claim["source"], mls, assessor)
        if source is None or source != claim["fact"]:
            claim["faithfulness"] = "fail"
    return response
```
2. Photo grounding for visual claims
For feature claims that are visual ("stainless appliances", "hardwood floors"), use a vision model to verify against the listing photos. The structured MLS fields rarely capture features at this granularity, so vision is the verification layer.
The pattern: feature claims in the description must be either (a) tied to a structured MLS field or (b) verified against an annotated photo region with reasonable confidence. Claims that fail both checks get flagged.
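This two-path rule can be sketched as a small gate. The `mls_fields` dict, the `photo_annotations` list, and the 0.8 confidence threshold below are illustrative stand-ins for your structured record and your vision model's detections, not a real API:

```python
VISION_CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune against your eval set

def verify_feature_claim(claim, mls_fields, photo_annotations):
    """Return 'mls', 'photo', or 'flagged' for a feature claim.

    claim: canonical feature key, e.g. "stainless_appliances"
    mls_fields: dict of structured MLS fields
    photo_annotations: list of {"feature": str, "confidence": float}
        produced by a vision model run over the listing photos
    """
    # Path (a): the claim maps to a structured MLS field
    if mls_fields.get(claim):
        return "mls"
    # Path (b): the claim is verified in an annotated photo region
    for ann in photo_annotations:
        if ann["feature"] == claim and ann["confidence"] >= VISION_CONFIDENCE_THRESHOLD:
            return "photo"
    # Neither source supports the claim: flag for removal or review
    return "flagged"
```

Claims returning "flagged" are the ones your post-generation layer strips from the description or routes to agent review.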
3. Comp retrieval with explicit submarket boundaries
Naive embedding-similarity retrieval over MLS sold listings produces bad comps. The fix is hybrid retrieval with hard filters:
- Geographic filter. Same zip code, then same school district, then same neighborhood polygon.
- Property-type filter. Single-family does not comp to townhome to condo, even at similar square footage.
- Time-window filter. Comps from the last 90 days for fast-moving markets, 180 for stable ones, and 365 only when the filters above return fewer than five results.
- Condition filter. Renovated does not comp to original-condition, even at similar size and age.
- Distance reranking. Among candidates passing the filters, sort by physical distance and structural similarity.
The LLM never picks comps from a vector search blindly. The pipeline picks the comps; the LLM explains them.
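A minimal sketch of the filter-then-rerank pipeline, assuming a flattened `SoldListing` record. The field names, the zip-code-only geographic filter, and the two-condition taxonomy are simplifications standing in for your real schema, neighborhood polygons, and condition scoring:

```python
from dataclasses import dataclass

@dataclass
class SoldListing:
    zip_code: str
    property_type: str   # e.g. "sfh", "townhome", "condo"
    condition: str       # e.g. "renovated", "original"
    days_since_sale: int
    distance_miles: float
    sqft: int

def retrieve_comps(subject, candidates, time_window_days=90, k=5):
    """Apply hard filters first, then rerank survivors.

    The LLM never sees a candidate that fails a filter."""
    passing = [
        c for c in candidates
        if c.zip_code == subject.zip_code              # geographic filter
        and c.property_type == subject.property_type   # property-type filter
        and c.condition == subject.condition           # condition filter
        and c.days_since_sale <= time_window_days      # time-window filter
    ]
    # Distance reranking: physical distance, then structural similarity
    passing.sort(key=lambda c: (c.distance_miles, abs(c.sqft - subject.sqft)))
    return passing[:k]
```

The point of the shape: filters are set membership, not similarity scores, so a townhome two zip codes over can never leak into the comp set no matter how close its embedding is.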
4. AVM grounding with confidence bands
The valuation comes from the statistical AVM, not the LLM. The LLM's job is to translate the AVM output into a human-readable explanation with appropriate hedging.
```python
def generate_valuation_explanation(listing_id):
    avm = avm_model.predict(listing_id)
    comps = retrieve_comps(listing_id, k=10)
    explanation = client.chat.completions.create(
        model="auto",
        messages=[
            {"role": "system", "content": GROUNDED_VALUATION_PROMPT},
            {"role": "user", "content": f"""
AVM estimate: ${avm.estimate:,}
Confidence band (80%): ${avm.lower:,} to ${avm.upper:,}
Comps: {format_comps(comps)}

Explain the estimate. Do not assert any fact not in the comps.
Always present the confidence band, not just the point estimate.
"""},
        ],
    )
    return explanation
```
The system prompt explicitly forbids extrapolating beyond the comp data. The temperature is low. The output is structured to surface the confidence band prominently.
5. Refusal on insufficient data
Real estate questions where the data does not support a confident answer should produce abstentions, not hallucinations. "I don't have enough recent comps to estimate this property's value with reasonable confidence" is the right output when the comp set is too thin.
The OpenAI 2025 hallucination paper applies here too: as long as your eval rewards completion over abstention, your model learns to fill gaps with fabrications. Train your eval to reward calibrated abstention.
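One way to encode the abstention rule is a gate in front of the explanation step. The thresholds on comp count and band width below are hypothetical; tune them against your calibrated-abstention eval:

```python
MIN_COMPS = 3          # assumed floor; tune per market
MAX_BAND_RATIO = 0.25  # refuse if the 80% band exceeds 25% of the estimate

def valuation_or_abstain(avm_estimate, band_lower, band_upper, comp_count):
    """Answer only when the data supports one; otherwise abstain."""
    band_ratio = (band_upper - band_lower) / avm_estimate
    if comp_count < MIN_COMPS or band_ratio > MAX_BAND_RATIO:
        return {
            "abstain": True,
            "message": ("I don't have enough recent comps to estimate this "
                        "property's value with reasonable confidence."),
        }
    return {"abstain": False, "estimate": avm_estimate,
            "band": (band_lower, band_upper)}
```

Because the gate runs before any LLM call, a thin comp set can never reach the generation step that would otherwise paper over it.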
6. Continuous eval capture from agent overrides
Every time an agent edits an AI-generated description, AVM explanation, or comp recommendation, that becomes a labeled datum. Edits are potential hallucinations, omissions, or stylistic preferences. Tag them and feed them into the regression eval set.
Real estate has a particularly high override rate because agents are licensed and liable for the marketing they put their name on. That high override rate is the gold mine for eval; mine it systematically.
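A sketch of the capture step, assuming agent edits arrive as before/after text tied to a trace ID. The label taxonomy and dataset routing are illustrative, not a fixed schema:

```python
import difflib

def capture_override(trace_id, ai_text, agent_text, tag):
    """Turn an agent edit into a labeled eval datum.

    tag: triage label for the edit, e.g. 'hallucination',
    'omission', or 'style'."""
    diff = list(difflib.unified_diff(
        ai_text.splitlines(), agent_text.splitlines(), lineterm=""))
    return {
        "trace_id": trace_id,
        "ai_output": ai_text,
        "agent_final": agent_text,
        "diff": diff,
        "label": tag,
        # Only factual labels feed the hallucination regression set;
        # style edits route to a separate tone/voice eval.
        "eval_set": "hallucination_regression" if tag != "style" else "style",
    }
```

Even without triage labels, the raw diffs are worth storing: a burst of edits on one template is an early signal that a prompt or retrieval change regressed.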
A reference architecture
The smallest defensible real estate AI build:
[User query or trigger]
|
v
[Property identifier resolution]
|
v
[MLS + assessor + tax + permit retrieval]
|
v
[Vision verification on photos]
|
v
[Comp retrieval with hard filters]
|
v
[AVM model (statistical, not LLM)]
|
v
[Grounded LLM generation with citation requirement]
|
v
[Faithfulness check: every fact resolves to source]
|
v
[Output with confidence band, not point estimate]
|
v
[Agent review + edit tracking]
|
v
[Continuous eval capture]
What to ship and in what order
A staged rollout:
- Week 1. MLS-grounded retrieval with citation requirement. Wire faithfulness check at the workflow level.
- Week 2. Build the grounded eval set. 100 properties annotated by a licensed agent, including edge cases (vacant land, condos, mixed-use). The week most often skipped.
- Week 3. Comp retrieval pipeline with hard filters, replacing pure embedding-similarity.
- Week 4. Photo grounding for visual claims. Post-generation flagging on unverified feature claims.
Consistency checks across multi-property comparisons are a stretch goal for week 5.
How Respan fits
Real estate AI hallucinations get caught by the layer of telemetry, evals, and grounding controls sitting around the model. Respan gives you that layer without rebuilding it from scratch.
- Tracing: every property-fact generation captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. MLS retrievals, assessor lookups, vision verifications, comp filters, AVM calls, and the final LLM generation all show up as spans on a single timeline so you can see exactly where a hallucinated square footage or fabricated feature claim entered the pipeline.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on square footage drift, feature hallucinations, comp submarket leaks, and AVM overconfidence before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Route description generation to a cheaper model, AVM explanations to a stronger one, and fall back automatically when a provider rate-limits during peak listing hours.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. The grounded-generation prompt and the AVM hedging prompt live as versioned artifacts your compliance reviewer can sign off on, not strings buried in code.
- Monitors and alerts: faithfulness pass rate, citation resolve rate, comp-filter rejection rate, refusal rate on thin comp sets, agent override rate per template. Slack, email, PagerDuty, webhook. Catch a regression in square footage accuracy or comp drift the same hour it lands in production.
A reasonable starter loop for real estate AI builders:
- Instrument every LLM call with Respan tracing including MLS retrieval, assessor lookup, vision verification, and comp filter spans.
- Pull 200 to 500 production property descriptions and AVM explanations into a dataset and label them for factual accuracy, citation grounding, and submarket appropriateness.
- Wire two or three evaluators that catch the failure modes you most fear (square footage drift, feature hallucinations on closed-floor-plan listings, comp submarket leakage).
- Put your grounded-generation and AVM-explanation prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so vision verification, description generation, and explanation generation can each pick the right model and fall back cleanly under load.
Done in this order, the hallucination rate moves from a thing you hope is low to a number you watch on a dashboard.
CTA
To wire the fixes above on Respan, start tracing for free, read the docs, or talk to us. For the rest of the Real Estate cluster: the pillar, the Fair Housing compliance spoke, the agent copilot build walkthrough, and the eval spoke.
FAQ
Why do real estate AIs hallucinate "open concept" so often? Marketing-copy training data overweights phrases that sell. The model has learned that "open concept" is a common feature claim, regardless of the specific listing. The fix is structural: require feature claims to map to MLS fields or annotated photo regions, not free-form generation.
Should the LLM produce the AVM estimate or just explain it? Just explain it. The valuation comes from a statistical AVM trained on historical sales. The LLM's job is to make the math legible with appropriate hedging. Letting the LLM produce the estimate creates wildly variable outputs and exposes you to liability.
How do I verify visual feature claims like "stainless appliances"? Vision model verification on the listing photos. Annotated photo regions with reasonable confidence become a second source of truth alongside the MLS structured fields. Claims that fail both verifications get flagged.
What's the right comp time window? Depends on the market. Fast-moving (rapid appreciation, low inventory) wants 60-90 days. Stable wants 180. Slow wants 365. Build the time window into the comp retrieval filter, not into the LLM prompt.
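Pushing the window into the retrieval filter can be as small as one function; the velocity labels and cutoffs here are assumptions to calibrate against your own sale-pace data:

```python
def comp_time_window_days(market_velocity):
    """Map market velocity to a comp lookback window in days."""
    if market_velocity == "fast":    # rapid appreciation, low inventory
        return 90
    if market_velocity == "stable":
        return 180
    return 365                       # slow markets need the longer lookback
```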
Can I retrain a foundation model on MLS data to reduce hallucinations? Probably not legally, depending on your MLS license. Major MLSes added AI-specific clauses to participation agreements in 2024-2026, and most prohibit using MLS data for foundation-model training without a separate license. Verify before you train.
