If you are building an AI sourcing or screening product in 2026, the architecture is no longer a research question. Mercor, valued at $10 billion, matches 30,000+ contractors against thousands of project briefs daily. Eightfold scores against more than a billion candidate profiles. Paradox automates candidate engagement at McDonald's, Unilever, and General Motors. Maki People runs structured assessments across global enterprises. The patterns have converged.
The hard parts are not the LLM call or the embedding model. They are the things that determine whether the product survives audit, procurement, and litigation: defensible matching logic, calibrated scoring, demographic data isolation, end-to-end lineage, and continuous evaluation. A product built without these properties looks fine in demo and fails the moment a serious enterprise security review or a Mobley-style claim arrives.
This post walks through the architecture, identifies where teams typically cut corners, and lays out a 90-day build plan that produces something defensible. It assumes you have read the related posts on legal exposure (Eightfold FCRA), bias audits (LL 144 audit methodology), and evaluation (Recruiting LLM eval). Those define the requirements; this post is the build.
Architecture overview
A simplified view of the production architecture:
```
[Job description input]
           |
           v
[Job structure extraction]
           |
           v
[Candidate sourcing]             [Candidate inputs from ATS]
(search, scrape, recommend)      (resume, profile)
           |                                |
           +----------------+---------------+
                            |
                            v
            [Candidate parsing & enrichment]
                            |
                            v
          [Demographic data isolation barrier]
                            |
                            v
                    [Match scoring]
                            |
                            v
               [Ranking and surfacing]
                            |
                            v
         [Recruiter / hiring manager review]
                            |
                            v
             [Outcome capture from ATS]
                            |
                            v
        [Audit trail + continuous monitoring]
```
Each block is its own engineering subsystem. The hard parts cluster in three places: candidate parsing (get it wrong and everything downstream is wrong), match scoring (where the legal exposure lives), and the audit trail (the difference between defensible and indefensible).
Job structure extraction
The system needs a structured representation of the role to match candidates against. Free-form job descriptions are inadequate input for a matching system; they vary too much in quality and completeness across employers.
A working schema:
```yaml
job:
  job_id: <uuid>
  employer_id: <id>
  posting_date: <ISO date>
  required_qualifications:
    - description: <text>
      type: education | experience | skill | certification | language | other
      strict_required: <boolean>
  preferred_qualifications:
    - description: <text>
      type: <as above>
      weight: <float 0-1>
  responsibilities:
    - description: <text>
      domain: <e.g., "data engineering", "regulatory compliance">
  job_characteristics:
    role_family: <taxonomy entry>
    seniority: ic | senior | staff | principal | manager | director | vp | executive
    location_type: onsite | hybrid | remote
    location: <if onsite or hybrid>
    employment_type: full_time | contract | part_time
  legal_disclosures:
    - LL_144_notice_required: <boolean>
    - FCRA_disclosure_required: <boolean>
    - state_specific_disclosures: [<list>]
```

The job_characteristics field powers the LL 144 candidate notice (which must describe "the job qualifications and characteristics that the AEDT will use"). The legal_disclosures field captures jurisdiction-specific requirements that flow through to candidate notices.
Job structure extraction itself is an LLM call. Prompt the model with the raw job description, get back a structured object, validate against schema. Common failure: the LLM hallucinates qualifications that are not in the description. Validation requires checking that each extracted qualification has a textual basis in the input. Structured extraction with required source quotes catches most of these.
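A minimal sketch of that validation step, assuming the extraction prompt is instructed to return a source_quote alongside each qualification (a hypothetical field name, not part of the schema above):

```python
from dataclasses import dataclass

@dataclass
class ExtractedQualification:
    description: str
    qual_type: str
    source_quote: str  # exact span the LLM claims to have copied from the posting

def hallucinated_qualifications(
    raw_job_description: str,
    quals: list[ExtractedQualification],
) -> list[ExtractedQualification]:
    """Return qualifications whose claimed source quote does not appear in the
    input. Whitespace and case are normalized before comparison."""
    jd = " ".join(raw_job_description.lower().split())
    return [
        q for q in quals
        if " ".join(q.source_quote.lower().split()) not in jd
    ]
```

A non-empty return value fails validation and triggers a re-prompt or routes the job to human review.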
Candidate sourcing and parsing
Two ingestion paths:
Inbound applicants from the employer's ATS. The candidate has applied. You receive structured data (name, contact info, resume file) and unstructured data (resume content, cover letter). Parsing extracts skills, experience, education, certifications, and other features.
Outbound sourcing. The system searches public profiles (LinkedIn, GitHub, professional directories, public resume databases) and surfaces candidates who match the job. This is where the FCRA exposure concentrates: outbound sourcing aggregates information about candidates who never opted into the platform's evaluation. Eightfold's defense against the FCRA claim turns on whether their data sources qualify as "consumer information" under the statute.
For outbound sourcing, the architectural pattern that limits exposure: separate the sourcing index from the scoring system. The sourcing index returns candidates who match search terms; the scoring system processes only candidates whose data has been authorized through an employer-mediated flow (the candidate has applied, has been contacted and consented, or is part of an established consent regime). A platform that scores all sourced candidates and surfaces the top-N to employers without authorization is the structure the Eightfold complaint targets.
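One way to make that boundary concrete, assuming candidate records carry the source_authorization_status field from the candidate schema in the next section (a sketch, not a full consent implementation):

```python
AUTHORIZED = {"applied", "consented"}

def eligible_for_scoring(candidate: dict) -> bool:
    """Gate between the sourcing index and the scoring system: only candidates
    with an employer-mediated authorization are ever scored."""
    return candidate.get("source_authorization_status") in AUTHORIZED

def score_authorized(candidates: list[dict], scorer) -> list[dict]:
    # Candidates filtered here remain searchable in the sourcing index but are
    # never scored or surfaced with a score. Logging the filtering event is
    # itself useful evidence that the gate exists and runs.
    return [scorer(c) for c in candidates if eligible_for_scoring(c)]
```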
Candidate parsing schema
A structured representation of the candidate that mirrors the job schema:
```yaml
candidate:
  candidate_id: <uuid>
  source: ats | linkedin | other_external
  source_authorization_status: applied | consented | unknown
  ingest_timestamp: <ISO timestamp>
  identity:
    name: <text>          # NEVER passed to scoring
    contact_info: <text>  # NEVER passed to scoring
  parsed_qualifications:
    education:
      - institution: <text>
        degree: <text>
        field: <text>
        graduation_year: <int>
        verified: <boolean>
    experience:
      - employer: <text>
        title: <text>
        start_date: <ISO date>
        end_date: <ISO date or "current">
        responsibilities: [<list>]
        skills_demonstrated: [<list>]
    skills:
      - skill: <text>
        years_experience: <float>
        evidence_source: <reference to experience entry>
    certifications:
      - name: <text>
        issuer: <text>
        date_earned: <ISO date>
        verification_status: verified | unverified
    languages:
      - language: <text>
        proficiency: native | fluent | professional | conversational | basic
  features_for_scoring:
    # These are the inputs the scoring model actually sees
    # Demographic and identity data is NOT in this section
  audit_metadata:
    parsing_model_version: <id>
    parsing_timestamp: <ISO>
    raw_source_documents: [<refs>]
```

Two important properties of this schema:
Identity vs scoring separation. Name, contact info, and any other identity-revealing data flow into a separate identity section that the scoring model does not see. The scoring model sees features_for_scoring only. This is enforced at the access control layer, not by code review.
Provenance per parsed field. Every extracted skill links back to a specific evidence source in the experience or certifications. A skill that cannot be traced to a textual source is a hallucination and gets flagged. Hallucinated skills are a known LLM-parsing failure mode and a direct candidate harm.
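A sketch of the provenance check, under the assumption that each experience and certification entry is assigned an id at parse time (a hypothetical field) that skills reference via evidence_source:

```python
def unevidenced_skills(candidate: dict) -> list[str]:
    """Return skill names whose evidence_source does not resolve to an actual
    experience or certification entry; these are flagged, never scored."""
    pq = candidate["parsed_qualifications"]
    valid = {e["id"] for e in pq.get("experience", [])}
    valid |= {c["id"] for c in pq.get("certifications", [])}
    return [
        s["skill"]
        for s in pq.get("skills", [])
        if s.get("evidence_source") not in valid
    ]
```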
Demographic data isolation barrier
The single most important architectural boundary in the system. Demographic data (race, sex, age, disability, etc.) flows through one path; scoring inputs flow through another. The two paths never join until the audit and monitoring layer.
A working pattern:
```
ATS demographic data --> Audit-only data store
                                 |
                                 v
                        Bias audit pipeline
                        Continuous monitoring
                        (read access only here)

Scoring inputs (no demographic data) --> Scoring model
                                             |
                                             v
                                        Match scores
                                             |
                                             v
                                   Outputs to recruiters
```
Implementation requires:
- Separate database schemas for scoring inputs vs demographic data
- IAM access controls preventing scoring services from reading demographic schemas
- Code review and CI checks that flag any join between scoring and demographic data outside the audit layer
- Periodic access audits to verify the boundary holds
The reason this matters: a model that has read access to demographic data, even if it is "not used" in scoring, creates direct disparate treatment exposure. A model that physically cannot access demographic data has a much stronger defense.
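A minimal sketch of the CI check from the list above, assuming the platform runs on Postgres; the role and schema names are illustrative:

```python
import psycopg2  # adapt the check for your actual database or warehouse

def test_scoring_role_cannot_read_demographics():
    """CI check for the isolation barrier: connect with audit credentials and
    assert the scoring service's role holds no privilege on the demographic
    schema."""
    conn = psycopg2.connect("dbname=platform user=ci_auditor")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT has_schema_privilege(%s, %s, 'USAGE')",
                ("scoring_service", "demographics"),
            )
            assert cur.fetchone()[0] is False, (
                "scoring_service can reach the demographics schema; "
                "the isolation barrier is broken"
            )
    finally:
        conn.close()
```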
Match scoring
The scoring model takes the structured job and structured candidate and produces a match score with rationale. Several architectural patterns are in use.
Pattern A: LLM as primary scorer
A single LLM call takes the job and candidate as input and produces a structured score with rationale. Most common in newer products.
Pros. Fast to build. Adapts to varied roles without retraining. Produces natural-language rationale.
Cons. Scoring is non-deterministic unless temperature is pinned to 0, and even then provider model updates can shift outputs. Hard to attribute scores to specific features. Calibration is unreliable without post-hoc adjustment. Most expensive at scale.
Pattern B: Hybrid feature model with LLM rationale
Features are extracted programmatically (years of experience, education match, skill overlap) and combined via a learned scoring function (gradient boosted trees, calibrated logistic regression). An LLM produces the human-readable rationale separately, conditioned on the score and features.
Pros. Score is calibrated and reproducible. Feature attribution is explicit. Compliance documentation is straightforward. Scales cheaply.
Cons. Higher engineering investment. Less adaptable to roles outside the trained taxonomy. Requires labeled data for training the scoring function.
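A minimal sketch of the Pattern B core using scikit-learn, with placeholder synthetic data standing in for real features and labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV

# Placeholder feature matrix: one row per (job, candidate) pair, columns like
# years-of-experience delta, education match, skill-overlap ratio.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # e.g. advanced-to-interview

# Gradient boosted trees plus isotonic calibration, so the output behaves like
# a probability and can populate calibrated_probability in the scoring schema.
scorer = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
scorer.fit(X, y)

match_probability = scorer.predict_proba(rng.normal(size=(5, 3)))[:, 1]
```

The score is reproducible, the feature attributions fall out of the tree model, and the LLM's only job is generating rationale text conditioned on score and features.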
Pattern C: LLM as judge of feature model
A feature model produces the score and feature attributions. An LLM reviews the scoring against the candidate's full profile and either confirms or flags. The LLM's flag triggers human review.
Pros. Combines defensibility of feature model with LLM's broader reasoning. Catches edge cases the feature model misses.
Cons. Most expensive. Two systems to maintain. Useful primarily for high-stakes roles.
Choosing among patterns
| If your product is | Use pattern |
|---|---|
| Net-new, building broad role coverage fast | A (LLM primary), with calibration layer |
| Mature, with labeled data and high-volume specific roles | B (Hybrid) |
| High-stakes roles (executive search, regulated industries) | C (LLM judge of feature model) |
| Multi-tenant SaaS, varied employer needs | B with optional A overlay per tenant |
Pattern A is most common in 2026; Pattern B is what mature vendors converge to as they accumulate labeled data; Pattern C appears in specific high-value verticals.
Scoring output schema
Regardless of pattern, the output is structured:
```yaml
match_score:
  score: <float 0-1 or 0-100>
  calibrated_probability: <float 0-1>
  rationale:
    overall_summary: <text>
    matched_qualifications:
      - job_requirement: <reference to job schema>
        candidate_evidence: <reference to candidate schema>
        strength: high | medium | low
    gaps:
      - job_requirement: <reference>
        gap_description: <text>
        severity: blocker | concern | minor
    notable_strengths:
      - description: <text>
        evidence: <reference>
  scoring_metadata:
    model_version: <id>
    feature_attributions: <map of feature name to contribution>
    confidence: <float 0-1>
  audit_metadata:
    timestamp: <ISO>
    job_version: <id>
    candidate_version: <id>
```

Every claim in the rationale references either a job requirement or a candidate evidence point. Hallucinated rationale (claims that do not map to evidence) fails post-processing validation. Feature attributions are present whether the model is feature-based or LLM-based; for LLM-based scoring, attributions are computed via prompt-level ablations or learned attribution methods.
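A sketch of that post-processing validation, reusing the assumption that job requirements and candidate evidence entries carry parse-time ids (hypothetical fields) that rationale claims reference:

```python
def ungrounded_claims(match_score: dict, job: dict, candidate: dict) -> list[dict]:
    """Return rationale claims whose references do not resolve to a real job
    requirement or candidate evidence entry. Non-empty means the rationale is
    hallucinated and the score is blocked from surfacing."""
    req_ids = {
        q["id"]
        for q in job.get("required_qualifications", [])
        + job.get("preferred_qualifications", [])
    }
    pq = candidate["parsed_qualifications"]
    evidence_ids = {
        e["id"]
        for section in ("experience", "education", "certifications")
        for e in pq.get(section, [])
    }
    return [
        c for c in match_score["rationale"]["matched_qualifications"]
        if c["job_requirement"] not in req_ids
        or c["candidate_evidence"] not in evidence_ids
    ]
```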
Ranking and surfacing
The scoring model produces a score per (job, candidate) pair. Ranking surfaces the top-K candidates per job.
Several design choices have legal and operational implications:
Surface top-N or surface all with scores? Top-N filtering removes candidates from human review. Mobley v. Workday hinged partly on whether Workday's recommendations functionally filtered candidates from consideration. A platform that surfaces all qualified candidates with confidence labels has a stronger "we provide a tool, the human decides" defense than one that filters.
Threshold per role or per recruiter preference? Recruiters often want to set their own thresholds ("only show me 4-stars and up"). The split between platform-set defaults and recruiter-customized thresholds affects audit results: if every recruiter uses a different threshold, selection-rate computation gets messier. Logging the configured threshold for every evaluation is what makes the audit possible.
De-duplication and dedup signals. Candidates often appear in multiple sourcing channels with slightly different data. The dedup logic needs to be consistent and traceable; auditors will ask why the same candidate appears in multiple records and how it was resolved.
Pagination and surfacing rules. Many platforms surface only the first page (top 10-20) by default. The candidates beyond the first page are theoretically available but practically invisible. The implicit selection rate is the rate of being on page one, not the rate of being in the database.
Audit trail
The audit trail is what makes the rest of the system defensible. Every evaluation produces a record:
```yaml
evaluation_record:
  evaluation_id: <uuid>
  timestamp: <ISO>
  inputs:
    job_id: <reference>
    job_version: <id>
    candidate_id: <reference>
    candidate_version: <id>
    employer_id: <reference>
    recruiter_id: <if applicable>
  processing:
    model_versions:
      - parser: <version>
      - scorer: <version>
      - rationale_generator: <version>
    feature_values: <map>
    score_components: <map>
  outputs:
    score: <float>
    rationale: <reference to stored rationale>
    surfaced_to_recruiter: <boolean>
    surfaced_position: <int if surfaced>
  downstream_actions:
    recruiter_action: clicked | reviewed | advanced | rejected | none
    employer_action: hired | offered | passed | none
    candidate_action: withdrew | accepted | declined | none
  legal_metadata:
    fcra_disclosure_sent: <boolean, timestamp>
    candidate_authorization: <boolean, reference>
    ll_144_notice_provided: <boolean, timestamp>
  retention:
    expires_at: <ISO>
    legal_hold: <boolean>
```

These records support:
- LL 144 annual audits (selection rate computation per group)
- FCRA dispute response (showing the candidate exactly what data contributed to their score)
- Mobley-style discovery (reconstructing how disparate impact arose)
- Internal evaluation and continuous monitoring
- Customer-facing reporting on platform usage
Storage requirements scale with evaluation volume. A platform processing a million evaluations per day generates roughly 1-10 GB of trace data daily depending on richness. Cold storage (S3, etc.) for older records, hot storage (queryable database) for recent records, with retention rules that satisfy the longest applicable regulatory period.
Continuous evaluation and monitoring
Continuous monitoring closes the loop. It runs on the audit trail data and produces:
| Metric | Frequency | Alert threshold |
|---|---|---|
| Selection rate per protected group | Weekly | Impact ratio below 0.80 |
| Scoring rate above median per group | Weekly | Impact ratio below 0.80 |
| Calibration ECE | Weekly | ECE above 5% |
| Calibration ECE per group | Weekly | Group-specific ECE > 8% |
| Score distribution by demographic | Weekly | Significant distributional shift |
| Hallucinated rationale rate | Daily | Rate above 1% |
| Adversarial test suite pass rate | Daily | Any test fails |
| Recruiter override rate | Weekly | Significant change from baseline |
| Hire rate per surfaced candidate | Monthly | Significant drop |
Alerts route to engineering and ML teams with documented investigation playbooks. The discipline of "every alert investigated, every finding documented, every documented finding remediated or explicitly accepted" is what separates serious operations from compliance theater.
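Sketches of the two most load-bearing metrics from the table, impact ratio (the four-fifths rule) and ECE (expected calibration error), computed over arrays pulled from the evaluation records:

```python
import numpy as np

def impact_ratios(selected: np.ndarray, group: np.ndarray) -> dict:
    """Selection rate per group divided by the highest group's rate.
    Any value below 0.80 trips the four-fifths-rule alert."""
    rates = {g: float(selected[group == g].mean()) for g in np.unique(group)}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}

def expected_calibration_error(
    probs: np.ndarray, outcomes: np.ndarray, bins: int = 10
) -> float:
    """Binned ECE: |mean predicted probability - observed outcome rate| per bin,
    weighted by the fraction of evaluations that land in the bin."""
    bin_ids = np.minimum((probs * bins).astype(int), bins - 1)
    ece = 0.0
    for b in range(bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)
```

Note that the demographic group labels here live in the audit-only store; this code runs inside the monitoring layer, the one place where the two data paths are allowed to join.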
Build order
Each layer of an AI sourcing and screening agent compounds on the integrity of the layer beneath it. Skip a step or ship it half-built and every downstream eval inherits the defect.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Audit substrate: tool versioning, evaluation record schema, demographic isolation barrier enforced at IAM | 100% of LLM calls produce a versioned evaluation record; CI test confirms scoring services cannot read demographic schemas |
| 2 | Job structure extraction with required source quotes per qualification | Less than 1% hallucinated qualifications on a 100-job gold set; schema validation passes on 100% of outputs |
| 3 | Candidate parsing and enrichment with provenance per field, identity vs scoring separation | Less than 1% hallucinated skills on a 200-resume gold set; zero name or contact fields reachable from features_for_scoring |
| 4 | Match scoring (Pattern A, B, or C) with feature attributions and grounded rationale | Calibration ECE below 5% on held-out set; 100% of rationale claims trace to a job requirement or candidate evidence reference |
| 5 | Ranking, surfacing, and ATS outcome capture with FCRA pre-adverse workflow | Recruiter UI exposes ranks beyond top-N; ATS webhooks capture hire, offer, pass, withdraw on 95%+ of surfaced candidates |
| 6 | Continuous bias monitoring and external audit tooling | Weekly impact ratio computed per protected group with alert at 0.80; LL 144 audit export reproduces selection rates from raw evaluation records |
After step 6 you book the first annual external audit and expand the adversarial test suite as new attack patterns emerge. Skip the order and you spend the audit window backfilling provenance and demographic isolation under deadline pressure instead of shipping.
Common cuts that cost more later
Patterns that show up across teams that ship fast and remediate slowly:
No tool versioning. Every model and prompt change ships without explicit versioning. The team knows roughly what is in production but cannot reconstruct historical state. The first audit reveals this and forces a backfill that takes weeks.
Demographic data accessible to scoring. Feature engineering or training data includes fields that proxy for protected class. The team "knows" not to use them but the access controls do not enforce it. Disparate treatment exposure is direct.
Outcome data not captured. ATS webhooks not implemented; outcomes are inferred from indirect signals. Audit can compute AEDT recommendations but not actual selection rates. Audit value is reduced.
Top-N filtering by default. Recruiter UI shows only top 10 by default with no easy way to see ranks 11+. The platform is functionally filtering candidates from review. Mobley-style exposure compounds.
No FCRA pre-adverse workflow. When employers use AI scores to reject candidates, no pre-adverse notification flows. If the Eightfold theory holds in court, this becomes a per-rejection FCRA violation.
Compliance theater audit. A self-published bias audit, or an audit by an affiliated party, that does not satisfy LL 144 independence requirements. Marketed as compliance, fails the law.
Eval set frozen at launch. Evaluation runs against the same 200 cases that were assembled in month one. Production behavior diverges from eval behavior. Drift goes undetected.
The teams that avoid these patterns do so by building the foundation first (tool versioning, evaluation records, demographic isolation) and then building features on top. The teams that ship features first end up rebuilding the foundation later under audit pressure.
How Respan fits
Building a defensible AI sourcing and screening agent (parsing, enrichment, matching, ranking, audit trail) is fundamentally an observability and governance problem, and Respan is the substrate that holds the pieces together. The platform wires tracing, evals, gateway, prompt management, and monitors into a single loop that maps directly to LL 144, FCRA, and Mobley-style defensibility.
- Tracing: every (job, candidate) evaluation captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a candidate disputes a score under FCRA or an auditor reconstructs how a disparate impact arose, the full lineage from job structure extraction through parsing, scoring, and surfacing is one query away.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated qualifications, ungrounded rationale claims, calibration drift, and selection-rate impact ratios falling below 0.80 before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Routing job extraction, parsing, and scoring calls through a single gateway gives you per-tenant cost ceilings and a uniform place to log model versions for the evaluation record schema.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Job structure extraction prompts, candidate parsing prompts, match-scoring prompts, and rationale generators all belong in the registry so every change is auditable and reversible without a deploy.
- Monitors and alerts: selection rate per protected group, calibration ECE, hallucinated rationale rate, adversarial test suite pass rate, and recruiter override rate. Slack, email, PagerDuty, webhook. The "every alert investigated, every finding documented" discipline that separates serious operations from compliance theater is enforced by the monitoring layer, not memory.
A reasonable starter loop for AI sourcing and screening builders:
- Instrument every LLM call with Respan tracing including job extraction, parsing, scoring, and rationale spans.
- Pull 200 to 500 production (job, candidate) evaluations into a dataset and label them for match accuracy, rationale groundedness, and demographic isolation integrity.
- Wire two or three evaluators that catch the failure modes you most fear (hallucinated qualifications, ungrounded rationale claims, impact ratio falling below 0.80 per protected group).
- Put your job extraction, parsing, and scoring prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model versions, costs, and fallback behavior are uniform across tenants and logged into every evaluation record.
Skip this loop and the first serious enterprise security review, LL 144 audit, or Mobley-style discovery request becomes a multi-week firefight against a system you cannot reconstruct.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The Eightfold FCRA Lawsuit and What Algorithmic Hiring Engineers Need to Ship Now: the legal regime
- Building Bias Audits for AI Recruiting: annual external audit methodology
- Evaluating Recruiting LLMs: four-dimension evaluation framework
- How HR Tech Teams Build LLM Apps in 2026: pillar overview
