If you are building an AI sourcing or screening product in 2026, the architecture is no longer a research question. Mercor, valued at $10 billion, matches 30,000+ contractors against thousands of project briefs daily. Eightfold scores against more than a billion candidate profiles. Paradox automates candidate engagement at McDonald's, Unilever, and General Motors. Maki People runs structured assessments across global enterprises. The patterns have converged.
The hard parts are not the LLM call or the embedding model. They are the things that determine whether the product survives audit, procurement, and litigation: defensible matching logic, calibrated scoring, demographic data isolation, end-to-end lineage, and continuous evaluation. A product built without these properties looks fine in demo and fails the moment a serious enterprise security review or a Mobley-style claim arrives.
This post walks through the architecture, identifies where teams typically cut corners, and lays out a 90-day build plan that produces something defensible. It assumes you have read the related posts on legal exposure (Eightfold FCRA), bias audits (LL 144 audit methodology), and evaluation (Recruiting LLM eval). Those define the requirements; this post is the build.
Architecture overview
A simplified view of the production architecture:
```
[Job description input]
           |
           v
[Job structure extraction]
           |
           v
[Candidate sourcing]             [Candidate inputs from ATS]
(search, scrape, recommend)      (resume, profile)
           |                                |
           +----------------+---------------+
                            |
                            v
            [Candidate parsing & enrichment]
                            |
                            v
          [Demographic data isolation barrier]
                            |
                            v
                    [Match scoring]
                            |
                            v
               [Ranking and surfacing]
                            |
                            v
         [Recruiter / hiring manager review]
                            |
                            v
             [Outcome capture from ATS]
                            |
                            v
        [Audit trail + continuous monitoring]
```
Each block is its own engineering subsystem. The hard parts cluster in three places: candidate parsing (get it wrong and everything downstream is wrong), match scoring (where the legal exposure lives), and the audit trail (the difference between defensible and indefensible).
Job structure extraction
The system needs a structured representation of the role to match candidates against. Free-form job descriptions are inadequate input for a matching system; they vary too much in quality and completeness across employers.
A working schema:
```yaml
job:
  job_id: <uuid>
  employer_id: <id>
  posting_date: <ISO date>
  required_qualifications:
    - description: <text>
      type: education | experience | skill | certification | language | other
      strict_required: <boolean>
  preferred_qualifications:
    - description: <text>
      type: <as above>
      weight: <float 0-1>
  responsibilities:
    - description: <text>
      domain: <e.g., "data engineering", "regulatory compliance">
  job_characteristics:
    role_family: <taxonomy entry>
    seniority: ic | senior | staff | principal | manager | director | vp | executive
    location_type: onsite | hybrid | remote
    location: <if onsite or hybrid>
    employment_type: full_time | contract | part_time
  legal_disclosures:
    - LL_144_notice_required: <boolean>
    - FCRA_disclosure_required: <boolean>
    - state_specific_disclosures: [<list>]
```

The job_characteristics field powers the LL 144 candidate notice (which must describe "the job qualifications and characteristics that the AEDT will use"). The legal_disclosures field captures jurisdiction-specific requirements that flow through to candidate notices.
Job structure extraction itself is an LLM call. Prompt the model with the raw job description, get back a structured object, validate against schema. Common failure: the LLM hallucinates qualifications that are not in the description. Validation requires checking that each extracted qualification has a textual basis in the input. Structured extraction with required source quotes catches most of these.
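A minimal sketch of that validation step, assuming the extraction prompt is instructed to return a source_quote alongside each qualification (a hypothetical field name, not part of the schema above):

```python
from dataclasses import dataclass

@dataclass
class ExtractedQualification:
    description: str
    qual_type: str
    source_quote: str  # exact span the LLM claims to have copied from the posting

def hallucinated_qualifications(
    raw_job_description: str,
    quals: list[ExtractedQualification],
) -> list[ExtractedQualification]:
    """Return qualifications whose claimed source quote does not appear in the
    input. Whitespace and case are normalized before comparison."""
    jd = " ".join(raw_job_description.lower().split())
    return [
        q for q in quals
        if " ".join(q.source_quote.lower().split()) not in jd
    ]
```

A non-empty return value fails validation and triggers a re-prompt or routes the job to human review.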
Candidate sourcing and parsing
Two ingestion paths:
Inbound applicants from the employer's ATS. The candidate has applied. You receive structured data (name, contact info, resume file) and unstructured data (resume content, cover letter). Parsing extracts skills, experience, education, certifications, and other features.
Outbound sourcing. The system searches public profiles (LinkedIn, GitHub, professional directories, public resume databases) and surfaces candidates who match the job. This is where the FCRA exposure concentrates: outbound sourcing aggregates information about candidates who never opted into the platform's evaluation. Eightfold's defense against the FCRA claim turns on whether their data sources qualify as "consumer information" under the statute.
For outbound sourcing, the architectural pattern that limits exposure: separate the sourcing index from the scoring system. The sourcing index returns candidates who match search terms; the scoring system processes only candidates whose data has been authorized through an employer-mediated flow (the candidate has applied, has been contacted and consented, or is part of an established consent regime). A platform that scores all sourced candidates and surfaces the top-N to employers without authorization is the structure the Eightfold complaint targets.
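One way to make that boundary concrete, assuming candidate records carry the source_authorization_status field from the candidate schema in the next section (a sketch, not a full consent implementation):

```python
AUTHORIZED = {"applied", "consented"}

def eligible_for_scoring(candidate: dict) -> bool:
    """Gate between the sourcing index and the scoring system: only candidates
    with an employer-mediated authorization are ever scored."""
    return candidate.get("source_authorization_status") in AUTHORIZED

def score_authorized(candidates: list[dict], scorer) -> list[dict]:
    # Candidates filtered here remain searchable in the sourcing index but are
    # never scored or surfaced with a score. Logging the filtering event is
    # itself useful evidence that the gate exists and runs.
    return [scorer(c) for c in candidates if eligible_for_scoring(c)]
```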
Candidate parsing schema
A structured representation of the candidate that mirrors the job schema:
```yaml
candidate:
  candidate_id: <uuid>
  source: ats | linkedin | other_external
  source_authorization_status: applied | consented | unknown
  ingest_timestamp: <ISO timestamp>
  identity:
    name: <text>          # NEVER passed to scoring
    contact_info: <text>  # NEVER passed to scoring
  parsed_qualifications:
    education:
      - institution: <text>
        degree: <text>
        field: <text>
        graduation_year: <int>
        verified: <boolean>
    experience:
      - employer: <text>
        title: <text>
        start_date: <ISO date>
        end_date: <ISO date or "current">
        responsibilities: [<list>]
        skills_demonstrated: [<list>]
    skills:
      - skill: <text>
        years_experience: <float>
        evidence_source: <reference to experience entry>
    certifications:
      - name: <text>
        issuer: <text>
        date_earned: <ISO date>
        verification_status: verified | unverified
    languages:
      - language: <text>
        proficiency: native | fluent | professional | conversational | basic
  features_for_scoring:
    # These are the inputs the scoring model actually sees
    # Demographic and identity data is NOT in this section
  audit_metadata:
    parsing_model_version: <id>
    parsing_timestamp: <ISO>
    raw_source_documents: [<refs>]
```

Two important properties of this schema:
Identity vs scoring separation. Name, contact info, and any other identity-revealing data flow into a separate identity section that the scoring model does not see. The scoring model sees features_for_scoring only. This is enforced at the access control layer, not by code review.
Provenance per parsed field. Every extracted skill links back to a specific evidence source in the experience or certifications. A skill that cannot be traced to a textual source is a hallucination and gets flagged. Hallucinated skills are a known LLM-parsing failure mode and a direct candidate harm.
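A sketch of the provenance check, under the assumption that each experience and certification entry is assigned an id at parse time (a hypothetical field) that skills reference via evidence_source:

```python
def unevidenced_skills(candidate: dict) -> list[str]:
    """Return skill names whose evidence_source does not resolve to an actual
    experience or certification entry; these are flagged, never scored."""
    pq = candidate["parsed_qualifications"]
    valid = {e["id"] for e in pq.get("experience", [])}
    valid |= {c["id"] for c in pq.get("certifications", [])}
    return [
        s["skill"]
        for s in pq.get("skills", [])
        if s.get("evidence_source") not in valid
    ]
```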
Demographic data isolation barrier
The single most important architectural boundary in the system. Demographic data (race, sex, age, disability, etc.) flows through one path; scoring inputs flow through another. The two paths never join until the audit and monitoring layer.
A working pattern:
```
ATS demographic data --> Audit-only data store
                                 |
                                 v
                        Bias audit pipeline
                        Continuous monitoring
                        (read access only here)

Scoring inputs (no demographic data) --> Scoring model
                                             |
                                             v
                                        Match scores
                                             |
                                             v
                                   Outputs to recruiters
```
Implementation requires:
- Separate database schemas for scoring inputs vs demographic data
- IAM access controls preventing scoring services from reading demographic schemas
- Code review and CI checks that flag any join between scoring and demographic data outside the audit layer
- Periodic access audits to verify the boundary holds
The reason this matters: a model that has read access to demographic data, even if it is "not used" in scoring, creates direct disparate treatment exposure. A model that physically cannot access demographic data has a much stronger defense.
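A minimal sketch of the CI check from the list above, assuming the platform runs on Postgres; the role and schema names are illustrative:

```python
import psycopg2  # adapt the check for your actual database or warehouse

def test_scoring_role_cannot_read_demographics():
    """CI check for the isolation barrier: connect with audit credentials and
    assert the scoring service's role holds no privilege on the demographic
    schema."""
    conn = psycopg2.connect("dbname=platform user=ci_auditor")
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT has_schema_privilege(%s, %s, 'USAGE')",
                ("scoring_service", "demographics"),
            )
            assert cur.fetchone()[0] is False, (
                "scoring_service can reach the demographics schema; "
                "the isolation barrier is broken"
            )
    finally:
        conn.close()
```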
Match scoring
The scoring model takes the structured job and structured candidate and produces a match score with rationale. Several architectural patterns are in use.
Pattern A: LLM as primary scorer
A single LLM call takes the job and candidate as input and produces a structured score with rationale. Most common in newer products.
Pros. Fast to build. Adapts to varied roles without retraining. Produces natural-language rationale.
Cons. Scoring is non-deterministic unless temperature is pinned to 0, and even then provider model updates can shift outputs. Hard to attribute scores to specific features. Calibration is unreliable without post-hoc adjustment. Most expensive at scale.
Pattern B: Hybrid feature model with LLM rationale
Features are extracted programmatically (years of experience, education match, skill overlap) and combined via a learned scoring function (gradient boosted trees, calibrated logistic regression). An LLM produces the human-readable rationale separately, conditioned on the score and features.
Pros. Score is calibrated and reproducible. Feature attribution is explicit. Compliance documentation is straightforward. Scales cheaply.
Cons. Higher engineering investment. Less adaptable to roles outside the trained taxonomy. Requires labeled data for training the scoring function.
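A minimal sketch of the Pattern B core using scikit-learn, with placeholder synthetic data standing in for real features and labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV

# Placeholder feature matrix: one row per (job, candidate) pair, columns like
# years-of-experience delta, education match, skill-overlap ratio.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # e.g. advanced-to-interview

# Gradient boosted trees plus isotonic calibration, so the output behaves like
# a probability and can populate calibrated_probability in the scoring schema.
scorer = CalibratedClassifierCV(GradientBoostingClassifier(), method="isotonic", cv=5)
scorer.fit(X, y)

match_probability = scorer.predict_proba(rng.normal(size=(5, 3)))[:, 1]
```

The score is reproducible, the feature attributions fall out of the tree model, and the LLM's only job is generating rationale text conditioned on score and features.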
Pattern C: LLM as judge of feature model
A feature model produces the score and feature attributions. An LLM reviews the scoring against the candidate's full profile and either confirms or flags. The LLM's flag triggers human review.
Pros. Combines defensibility of feature model with LLM's broader reasoning. Catches edge cases the feature model misses.
Cons. Most expensive. Two systems to maintain. Useful primarily for high-stakes roles.
Choosing among patterns
| If your product is | Use pattern |
|---|---|
| Net-new, building broad role coverage fast | A (LLM primary), with calibration layer |
| Mature, with labeled data and high-volume specific roles | B (Hybrid) |
| High-stakes roles (executive search, regulated industries) | C (LLM judge of feature model) |
| Multi-tenant SaaS, varied employer needs | B with optional A overlay per tenant |
Pattern A is most common in 2026; Pattern B is what mature vendors converge to as they accumulate labeled data; Pattern C appears in specific high-value verticals.
Scoring output schema
Regardless of pattern, the output is structured:
```yaml
match_score:
  score: <float 0-1 or 0-100>
  calibrated_probability: <float 0-1>
  rationale:
    overall_summary: <text>
    matched_qualifications:
      - job_requirement: <reference to job schema>
        candidate_evidence: <reference to candidate schema>
        strength: high | medium | low
    gaps:
      - job_requirement: <reference>
        gap_description: <text>
        severity: blocker | concern | minor
    notable_strengths:
      - description: <text>
        evidence: <reference>
  scoring_metadata:
    model_version: <id>
    feature_attributions: <map of feature name to contribution>
    confidence: <float 0-1>
  audit_metadata:
    timestamp: <ISO>
    job_version: <id>
    candidate_version: <id>
```

Every claim in the rationale references either a job requirement or a candidate evidence point. Hallucinated rationale (claims that do not map to evidence) fails post-processing validation. Feature attributions are present whether the model is feature-based or LLM-based; for LLM-based scoring, attributions are computed via prompt-level ablations or learned attribution methods.
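A sketch of that post-processing validation, reusing the assumption that job requirements and candidate evidence entries carry parse-time ids (hypothetical fields) that rationale claims reference:

```python
def ungrounded_claims(match_score: dict, job: dict, candidate: dict) -> list[dict]:
    """Return rationale claims whose references do not resolve to a real job
    requirement or candidate evidence entry. Non-empty means the rationale is
    hallucinated and the score is blocked from surfacing."""
    req_ids = {
        q["id"]
        for q in job.get("required_qualifications", [])
        + job.get("preferred_qualifications", [])
    }
    pq = candidate["parsed_qualifications"]
    evidence_ids = {
        e["id"]
        for section in ("experience", "education", "certifications")
        for e in pq.get(section, [])
    }
    return [
        c for c in match_score["rationale"]["matched_qualifications"]
        if c["job_requirement"] not in req_ids
        or c["candidate_evidence"] not in evidence_ids
    ]
```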
Ranking and surfacing
The scoring model produces a score per (job, candidate) pair. Ranking surfaces the top-K candidates per job.
Several design choices have legal and operational implications:
Surface top-N or surface all with scores? Top-N filtering removes candidates from human review. Mobley v. Workday hinged partly on whether Workday's recommendations functionally filtered candidates from consideration. A platform that surfaces all qualified candidates with confidence labels has a stronger "we provide a tool, the human decides" defense than one that filters.
Threshold per role or per recruiter preference? Recruiters often want to set their own thresholds ("only show me 4-stars and up"). The split between platform-set defaults and recruiter-customized thresholds affects audit results: if every recruiter uses a different threshold, selection-rate computation gets messier. Logging the configured threshold for every evaluation is what makes the audit possible.
De-duplication and dedup signals. Candidates often appear in multiple sourcing channels with slightly different data. The dedup logic needs to be consistent and traceable; auditors will ask why the same candidate appears in multiple records and how it was resolved.
Pagination and surfacing rules. Many platforms surface only the first page (top 10-20) by default. The candidates beyond the first page are theoretically available but practically invisible. The implicit selection rate is the rate of being on page one, not the rate of being in the database.
Audit trail
The audit trail is what makes the rest of the system defensible. Every evaluation produces a record:
```yaml
evaluation_record:
  evaluation_id: <uuid>
  timestamp: <ISO>
  inputs:
    job_id: <reference>
    job_version: <id>
    candidate_id: <reference>
    candidate_version: <id>
    employer_id: <reference>
    recruiter_id: <if applicable>
  processing:
    model_versions:
      - parser: <version>
      - scorer: <version>
      - rationale_generator: <version>
    feature_values: <map>
    score_components: <map>
  outputs:
    score: <float>
    rationale: <reference to stored rationale>
    surfaced_to_recruiter: <boolean>
    surfaced_position: <int if surfaced>
  downstream_actions:
    recruiter_action: clicked | reviewed | advanced | rejected | none
    employer_action: hired | offered | passed | none
    candidate_action: withdrew | accepted | declined | none
  legal_metadata:
    fcra_disclosure_sent: <boolean, timestamp>
    candidate_authorization: <boolean, reference>
    ll_144_notice_provided: <boolean, timestamp>
  retention:
    expires_at: <ISO>
    legal_hold: <boolean>
```

These records support:
- LL 144 annual audits (selection rate computation per group)
- FCRA dispute response (showing the candidate exactly what data contributed to their score)
- Mobley-style discovery (reconstructing how disparate impact arose)
- Internal evaluation and continuous monitoring
- Customer-facing reporting on platform usage
Storage requirements scale with evaluation volume. A platform processing a million evaluations per day generates roughly 1-10 GB of trace data daily depending on richness. Cold storage (S3, etc.) for older records, hot storage (queryable database) for recent records, with retention rules that satisfy the longest applicable regulatory period.
Continuous evaluation and monitoring
Continuous monitoring closes the loop. It runs on the audit trail data and produces:
| Metric | Frequency | Alert threshold |
|---|---|---|
| Selection rate per protected group | Weekly | Impact ratio below 0.80 |
| Scoring rate above median per group | Weekly | Impact ratio below 0.80 |
| Calibration ECE | Weekly | ECE above 5% |
| Calibration ECE per group | Weekly | Group-specific ECE > 8% |
| Score distribution by demographic | Weekly | Significant distributional shift |
| Hallucinated rationale rate | Daily | Rate above 1% |
| Adversarial test suite pass rate | Daily | Any test fails |
| Recruiter override rate | Weekly | Significant change from baseline |
| Hire rate per surfaced candidate | Monthly | Significant drop |
Alerts route to engineering and ML teams with documented investigation playbooks. The discipline of "every alert investigated, every finding documented, every documented finding remediated or explicitly accepted" is what separates serious operations from compliance theater.
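Sketches of the two most load-bearing metrics from the table, impact ratio (the four-fifths rule) and ECE (expected calibration error), computed over arrays pulled from the evaluation records:

```python
import numpy as np

def impact_ratios(selected: np.ndarray, group: np.ndarray) -> dict:
    """Selection rate per group divided by the highest group's rate.
    Any value below 0.80 trips the four-fifths-rule alert."""
    rates = {g: float(selected[group == g].mean()) for g in np.unique(group)}
    best = max(rates.values())
    return {g: r / best for g, r in rates.items()}

def expected_calibration_error(
    probs: np.ndarray, outcomes: np.ndarray, bins: int = 10
) -> float:
    """Binned ECE: |mean predicted probability - observed outcome rate| per bin,
    weighted by the fraction of evaluations that land in the bin."""
    bin_ids = np.minimum((probs * bins).astype(int), bins - 1)
    ece = 0.0
    for b in range(bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(ece)
```

Note that the demographic group labels here live in the audit-only store; this code runs inside the monitoring layer, the one place where the two data paths are allowed to join.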
Build order
Each layer of an AI sourcing and screening agent compounds on the integrity of the layer beneath it. Skip a step or ship it half-built and every downstream eval inherits the defect.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Audit substrate: tool versioning, evaluation record schema, demographic isolation barrier enforced at IAM | 100% of LLM calls produce a versioned evaluation record; CI test confirms scoring services cannot read demographic schemas |
| 2 | Job structure extraction with required source quotes per qualification | Less than 1% hallucinated qualifications on a 100-job gold set; schema validation passes on 100% of outputs |
| 3 | Candidate parsing and enrichment with provenance per field, identity vs scoring separation | Less than 1% hallucinated skills on a 200-resume gold set; zero name or contact fields reachable from features_for_scoring |
| 4 | Match scoring (Pattern A, B, or C) with feature attributions and grounded rationale | Calibration ECE below 5% on held-out set; 100% of rationale claims trace to a job requirement or candidate evidence reference |
| 5 | Ranking, surfacing, and ATS outcome capture with FCRA pre-adverse workflow | Recruiter UI exposes ranks beyond top-N; ATS webhooks capture hire, offer, pass, withdraw on 95%+ of surfaced candidates |
| 6 | Continuous bias monitoring and external audit tooling | Weekly impact ratio computed per protected group with alert at 0.80; LL 144 audit export reproduces selection rates from raw evaluation records |
After step 6 you book the first annual external audit and expand the adversarial test suite as new attack patterns emerge. Skip the order and you spend the audit window backfilling provenance and demographic isolation under deadline pressure instead of shipping.
Common cuts that cost more later
Patterns that show up across teams that ship fast and remediate slowly:
No tool versioning. Every model and prompt change ships without explicit versioning. The team knows roughly what is in production but cannot reconstruct historical state. The first audit reveals this and forces a backfill that takes weeks.
Demographic data accessible to scoring. Feature engineering or training data includes fields that proxy for protected class. The team "knows" not to use them but the access controls do not enforce it. Disparate treatment exposure is direct.
Outcome data not captured. ATS webhooks not implemented; outcomes are inferred from indirect signals. Audit can compute AEDT recommendations but not actual selection rates. Audit value is reduced.
Top-N filtering by default. Recruiter UI shows only top 10 by default with no easy way to see ranks 11+. The platform is functionally filtering candidates from review. Mobley-style exposure compounds.
No FCRA pre-adverse workflow. When employers use AI scores to reject candidates, no pre-adverse notification flows. If the Eightfold theory holds in court, this becomes a per-rejection FCRA violation.
Compliance theater audit. A self-published bias audit, or an audit by an affiliated party, that does not satisfy LL 144 independence requirements. Marketed as compliance, fails the law.
Eval set frozen at launch. Evaluation runs against the same 200 cases that were assembled in month one. Production behavior diverges from eval behavior. Drift goes undetected.
The teams that avoid these patterns do so by building the foundation first (tool versioning, evaluation records, demographic isolation) and then building features on top. The teams that ship features first end up rebuilding the foundation later under audit pressure.
How Respan fits
Building a defensible AI sourcing and screening agent (parsing, enrichment, matching, ranking, audit trail) is fundamentally an observability and governance problem, and Respan is the substrate that holds the pieces together. The platform wires tracing, evals, gateway, prompt management, and monitors into a single loop that maps directly to LL 144, FCRA, and Mobley-style defensibility.
- Tracing: every (job, candidate) evaluation captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. When a candidate disputes a score under FCRA or an auditor reconstructs how a disparate impact arose, the full lineage from job structure extraction through parsing, scoring, and surfacing is one query away.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on hallucinated qualifications, ungrounded rationale claims, calibration drift, and selection-rate impact ratios falling below 0.80 before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Routing job extraction, parsing, and scoring calls through a single gateway gives you per-tenant cost ceilings and a uniform place to log model versions for the evaluation record schema.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Job structure extraction prompts, candidate parsing prompts, match-scoring prompts, and rationale generators all belong in the registry so every change is auditable and reversible without a deploy.
- Monitors and alerts: selection rate per protected group, calibration ECE, hallucinated rationale rate, adversarial test suite pass rate, and recruiter override rate. Slack, email, PagerDuty, webhook. The "every alert investigated, every finding documented" discipline that separates serious operations from compliance theater is enforced by the monitoring layer, not memory.
A reasonable starter loop for AI sourcing and screening builders:
- Instrument every LLM call with Respan tracing including job extraction, parsing, scoring, and rationale spans.
- Pull 200 to 500 production (job, candidate) evaluations into a dataset and label them for match accuracy, rationale groundedness, and demographic isolation integrity.
- Wire two or three evaluators that catch the failure modes you most fear (hallucinated qualifications, ungrounded rationale claims, impact ratio falling below 0.80 per protected group).
- Put your job extraction, parsing, and scoring prompts behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model versions, costs, and fallback behavior are uniform across tenants and logged into every evaluation record.
Skip this loop and the first serious enterprise security review, LL 144 audit, or Mobley-style discovery request becomes a multi-week firefight against a system you cannot reconstruct.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The Eightfold FCRA Lawsuit and What Algorithmic Hiring Engineers Need to Ship Now: the legal regime
- Building Bias Audits for AI Recruiting: annual external audit methodology
- Evaluating Recruiting LLMs: four-dimension evaluation framework
- How HR Tech Teams Build LLM Apps in 2026: pillar overview
