NYC Local Law 144 has required annual independent bias audits of Automated Employment Decision Tools since July 2023. The December 2025 New York State Comptroller audit of NYC DCWP enforcement made clear that compliance through the first 18 months was uneven, and the agency has since committed to tighter enforcement. Illinois HB 3773, which amended the Illinois Human Rights Act to explicitly cover AI-driven employment discrimination, took effect January 1, 2026. Colorado's AI Act extends similar high-risk AI obligations beginning February 2026. The EU AI Act classifies AI used in hiring as high-risk; those obligations take full effect in August 2026.
The result: bias audits have transitioned from a marketing artifact to a procurement requirement. Enterprise buyers ask for them in the first round of vendor security review. Audit failures show up in litigation. Audits done badly create more exposure than no audit, because they document the platform's flaws in writing.
This post covers what a bias audit actually consists of when done seriously, what datasets and pipelines you need to support an audit, and the failure modes that show up most often in real audits. It is written for engineering teams building or operating AI recruiting platforms, not for compliance teams writing policy.
What a bias audit is and is not
A bias audit, as required by NYC LL 144 and the equivalent provisions in newer state laws, is a quantitative analysis by an independent third party that measures the disparate impact of an Automated Employment Decision Tool on protected demographic groups.
What it is:
- A statistical analysis of selection rates and impact ratios per protected group
- Conducted on either historical use data or representative test data
- Performed by a party with no financial interest in the tool
- Documented in a written report with methodology and findings
- Published in summary form on the employer's website
What it is not:
- A code review or model audit
- A review of training data composition (some auditors do this, but it is not the LL 144 requirement)
- A statement of whether the tool "is biased" (the audit measures specific quantitative outcomes; characterization is the auditor's interpretation)
- A pass/fail certification (the law does not specify what to do if bias is found, only that it must be measured and disclosed)
- A substitute for fair lending or fair hiring testing on the underlying model design
The framing matters because engineering teams sometimes treat the audit as a separate compliance task rather than a consequence of their model design choices. The audit is a measurement; if the measurement reveals disparate impact, the fix is in the model and data pipeline, not in the audit document.
The core metrics
Two metrics drive the audit. Both are straightforward to compute in any statistical environment.
Selection rate. For a binary AEDT (advance candidate to next round, yes or no), the selection rate per group is the fraction of candidates in that group who were advanced. For a continuous-score AEDT (a Match Score from 0 to 5, for example), the audit uses a "scoring rate" defined as the fraction of candidates in each group who scored above the median.
Impact ratio. The impact ratio for any group is that group's selection rate divided by the highest selection rate among any group. The four-fifths rule (originating from EEOC's Uniform Guidelines on Employee Selection Procedures, 29 CFR 1607) treats an impact ratio below 0.80 as presumptive evidence of disparate impact. The four-fifths rule is not a legal threshold; it is an investigative trigger. Auditors report impact ratios regardless of whether they exceed 0.80.
LL 144 requires the audit to compute these metrics across:
- Race / ethnicity (using EEO-1 categories: Hispanic/Latino, White, Black/African American, Asian, Native Hawaiian/Pacific Islander, American Indian/Alaska Native, Two or More Races)
- Sex (Male, Female; the law uses binary categories tracking EEO-1)
- Intersectional categories (e.g., Hispanic women, Asian men, Black women)
- An "Unknown" category for candidates whose demographic data was not collected
Categories representing less than 2% of the audit dataset can be excluded from the impact ratio calculation per the DCWP's final rules. The auditor decides which categories qualify.
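A minimal sketch of the computation as the law defines it, assuming the audit extract is already a pandas DataFrame with one row per evaluation and illustrative column names (race, sex, selected):

```python
import pandas as pd

def impact_ratios(df: pd.DataFrame, group_col: str, min_share: float = 0.02) -> pd.DataFrame:
    """Selection rate and impact ratio per group, with the 2% exclusion applied."""
    stats = (
        df.groupby(group_col)["selected"]
          .agg(n="count", selected="sum", rate="mean")
          .reset_index()
    )
    # DCWP's final rules allow excluding categories under 2% of the sample
    # from the impact ratio calculation; the auditor decides which qualify.
    stats["share"] = stats["n"] / stats["n"].sum()
    included = stats[stats["share"] >= min_share].copy()
    # Impact ratio: each group's selection rate divided by the highest rate.
    included["impact_ratio"] = included["rate"] / included["rate"].max()
    included["below_four_fifths"] = included["impact_ratio"] < 0.80
    return included

# Hypothetical usage against an audit extract:
# df = pd.read_parquet("aedt_evaluations_2025.parquet")
# race_table = impact_ratios(df, "race")
# sex_table = impact_ratios(df, "sex")
# df["race_sex"] = df["race"] + " " + df["sex"]   # intersectional cut
# intersectional_table = impact_ratios(df, "race_sex")
```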
A simplified audit output table might look like:
Selection rates and impact ratios for AEDT version 4.2.1
Audit period: 2025-04-01 to 2026-03-31
Sample: 12,847 candidates across 137 employers
| Group | N | Selected | Rate | Impact ratio |
|---|---|---|---|---|
| White | 5,234 | 1,308 | 0.250 | 1.00 (ref) |
| Hispanic/Latino | 2,891 | 651 | 0.225 | 0.90 |
| Black/African American | 1,876 | 357 | 0.190 | 0.76* |
| Asian | 1,944 | 583 | 0.300 | 1.20 |
| Two or More Races | 612 | 138 | 0.225 | 0.90 |
| Native Hawaiian/Pacific Islander | 198 | 38 | 0.192 | 0.77 |
| American Indian/Alaska Native | 92 | - | - | excluded (<2%) |
| Male | 7,421 | 1,855 | 0.250 | 1.00 (ref) |
| Female | 5,212 | 1,094 | 0.210 | 0.84 |
| Hispanic Female | 1,389 | 264 | 0.190 | 0.76* |
| Hispanic Male | 1,502 | 387 | 0.258 | 1.03 |
| Black Female | 1,034 | 165 | 0.160 | 0.64* |
| Black Male | 842 | 192 | 0.228 | 0.91 |

[etc.]
* indicates impact ratio below 0.80 (presumptive disparate impact)
The starred rows are where the engineering work focuses. An impact ratio below 0.80 does not automatically violate the law, but it is reportable, public, and visible to litigants.
What auditors actually want
The auditor's job is to produce a defensible report. Their data requirements reflect that. The standard request to a vendor or employer:
Data scope.
- Candidate-level records covering the audit period (typically 12 months or the period since the last audit)
- One row per evaluation by the AEDT, including candidates who did not advance
- Tool version and configuration in effect for each evaluation
- Final disposition (advanced, not advanced, withdrawn, hired, rejected post-interview)
- Demographic data per candidate (typically from the employer's ATS, not from the AEDT)
Data quality requirements.
- Demographic data coverage above a defensible threshold (most auditors target 80%+ coverage; below that, the audit's statistical power degrades and the auditor may recommend more data collection before issuing a finding)
- Outcome data accuracy: when the AEDT's recommendation is recorded as "advance," that record matches what the candidate actually experienced
- Configuration consistency: if the tool was reconfigured during the period (different scoring thresholds, different feature weights), the audit either treats each configuration separately or uses a representative configuration
Methodology documentation.
- A description of how the AEDT works at a level the auditor can reason about (input fields, scoring approach, score-to-decision mapping)
- The "job qualifications and characteristics" the tool assesses, as required by LL 144 candidate notices
- Any human-in-the-loop steps that intervene between AEDT output and final decision
The auditor does not need access to your model weights or training code. They need access to the inputs and outputs at a candidate level, with enough metadata to interpret what they are seeing.
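In practice that hand-off is a single joined extract. A rough sketch, assuming parquet exports with illustrative file and column names:

```python
import pandas as pd

# Hypothetical inputs: one row per AEDT evaluation, plus ATS exports for
# dispositions and demographics. File and column names are illustrative.
evaluations = pd.read_parquet("aedt_evaluations.parquet")    # candidate_id, job_id, tool_version, score, recommendation, evaluated_at
dispositions = pd.read_parquet("ats_dispositions.parquet")   # candidate_id, job_id, final_disposition
demographics = pd.read_parquet("ats_demographics.parquet")   # candidate_id, race, sex

audit_extract = (
    evaluations
    .merge(dispositions, on=["candidate_id", "job_id"], how="left")
    .merge(demographics, on="candidate_id", how="left")
)

# The auditor's first questions are about coverage, not bias.
coverage = audit_extract["race"].notna().mean()            # demographic data coverage
unversioned = audit_extract["tool_version"].isna().mean()  # evaluations with no resolvable version
print(f"demographic coverage: {coverage:.1%}, unversioned evaluations: {unversioned:.1%}")
```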
Where audits typically fail
Patterns that show up across vendor audits in the first two years of LL 144 enforcement:
Insufficient demographic data coverage. The AEDT does not collect demographic information; the employer's ATS does, but the linkage between AEDT records and ATS demographic records is broken or incomplete. The auditor cannot compute selection rates per group at acceptable statistical power. The audit either delays or proceeds with caveats that limit its defensive value.
Data drift across the audit period. The tool was changed during the audit period without versioning. Selection rates appear to vary not because of demographic effects but because the tool itself behaved differently at different times. The auditor either refuses to issue a single audit covering the period or issues multiple audits per configuration. Both are messier than a clean per-version audit.
Missing post-AEDT outcomes. The AEDT scores candidates but the platform does not capture whether the candidate was actually advanced by the employer. Selection rate computation requires the actual selection outcome, not just the recommendation. Without it, the audit measures something other than the AEDT's selection effect.
Intersectional categories missing. The audit covers race and sex separately but not intersectionally. LL 144 explicitly requires intersectional analysis (e.g., Hispanic women treated as a separate category from "Hispanic" or "Female" alone). An audit that omits intersectional categories does not satisfy the law; it has to be redone and the summary republished.
Confusing tool versions in the public summary. The audit covers tool version 4.2.1, but the employer is currently using version 4.3 because the vendor pushed an update mid-year. The published summary refers to a version no longer in production. Either the employer is non-compliant (using an unaudited tool) or the summary is misleading. Both create exposure.
Self-audit by the vendor. The vendor publishes a "bias audit" performed by the vendor's own data science team or a closely affiliated party. LL 144's independence requirements are not met. The audit is technically not an audit under the law, even if the analysis is methodologically correct.
Architectural pre-conditions for a defensible audit
The infrastructure that makes audits clean has to be in place before the audit period starts. Building it after the auditor arrives costs more than building it from day one.
1. Tool versioning and configuration registry
Every change to the AEDT has a version number. Every candidate evaluation records which version evaluated them. A registry maps versions to their feature lists, scoring functions, and effective date ranges. Auditors should be able to ask "what did version 4.2.1 do, when was it in production, and how many candidates did it score" and get a structured answer in minutes.
This is the single highest-leverage piece of audit infrastructure. Without it, every other audit task gets harder.
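A sketch of what the registry needs to hold; the field names here are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ToolVersion:
    version: str                   # e.g. "4.2.1"
    effective_from: date
    effective_to: date | None      # None while still in production
    features: list[str] = field(default_factory=list)
    scoring_function: str = ""     # pointer to the scoring code or model artifact
    decision_threshold: float = 0.0

REGISTRY: dict[str, ToolVersion] = {}

def describe(version: str, evaluation_counts: dict[str, int]) -> str:
    """Answer the auditor's question in one call: what did this version do,
    when was it live, and how many candidates did it score."""
    v = REGISTRY[version]
    return (
        f"{v.version}: live {v.effective_from} to {v.effective_to or 'present'}, "
        f"features={v.features}, threshold={v.decision_threshold}, "
        f"candidates scored={evaluation_counts.get(version, 0)}"
    )
```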
2. Per-candidate evaluation records
Every time the AEDT scores a candidate, the system records:
- Candidate ID (linkable to the employer's ATS)
- Job ID (the role the candidate applied to)
- Employer ID
- Tool version
- Input feature values
- Output score, ranking, and recommendation
- Decision threshold or rule applied
- Timestamp
These records are the audit's primary input. They need to be retained for the audit period and beyond (the report itself typically references the data window; some jurisdictions require longer retention).
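A minimal sketch of that record as a dataclass, with illustrative field names that map one-to-one onto the list above:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EvaluationRecord:
    """One record per scoring event; every item in the list above gets a queryable home."""
    candidate_id: str            # joins to the employer's ATS
    job_id: str
    employer_id: str
    tool_version: str            # resolves against the version registry
    input_features: dict         # feature name -> value as seen at inference time
    score: float
    rank: int | None
    recommendation: str          # e.g. "advance" / "do_not_advance"
    decision_threshold: float
    evaluated_at: datetime
    final_disposition: str | None = None   # filled in later by outcome capture
```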
3. Demographic data linkage layer
The candidate's demographic data lives in the employer's ATS. The audit needs to join this data to the AEDT's evaluation records. Two patterns work:
- ATS-side join. The employer extracts a joined dataset and provides it to the auditor. The vendor never sees the demographic data.
- Vendor-side audit space. The vendor maintains an audit-only data store with limited access, where ATS demographic data is loaded for audit purposes and then deleted.
The first pattern is cleaner from a privacy standpoint. The second is faster operationally. Either works for the audit; both require infrastructure built before the audit starts.
A critical constraint: demographic data must not be accessible to the AEDT at training or inference time. Access controls have to enforce this at the infrastructure layer. A vendor whose model has read access to demographic fields has direct discrimination exposure regardless of audit outcome.
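The real control lives at the infrastructure layer (separate credentials, separate stores), but a last-line check in application code is cheap. A sketch, with a hypothetical field list:

```python
# Fields that must never reach training or inference; the list is illustrative.
DEMOGRAPHIC_FIELDS = {"race", "ethnicity", "sex", "gender", "date_of_birth", "age"}

def assert_no_demographics(features: dict) -> dict:
    """Refuse to score if any demographic field leaked into the model input."""
    leaked = DEMOGRAPHIC_FIELDS & set(features)
    if leaked:
        raise ValueError(f"demographic fields reached the model input pipeline: {sorted(leaked)}")
    return features

# score = model.predict(assert_no_demographics(candidate_features))   # hypothetical call site
```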
4. Outcome capture
The candidate's outcome (advanced, not advanced, hired) needs to flow back to the AEDT records. Without outcome data, the audit cannot compute selection rates against actual selections, only against AEDT recommendations.
The standard approach is a webhook from the ATS back to the AEDT platform whenever a candidate's status changes. The webhook updates the AEDT record with the disposition. This is plumbing engineering, not glamorous, but it is the difference between an audit that measures something useful and an audit that measures the AEDT's behavior in a vacuum.
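A sketch of that webhook, assuming a FastAPI service and a hypothetical update_disposition helper; real payload shapes vary by ATS:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class StatusChange(BaseModel):
    candidate_id: str
    job_id: str
    disposition: str     # e.g. "advanced", "not_advanced", "hired", "withdrawn"
    changed_at: str

def update_disposition(candidate_id: str, job_id: str, disposition: str, changed_at: str) -> None:
    """Hypothetical persistence call: attach the disposition to the matching
    evaluation records so selection rates reflect actual outcomes."""
    ...

@app.post("/webhooks/ats/status")
def ats_status_change(event: StatusChange):
    update_disposition(event.candidate_id, event.job_id, event.disposition, event.changed_at)
    return {"ok": True}
```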
5. Continuous monitoring
The annual audit is the visible compliance artifact. Continuous monitoring is the engineering practice that catches problems between audits.
Selection rate and impact ratio computed at a daily or weekly cadence on the prior 30 to 90 days of data. Alerting when impact ratio crosses the 0.80 threshold for any group. Dashboard for engineering and product to see drift before it shows up in an audit.
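A sketch of the weekly check, assuming the trailing window is already loaded as a DataFrame with a selected column; the loader and notifier are hypothetical:

```python
import pandas as pd

FOUR_FIFTHS = 0.80

def impact_ratio_breaches(df: pd.DataFrame, group_col: str) -> list[str]:
    """Return alert messages for any group whose impact ratio falls below 0.80."""
    rates = df.groupby(group_col)["selected"].mean()
    ratios = rates / rates.max()
    return [
        f"{group_col}={group}: impact ratio {ratio:.2f} below {FOUR_FIFTHS}"
        for group, ratio in ratios.items()
        if ratio < FOUR_FIFTHS
    ]

# Hypothetical weekly job over the trailing 90 days:
# window = load_evaluations(days=90)
# alerts = (impact_ratio_breaches(window, "race")
#           + impact_ratio_breaches(window, "sex")
#           + impact_ratio_breaches(window, "race_sex"))
# for message in alerts:
#     notify_engineering(message)    # Slack / PagerDuty / email
```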
This is also the input to your "effective challenge" obligations under MRM-style regulatory frameworks (which apply directly in financial services and by analogy in other regulated employment contexts) and to the disparate impact monitoring that Mobley-style litigation relies on for damages computation.
Choosing an auditor
LL 144's independence requirements are stricter than they look. The auditor must:
- Have no financial interest in the AEDT being audited
- Not be the vendor or affiliated with the vendor
- Not have been employed by the employer using the AEDT during the audit period or the two years prior
- Not have a financial interest in any party that has a material stake in the audit outcome
DCWP does not maintain a list of approved auditors. Selection is the employer's responsibility. The auditor's qualifications matter for the audit's defensibility:
| Qualification | Why it matters |
|---|---|
| Statistical and methodological expertise | Selection rates, impact ratios, confidence intervals, and sample size adequacy are statistical computations |
| Familiarity with EEOC Uniform Guidelines | The four-fifths rule and disparate impact framework come from these guidelines; auditors who know the framework produce more defensible reports |
| AI/ML domain knowledge | Understanding how the AEDT actually produces scores affects whether the audit measures the right thing |
| Documented methodology | An auditor with a published, peer-reviewed methodology is more defensible than one with an opaque process |
Several specialist firms (Warden AI, BABL AI, Holistic AI, Eticas, Mind Foundry, others) have built businesses around AI audits. Big-four firms (Deloitte, EY, PwC, KPMG) increasingly compete in this space. Academic researchers can also serve as independent auditors for individual engagements, though typically without the operational scale to handle annual recurring audits.
What a healthy audit cycle looks like
A vendor or employer with audit infrastructure in place runs a cycle that looks roughly like this each year.
Continuous (year-round). Selection rate and impact ratio computed weekly. Alerts on threshold breaches investigated. Tool versioning and configuration changes logged. Outcome data flows back from ATS systems on a near-real-time cadence.
Q1 (audit prep). Auditor selected and engaged for the upcoming audit. Audit data scope and methodology agreed. Data extract scripts run against the prior 12 months and validated.
Q2 (audit execution). Auditor receives the data, conducts the analysis, asks clarifying questions. Vendor and employer respond promptly. Audit findings shared in draft.
Q3 (audit publication and remediation). Final audit report issued. Public summary published on employer websites. If findings include impact ratios below 0.80, remediation plan developed (model adjustments, threshold changes, expanded candidate sourcing, etc.).
Q4 (next-year planning). Lessons from the current audit feed into the next year's tool changes. Continuous monitoring catches issues that would show up in the next audit, fixes them before they show up.
The discipline of a continuous monitoring + annual external audit cycle is what produces clean audits and platforms that survive litigation. The fire-drill audit cycle, where the team scrambles to produce data when the auditor arrives, produces the failure patterns documented earlier.
Build order
Audit readiness for NYC LL 144, Illinois HB 3773, and the Colorado AI Act is a dependency chain, not a calendar. Each step below produces the artifact the next step needs; skipping ahead leaves the auditor without something they will ask for.
| Order | What you build | Eval gate before moving on |
|---|---|---|
| 1 | Tool version registry and per-candidate evaluation records (candidate ID, job ID, employer ID, version, input features, output score, threshold, disposition, timestamp) | 100% of scoring events in the last 30 days carry a resolvable version ID and a candidate ID that joins to the ATS |
| 2 | Vendor data export schema covering the audit period, with one row per evaluation and version metadata attached | Dry-run extract for a 90-day window returns within SLA and reconciles to within 1% of production scoring counts per version |
| 3 | Demographic data linkage layer (ATS-side join or vendor-side audit space, with access controls blocking demographic fields from training and inference) | Demographic coverage above 80% on the audit window, zero demographic fields reachable from model input pipelines (verified by access-log audit) |
| 4 | Outcome capture from ATS back to the AEDT record (advanced, not advanced, hired, withdrawn) | At least 95% of evaluations from the last 60 days have a final disposition recorded within 7 days of the ATS event |
| 5 | Impact ratio dataset and intersectional disparity metrics computed weekly (selection rate, four-fifths impact ratio per EEO-1 group, and Hispanic-female, Black-female, Asian-male style intersectional cuts) | Dashboard reproduces the LL 144 summary table for any tool version on demand, including intersectional rows above the 2% sample threshold |
| 6 | Candidate-evaluation traces wired into alerting (sub-0.80 impact ratio, demographic coverage drops, outcome capture lag, scoring drift across versions) | Alert fires within 24 hours of a synthetic 0.79 intersectional impact ratio injected into the prior week's data |
After step 6 the annual external audit becomes a confirmation of what the dashboards already show, and the auditor's data request maps to existing queries. Doing the steps out of order, especially shipping monitoring before versioning or running the audit before outcome capture is wired, produces the exact failure patterns documented earlier: orphaned versions, broken intersectional cuts, and selection rates measured against recommendations rather than actual selections.
How Respan fits
Bias audits are only as defensible as the per-candidate evaluation records, version metadata, and outcome data feeding them. Respan is the substrate that captures every AEDT scoring event with the structure auditors and continuous monitoring both depend on.
- Tracing: every candidate evaluation captured as one connected trace. Auto-instrumented for LangChain, LlamaIndex, Vercel AI SDK, CrewAI, AutoGen, OpenAI Agents SDK. Spans record tool version, input features, output score, threshold applied, and disposition so the audit data extract is a query rather than a forensic exercise.
- Evals: ten built-in evaluators (faithfulness, citation accuracy, refusal correctness, harmfulness) plus LLM-as-judge and custom Python evaluators. Production traffic flows directly into datasets. CI-aware experiments block regressions on selection rate drift, sub-0.80 impact ratios, and intersectional disparity before deploys ship.
- Gateway: 500+ models behind an OpenAI-compatible interface, semantic caching, fallback chains, per-customer spending caps. Routing decisions and model versions are logged alongside scoring events so you can isolate which model variant drove a given period's selection rates.
- Prompt management: versioned registry, dev/staging/prod environments with approval workflows, A/B testing in production with one-click rollback. Scoring rubrics, screening prompts, and rationale templates carry audit-ready version IDs that line up with the tool versioning your auditor will request.
- Monitors and alerts: weekly selection rate, four-fifths impact ratio per protected group, intersectional impact ratio, demographic data coverage, and outcome capture lag. Slack, email, PagerDuty, webhook. Threshold breaches reach engineering before they reach the public summary on an employer's website.
A reasonable starter loop for AI recruiting builders:
- Instrument every LLM call with Respan tracing including scoring spans, threshold spans, and human-in-the-loop override spans.
- Pull 200 to 500 production candidate evaluations into a dataset and label them for advance/no-advance correctness, rationale quality, and protected-attribute leakage.
- Wire two or three evaluators that catch the failure modes you most fear (sub-0.80 impact ratio on intersectional groups, demographic data leaking into model inputs, scoring drift across tool versions).
- Put your screening prompts and rationale templates behind the registry so you can version, A/B, and roll back without a deploy.
- Route through the gateway so model swaps and fallbacks stay versioned and auditable instead of silently changing scoring behavior mid-period.
The goal is a continuous monitoring loop where the annual external audit is the confirmation of what your dashboards already showed, not the first time anyone measured impact ratio at scale.
To wire any of the patterns above on Respan, start tracing for free, read the docs, or talk to us.
Related reading
- The Eightfold FCRA Lawsuit and What Algorithmic Hiring Engineers Need to Ship Now: the regulatory environment driving audit demand
- Evaluating Recruiting LLMs: match quality, calibration, and adverse impact
- Building an AI Sourcing and Screening Agent: full architecture walkthrough
- How HR Tech Teams Build LLM Apps in 2026: pillar overview
