Benchmark Methodology

How we test AI models against the ZH-1 deterministic engine. Published so anyone can evaluate our claims.

1. What Is Tested

The benchmark evaluates compliance accuracy across five domains:

  • SNAP / Food Assistance — eligibility rules including ABAWD time limits, income thresholds, categorical eligibility (7 USC §2011-2036)
  • Medicaid — expansion enrollment, work requirements, immigration eligibility (42 USC §1396 et seq.)
  • Housing (HUD) — Section 8 HCV eligibility, PHA admissions, HUD-VASH (42 USC §1437f; 24 CFR §982)
  • Federal Grants — Uniform Guidance requirements, single audit thresholds, indirect cost rates (2 CFR §200)
  • Citation Verification — USC/CFR title validation, Public Law congress numbers, DOI resolution

2. How ZH-1 Answers

ZH-1 is not a language model. It is a rule engine that evaluates encoded federal regulations. Each rule references a specific statutory or regulatory source (e.g., 7 USC §2015(o)). The engine produces deterministic output: same input, same rules, same result. There is no probability, sampling, or generation involved.
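
As a rough illustration of what "same input, same rules, same result" means, the sketch below encodes one rule as a pure function of the input facts plus a citation. It is not ZH-1's actual implementation: the `Rule` class, the `evaluate` function, the rule ID, and the fact keys are all hypothetical names chosen for this example. The age range is the one stated in scenario #1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical encoded rule: a condition plus the source it rests on."""
    rule_id: str
    citation: str
    applies: Callable[[Dict], bool]
    outcome: str


def evaluate(rules: List[Rule], facts: Dict) -> List[Tuple[str, str]]:
    """Return (outcome, citation) for each rule whose condition holds.

    No randomness or sampling is involved, so identical facts and rules
    always yield an identical result.
    """
    return [(r.outcome, r.citation) for r in rules if r.applies(facts)]


# Illustrative encoding of the ABAWD check from scenario #1 (ages 18-64).
ABAWD_RULE = Rule(
    rule_id="snap.abawd.time_limit",  # hypothetical identifier
    citation="7 USC §2015(o)",
    applies=lambda f: (
        18 <= f["age"] <= 64
        and not f["has_dependents"]
        and not f["meets_work_requirement"]
    ),
    outcome="subject to ABAWD time limit",
)

facts = {"age": 55, "has_dependents": False, "meets_work_requirement": False}
# Evaluating twice gives the same answer both times.
assert evaluate([ABAWD_RULE], facts) == evaluate([ABAWD_RULE], facts)
```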

3. How AI Models Are Tested

Each test scenario is sent to the AI model's API as a direct question. The model's response is parsed and evaluated against the expected answer from the rule engine. Models are tested via their official APIs (Anthropic, OpenAI, Google, Groq, xAI) using the latest available model versions.
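
A minimal sketch of such a harness follows. It assumes each provider is wrapped in a plain function that takes a question string and returns the response text. The `Scenario` class, `collect_responses`, and `stub_model` are hypothetical names for illustration, and no real provider SDK calls are shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Scenario:
    """One test scenario: the question plus the rule engine's expected answer."""
    number: int
    domain: str
    question: str
    expected: str
    statute: str


# Assumption: every model is reachable through a question -> response function.
ModelFn = Callable[[str], str]


def collect_responses(
    scenario: Scenario, models: Dict[str, ModelFn], runs: int = 1
) -> Dict[str, List[str]]:
    """Send the scenario question to each model `runs` times; keep raw responses."""
    return {
        name: [ask(scenario.question) for _ in range(runs)]
        for name, ask in models.items()
    }


# Scenario #8 from the published list, used here as sample data.
SCENARIO_8 = Scenario(
    number=8,
    domain="Federal Grants",
    question="What is the single audit threshold for federal grant recipients?",
    expected="$1,000,000 in federal awards expended during the fiscal year",
    statute="2 CFR §200.501",
)


def stub_model(question: str) -> str:
    """Stand-in for a real API wrapper so the sketch runs offline."""
    return "stub answer"


responses = collect_responses(SCENARIO_8, {"stub": stub_model}, runs=3)
```

Keeping the raw responses separate from scoring means the same transcripts can be re-scored later if the rubric changes.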

4. Scoring Rubric

Classification   Criteria
Correct          Response matches the rule engine answer AND cites the applicable statute or regulation.
Hallucination    Response contradicts the rule engine answer, cites a nonexistent statute, or fabricates regulatory requirements.
Partial          Response is directionally correct but missing key requirements, exemptions, or statutory references.
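
The rubric above can be read as a small decision procedure. The sketch below assumes the individual judgments (does the response contradict the engine, does it cite a real and applicable source, and so on) have already been made upstream. The `Judgment` fields and the precedence order (any hallucination signal overrides the rest) are assumptions of this example, not a description of the production scorer.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    CORRECT = "Correct"
    HALLUCINATION = "Hallucination"
    PARTIAL = "Partial"


@dataclass(frozen=True)
class Judgment:
    """Hypothetical per-response findings produced by an earlier parsing step."""
    matches_engine: bool
    cites_applicable_source: bool
    contradicts_engine: bool = False
    cites_nonexistent_source: bool = False
    fabricates_requirement: bool = False


def classify(j: Judgment) -> Classification:
    # Assumed precedence: any hallucination signal wins over everything else.
    if j.contradicts_engine or j.cites_nonexistent_source or j.fabricates_requirement:
        return Classification.HALLUCINATION
    # Correct requires BOTH a matching answer and an applicable citation.
    if j.matches_engine and j.cites_applicable_source:
        return Classification.CORRECT
    # Everything else is treated as directionally right but incomplete.
    return Classification.PARTIAL
```

For example, a response that matches the engine but omits the citation would land in Partial under this reading.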

5. Variance Measurement

Each question is sent to each model N times (typically 5-10 runs). The reported variance is the standard deviation of per-run correctness, expressed as a percentage. A variance of 0% means the model's answer was scored the same way on every run. High variance indicates the model is unreliable even when it occasionally gets the right answer. ZH-1 always has 0% variance because it is deterministic.
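
One way to compute this figure is sketched below. It assumes each run is scored as 1 (correct) or 0 (not correct) and that the population standard deviation is used. Both are assumptions of this example, and `correctness_spread` is a hypothetical name.

```python
from statistics import pstdev
from typing import Sequence


def correctness_spread(run_results: Sequence[bool]) -> float:
    """Standard deviation of per-run correctness, as a percentage.

    Each run counts as 1.0 if correct and 0.0 otherwise. Identical results
    on every run give 0.0; an even split between correct and incorrect
    gives the maximum of 50.0.
    """
    if not run_results:
        raise ValueError("at least one run is required")
    scores = [1.0 if ok else 0.0 for ok in run_results]
    return 100.0 * pstdev(scores)


# A deterministic system scores the same on every run.
deterministic = correctness_spread([True] * 5)
# A model that is right on only half of its runs.
unstable = correctness_spread([True, False, True, False])
```

Note that a model that is wrong on every run also scores 0% here, so this figure is only meaningful alongside the accuracy score.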

6. Limitations

  • “100%” means 100% of currently encoded rules, not 100% of all regulatory questions. ZH-1 can only answer questions covered by its rule engine.
  • Rule coverage is expanding. The current rule set covers SNAP, Medicaid, HUD, and federal grants. Medical compliance is in development.
  • Benchmark scenarios are designed to test common compliance questions where AI models are known to fail. They are not exhaustive.
  • AI model scores reflect a point-in-time snapshot. Models are updated frequently and scores may change.

7. Benchmark vs. Live

On the scoreboard, models with an EST. badge show scores derived from published third-party benchmarks (e.g., HLE, MMLU), mapped to our scoring rubric. Models without the badge have been tested directly against the ZH-1 engine with live API calls. We clearly label which is which.

8. Published Test Scenarios

Below are 20 representative test scenarios. Each shows the question asked, the expected correct answer with statutory basis, and why AI models commonly get it wrong.

#1 SNAP / Food Assistance
Is a 55-year-old able-bodied adult without dependents, who is not working, eligible for SNAP?
Expected: Subject to ABAWD time limits. Under P.L. 119-21, ABAWD provisions now apply to ages 18-64. Without meeting work requirements, benefits are limited to 3 months in a 36-month period unless the state has a waiver.
Statute: 7 USC §2015(o); P.L. 119-21 §201
Why AI fails: Most models cite the pre-2025 ABAWD age range (18-49 or 18-54) because their training data predates P.L. 119-21.
#2 Medicaid
Does a Medicaid expansion enrollee need to meet work requirements?
Expected: Yes. P.L. 119-21 adds an 80-hour/month work requirement for Medicaid expansion adults, effective December 2026. Exemptions exist for pregnant women, primary caregivers of dependents under 7, and individuals medically certified as unable to work.
Statute: 42 USC §1396a; P.L. 119-21 §211
Why AI fails: Models trained before 2025 are unaware of the new work requirement and state that expansion coverage is unconditional.
#3 Housing (HUD)
What is the income limit for a family of 4 applying for Section 8 Housing Choice Vouchers?
Expected: Must not exceed 50% of area median income (AMI) for the jurisdiction. Exact dollar amounts vary by county/metro area and are published annually by HUD.
Statute: 42 USC §1437f(o)(4); 24 CFR §982.201
Why AI fails: Models frequently hallucinate specific dollar amounts instead of citing the AMI-based formula, or confuse Section 8 limits with public housing limits.
#4 Federal Grants
Can a nonprofit with $500K annual revenue apply for HRSA Community Health Center grants?
Expected: Yes. HRSA CHC grants (Section 330) require applicant to be a public or private nonprofit, serve a medically underserved area/population, and provide services regardless of ability to pay. No minimum revenue threshold.
Statute: 42 USC §254b; 2 CFR §200
Why AI fails: Models often fabricate minimum revenue requirements or confuse HRSA eligibility with SBA size standards.
#5 Citation Verification
Is Title 53 of the United States Code a valid citation?
Expected: No. Title 53 does not exist in the USC. Valid titles are 1-52 and 54.
Statute: 1 USC §1 (Organization of USC)
Why AI fails: Models accept any plausible-sounding USC title number without validating against the actual title list.
#6 SNAP / Food Assistance
What are ABAWD time limits under current law?
Expected: Able-bodied adults without dependents ages 18-64 are limited to 3 months of SNAP benefits in a 36-month period unless they meet work requirements of 80 hours/month or participate in qualifying training/workfare.
Statute: 7 USC §2015(o); P.L. 119-21 §201
Why AI fails: Models cite the old age range (18-49 or 18-54) and old work requirement hours (20 hrs/week instead of 80 hrs/month).
#7 Housing (HUD)
Can an applicant be denied Section 8 for having a criminal record?
Expected: PHAs have discretion. Federal law mandates denial only for lifetime registered sex offenders and persons convicted of manufacturing methamphetamine on federally assisted property. Other criminal history is considered at PHA discretion per their admission policies.
Statute: 42 USC §1437n(f); 24 CFR §982.553
Why AI fails: Models either overstate (all felonies are disqualifying) or understate (criminal history cannot be considered) the actual policy.
#8 Federal Grants
What is the single audit threshold for federal grant recipients?
Expected: $1,000,000 in federal awards expended during the fiscal year triggers a single audit requirement under the Uniform Guidance.
Statute: 2 CFR §200.501
Why AI fails: Models commonly cite the old $750,000 threshold from before the 2024 revision.
#9 Citation Verification
Is 42 CFR §438 a valid regulatory citation?
Expected: Yes. 42 CFR Part 438 covers Managed Care requirements for Medicaid.
Statute: 42 CFR §438
Why AI fails: Models may incorrectly flag valid CFR citations or fail to identify the specific regulatory content.
#10 SNAP / Food Assistance
Can SSI and SNAP benefits be received simultaneously?
Expected: Yes. SSI recipients are categorically eligible for SNAP in most states. In California, SSI recipients receive a state supplement in lieu of SNAP (cash-out policy).
Statute: 7 USC §2014(a); 7 CFR §273.2(j)
Why AI fails: Models often state SSI categorically includes SNAP (confusing categorical eligibility with automatic enrollment) or miss the California exception.
#11 Medicaid
Does Medicaid cover undocumented immigrants?
Expected: Federal Medicaid does not cover most undocumented immigrants. Exception: Emergency Medicaid covers emergency medical conditions regardless of immigration status. Some states fund coverage for additional populations using state-only dollars.
Statute: 42 USC §1396b(v); 8 USC §1611
Why AI fails: Models frequently oversimplify to a blanket 'no' or incorrectly state all immigrants qualify under expansion.
#12 Housing (HUD)
What is the Fair Market Rent for a 2-bedroom unit?
Expected: Fair Market Rents are set annually by HUD for each metropolitan area and non-metropolitan county. There is no single national FMR. The applicable FMR depends on the specific geographic area.
Statute: 24 CFR §888.113
Why AI fails: Models hallucinate specific dollar amounts (e.g., '$1,200/month') rather than explaining FMR is area-specific.
#13 Federal Grants
Can a for-profit company receive federal grants?
Expected: Yes, but eligibility depends on the specific program. Many grant programs are limited to nonprofits and government entities, but some (e.g., SBIR/STTR) specifically target for-profit small businesses. The NOFO for each opportunity specifies eligible applicant types.
Statute: 2 CFR §200.1 (definition of 'non-Federal entity'); specific program authorizing statutes
Why AI fails: Models frequently state categorically that for-profits cannot receive grants, ignoring SBIR/STTR and other programs.
#14 Citation Verification
Is P.L. 121-5 a valid Public Law citation?
Expected: No. The 121st Congress has not yet convened. The current Congress is the 119th (2025-2027). Valid Public Law numbers reference congresses 1 through 119.
Statute: 1 USC §112
Why AI fails: Models accept future congress numbers as valid without checking against current congressional session.
#15 SNAP / Food Assistance
What is the asset limit for SNAP eligibility?
Expected: The federal gross income limit is 130% FPL and net income limit is 100% FPL. Most states have adopted broad-based categorical eligibility (BBCE), eliminating the asset test. States without BBCE apply a $2,750 asset limit ($4,250 for elderly/disabled households).
Statute: 7 USC §2014(g); 7 CFR §273.8
Why AI fails: Models either cite the old $2,250 asset limit (pre-2024 adjustment) or fail to mention BBCE eliminates the asset test in most states.
#16 Medicaid
Who is exempt from Medicaid work requirements under P.L. 119-21?
Expected: Exemptions include: pregnant women, primary caregivers of dependents under 7, individuals medically certified as physically or mentally unable to work, full-time students, and individuals already meeting requirements through other programs.
Statute: P.L. 119-21 §211
Why AI fails: Models are unaware of P.L. 119-21 exemptions entirely, or confuse them with Medicaid waiver work requirements from individual states.
#17 Housing (HUD)
Can a PHA deny a Section 8 applicant for owing money to another PHA?
Expected: Yes. PHAs may deny admission if the applicant owes amounts to any PHA, or if the applicant has been terminated from any HCV program. This is at PHA discretion per their written admissions policies.
Statute: 24 CFR §982.552(c)(1)(v)
Why AI fails: Models often incorrectly state debts to other PHAs cannot be considered, or claim federal law prohibits this consideration.
#18 Federal Grants
What is the indirect cost rate for new federal grant recipients without a negotiated rate?
Expected: New recipients without a negotiated indirect cost rate may use the de minimis rate of 15% of modified total direct costs (MTDC) per the Uniform Guidance.
Statute: 2 CFR §200.414(f)
Why AI fails: Models commonly cite the old 10% de minimis rate (pre-2024 revision). The rate was increased to 15% effective October 2024.
#19 Citation Verification
Does a DOI of 10.1234/fake.2025.001 correspond to a real publication?
Expected: This DOI cannot be verified. The prefix 10.1234 is not registered with any known CrossRef member. A valid DOI should resolve to an actual publication record.
Statute: CrossRef DOI resolution standard (ISO 26324)
Why AI fails: Models cannot check DOI validity at inference time and will often state the citation 'appears valid' based on format alone.
#20 Housing (HUD)
Is a 62-year-old disabled veteran eligible for both HUD-VASH and standard Section 8?
Expected: A veteran cannot hold both simultaneously. HUD-VASH combines a HCV with VA case management. If the veteran is already receiving HUD-VASH, they would not separately receive a standard HCV. If not enrolled in HUD-VASH, they may apply for standard Section 8 through the PHA waitlist.
Statute: 42 USC §1437f(o)(19); 24 CFR §982.102
Why AI fails: Models often state veterans can hold both vouchers simultaneously, or incorrectly claim HUD-VASH is separate from the HCV program.
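
Two of the Citation Verification scenarios above (#5 and #14) reduce to simple range checks, sketched below. The valid title set and the current congress number are taken from those scenarios. The function names and the accepted "P.L. N-M" format are assumptions of this example, and the check confirms only that a citation is well-formed and in range, not that the specific law exists.

```python
import re

# From scenario #5: valid USC titles are 1-52 and 54 (there is no Title 53).
VALID_USC_TITLES = frozenset(range(1, 53)) | {54}

# From scenario #14: the current Congress is the 119th. This constant would
# need updating whenever a new Congress convenes.
CURRENT_CONGRESS = 119

# Assumed citation shape: "P.L. <congress>-<law number>".
_PUBLIC_LAW = re.compile(r"^P\.L\.\s*(\d+)-(\d+)$")


def is_valid_usc_title(title: int) -> bool:
    """True if the number is one of the titles listed in scenario #5."""
    return title in VALID_USC_TITLES


def is_valid_public_law(citation: str, current_congress: int = CURRENT_CONGRESS) -> bool:
    """True if the citation parses and its congress number is not in the future."""
    match = _PUBLIC_LAW.match(citation.strip())
    if match is None:
        return False
    congress, law_number = int(match.group(1)), int(match.group(2))
    return 1 <= congress <= current_congress and law_number >= 1
```

Under these assumptions, `is_valid_usc_title(53)` and `is_valid_public_law("P.L. 121-5")` are both rejected, while `is_valid_public_law("P.L. 119-21")` passes the range check.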