You Cannot Trust a Single AI Model for Detailed Research.

Jason Pistulka
June 11, 2026

I spent several days writing a compliance article about AI in hiring. The subject matter was specific: the legal requirements employers face when deploying AI tools in their recruiting process. New York City Local Law 144. Illinois AIVIA and the Human Rights Act amendments. Colorado SB26-189. California’s overlapping CPPA and Civil Rights Council regulations. Connecticut SB 5, which was signed into law three days before I published.

To research it, I used four AI models simultaneously: paid Claude, paid ChatGPT, free Gemini, and free Grok.

Every single one of them got something wrong. No single model caught everything. Some errors survived two rounds of AI review before a third model caught them. One correction introduced a new error. And throughout all of it, every model delivered its answers in the same confident, authoritative tone — whether it was right or not.

This article documents exactly what happened: which model produced which errors, which model caught which corrections, and what the cumulative picture tells you about using AI for high-stakes professional research.

Why This Matters Beyond One Article

Before getting into the specifics, it is worth establishing why the stakes are high enough to document this carefully.

The article I was writing covered AI hiring compliance law. The laws in question carry real penalties. NYC Local Law 144 fines stack by day and by candidate. Illinois has enforcement mechanisms through multiple state agencies. Colorado’s new ADMT law requires three-year record retention and 30-day adverse outcome explanations. California’s CPPA regulations require employers to maintain a human fallback track for candidates who opt out of AI screening.

An employer who reads a compliance article, relies on the legal details, and acts on them faces real consequences if those details are wrong. Getting the Illinois demographic reporting date wrong by two years means an employer thinks their compliance gap starts in 2024 when it actually starts in 2022. Getting the NYC audit requirements wrong means an employer believes a vendor’s general audit covers their obligations when it may not. Getting Colorado wrong means an employer is planning for a law that no longer exists in the form described.

These are not abstract errors. They have operational and financial consequences. And they were all in my first draft.

The Starting Point: What Claude Produced

The original article draft was written using paid Claude. It was well-structured, readable, and covered the right jurisdictions. It was also wrong in several important ways.

The NYC penalty figure was incorrect

The draft stated “$375 to $1,500 per violation, per day.” The correct figure is $500 for a first violation and $500 to $1,500 for subsequent violations. Not a catastrophic error on its own, but it made the cost model in the article mathematically wrong and would have been immediately spotted by any compliance professional.
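To see how the wrong floor distorts a cost model, here is a minimal Python sketch of the range arithmetic. It assumes, as the ChatGPT review later in this article notes, that each day of noncompliant use counts as a separate violation. The function name and the 30-day scenario are hypothetical, and the dollar figures are only the ones quoted above. Treat it as an illustration of the arithmetic, not a statement of how DCWP would actually assess penalties.

```python
def penalty_range(violations: int,
                  first: int = 500,
                  later_min: int = 500,
                  later_max: int = 1500) -> tuple[int, int]:
    """Return (low, high) exposure in dollars for a number of violations.

    The first violation is charged at `first`; every later violation
    falls somewhere between `later_min` and `later_max`.
    """
    if violations <= 0:
        return (0, 0)
    later = violations - 1
    return (first + later * later_min, first + later * later_max)


# Hypothetical scenario: 30 days of noncompliant use, one violation per day.
low, high = penalty_range(30)
print(low, high)  # 15000 44000

# The draft's incorrect $375 floor would have put the low end at:
print(30 * 375)  # 11250
```

Under these assumptions the low end moves from $11,250 to $15,000, which is the kind of gap a compliance professional would notice immediately.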

The NYC audit requirement was overstated

The draft stated the audit “must be conducted on your specific deployment of the tool” and that a vendor’s general product audit “does not satisfy this requirement.” This was too absolute. DCWP guidance allows a vendor-coordinated audit to satisfy the requirement in certain circumstances — specifically where an independent auditor conducted it, the required calculations and summary are in place, and the employer either contributed historical data or is using the AEDT for the first time. The strategic point — that vendor marketing claims are not compliance — is correct. The specific legal claim was overstated.

The Illinois AIVIA demographic reporting date was wrong

The draft described the demographic reporting requirement as a “2024 amendment.” The requirement was added by Public Act 102-47 and took effect January 1, 2022. The confusion likely stems from the Illinois Department of Commerce publishing its first AI Demographic Data Analyses Report in 2024. An employer reading the article would believe their compliance gap starts in 2024 when it actually starts two years earlier.

The Illinois law citation was wrong

The draft referenced “Illinois SB 2930” as the operative AI employment law. The correct citation is Public Act 103-0804, enacted through HB 3773, amending the Illinois Human Rights Act.

The Illinois impact assessment requirement was overstated

The draft stated that Illinois law requires employers to “conduct and document impact assessments of AI tools used in employment decisions before deployment.” Illinois Public Act 103-0804 does not mandate this as a statutory requirement. It prohibits discriminatory AI use and requires notice. Impact assessments are strongly recommended by employment attorneys as a practical defense measure, but they are not a legal mandate.

The Colorado section described a law that no longer existed

The draft described Colorado SB 205 as effective February 1, 2026, with broad high-risk AI impact assessment requirements. Colorado’s original SB 24-205 was repealed and reenacted by SB26-189, signed May 14, 2026 — three weeks before publication. The new law uses ADMT framing, pushes the effective date to January 1, 2027, and replaces the sweeping impact assessment mandate with targeted notice, human review rights, and record-keeping obligations.

The California section was thin and partially outdated

The draft mentioned California briefly under existing discrimination law. It did not address the California Privacy Protection Agency’s ADMT regulations, effective January 1, 2026, which give California applicants the right to opt out of automated screening entirely and require employers to maintain a human fallback track. This is arguably the most operationally disruptive AI hiring requirement currently active in any jurisdiction — and it was missing.

The ADA was missing entirely

The federal section covered Title VII and FCRA but did not mention the Americans with Disabilities Act. AI video interview tools that analyze facial expressions, speech patterns, or physical movement face specific ADA scrutiny. This was a gap, not a minor omission.

The FCRA framing lacked the strongest available authority

The FCRA section was directionally correct but did not reference CFPB Circular 2024-06, which directly addresses background dossiers, algorithmic scores, and third-party consumer reports used for hiring decisions. That circular is the strongest current federal authority on this question.

Round One: What Grok Caught

The first external model review was run through Grok. It caught two material errors.

The NYC penalty figure. Grok correctly identified that $375 was not the right number, confirmed the $500 first-violation floor, and noted that the cost model math needed to be updated accordingly.

The Illinois AIVIA demographic reporting date. Grok correctly identified that the requirement was added effective January 1, 2022, not 2024, and traced the confusion to the Department of Commerce’s 2024 published report.

Grok did not catch the overstated NYC audit requirement, the wrong Illinois law citation, the overstated impact assessment mandate, the outdated Colorado section, the missing California CPPA regulations, the missing ADA section, or the FCRA framing gap. At this point, two of the nine first-draft errors were corrected. Seven material issues remained.

Round Two: What ChatGPT Caught

The second external review was run through paid ChatGPT. This was the most detailed and best-sourced review in the process, and it caught the largest number of remaining errors.

NYC audit requirement. Correctly identified the overstatement and provided the DCWP FAQ as the authoritative source for the multi-employer and first-use nuances.

NYC notice workflow. Identified that standing website notice allows AEDT use 10 business days after posting regardless of when specific candidates apply. The per-applicant delay is not mandatory if notice is structured correctly.

NYC penalty math methodology. Flagged that penalties accrue by day of noncompliant use with notice failures accruing separately — not as a per-candidate multiplication.

Illinois AIVIA demographic reporting trigger. Narrowed the obligation to employers who rely solely on AI video analysis to determine in-person interview selection — a meaningful distinction from all AI video interview users.

Illinois private right of action. Flagged that AIVIA does not contain an express private right of action, and recommended removing or qualifying the claim.

Illinois law citation. Confirmed the correct citation is Public Act 103-0804 through HB 3773, not SB 2930.

Illinois impact assessment requirement. Confirmed this was not a statutory mandate and recommended reframing as a strongly recommended governance practice.

Colorado section. Confirmed the entire section needed to be rewritten to reflect SB26-189, including the ADMT framing, January 1, 2027 effective date, 30-day post-adverse-outcome explanation requirement, three-year record retention, and developer documentation requirements.

California section. Flagged that the Civil Rights Council regulations and the CPPA ADMT regulations are two separate frameworks, and that the opt-out right was the more operationally significant missing piece.

ADA. Added as a required element of the federal section, with specific reference to AI video interview tools and timed assessments as areas of heightened scrutiny.

FCRA framing. Provided CFPB Circular 2024-06 as the authoritative anchor and tightened the framing around what actually triggers FCRA obligations.
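The NYC notice timing flagged above comes down to date arithmetic, which can be sketched in Python. This is a minimal illustration under stated assumptions: the helper name and the posting date are hypothetical, only weekends are skipped (public holidays are ignored), and whether AEDT use may begin on the computed date or only after it is a question for counsel, not something this sketch settles.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date that falls `days` business days after `start`.

    Saturdays and Sundays are skipped; holidays are NOT handled.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


posted = date(2026, 6, 1)  # hypothetical posting date (a Monday)
print(add_business_days(posted, 10))  # 2026-06-15
```

Because the clock runs from the standing website notice rather than from each application, the calculation is done once per posting, not once per candidate.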

A Note on Corrections That Introduced New Errors

One of the more instructive moments in this process was when a correction itself became an error.

After receiving ChatGPT’s feedback, I revised the Illinois section to more accurately describe Public Act 103-0804. In doing so, I wrote that “employers must conduct and document impact assessments of AI tools used in employment decisions before deployment” as part of the statutory requirements. This was incorrect. The law prohibits discriminatory outcomes and requires notice. It does not mandate impact assessments.

This error did not come from the original draft. It was introduced during the correction process, likely because I was synthesizing feedback from multiple sources and misattributed a recommended practice as a statutory requirement.

Running multiple models does not just help you catch original errors. It also helps you catch errors you introduce while fixing other errors.

Gemini caught it in the next review pass. The error lived in the article for one full revision cycle before it was identified.

Round Three: What Gemini Caught

The third external review was run through free Gemini. By this point, the most significant errors had been corrected. Gemini’s contribution was different in character — less about catching wrong information and more about identifying what was missing.

Connecticut SB 5. Flagged that Connecticut SB 5 had been signed into law on May 29, 2026 — three days before publication. Core provisions take effect October 1, 2026. The law regulates automated employment-related decision technology and includes a WARN Act AI disclosure requirement for mass layoff notices. None of this was in the article.

California CPPA opt-out right. Provided the specific operational detail that California applicants have the right to opt out of automated screening entirely, and that employers must maintain a human fallback track for opt-out candidates. A fully automated recruiting funnel with no human alternative is non-compliant for California roles.

New Jersey empirical validation burden. Added that under New Jersey’s December 2025 rules, if an automated screening tool creates a demographic disparity, the employer carries the burden to prove job-relatedness via empirical validation studies. Vendor assurances are explicitly not sufficient.

Maryland facial recognition restriction. Added Maryland’s 2020 law requiring formal written applicant consent before any facial recognition or behavioral AI can analyze a video interview recording.

The patchwork framing. Contributed the analytical frame now in the article’s closing: a single software trigger can simultaneously require a 10-business-day notice window in NYC, a human-only review track in California, written consent in Illinois and Maryland, a documented appeals process in Colorado, and a WARN Act disclosure in Connecticut.
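The patchwork can be pictured as a simple lookup, sketched here in Python. The table holds only the obligations named in the paragraph above, in the article's own shorthand. The dictionary and function names are hypothetical, and the mapping is neither complete nor authoritative.

```python
# Obligations as summarized in this article; not a complete or authoritative list.
OBLIGATIONS = {
    "NYC": "10-business-day notice window",
    "California": "human-only review track",
    "Illinois": "written consent",
    "Maryland": "written consent",
    "Colorado": "documented appeals process",
    "Connecticut": "WARN Act disclosure",
}


def obligations_for(jurisdictions: list[str]) -> dict[str, str]:
    """Return the listed obligation for each jurisdiction.

    Jurisdictions outside the table are flagged rather than assumed
    to carry no obligations.
    """
    return {
        j: OBLIGATIONS.get(j, "not covered here: check primary sources")
        for j in jurisdictions
    }


# One requisition open to candidates in three places.
for place, duty in obligations_for(["NYC", "California", "Texas"]).items():
    print(f"{place}: {duty}")
```

Flagging unknown jurisdictions explicitly is deliberate: an empty result would read as "nothing required", which is exactly the kind of confident wrong answer this article is about.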

Round Four: Grok’s Final Pass

After all corrections were integrated, a final pass through Grok confirmed no material inaccuracies remained. Grok noted two minor refinements — the Illinois demographic reporting scope and additional specificity on ADA scrutiny of video tools — both of which were incorporated.

The article was published after this fourth review cycle.

The Error Log at a Glance

Model	Tier	Errors Caught	Errors Missed / Introduced
Claude	Paid	None (drafted original article)	9 material errors in first draft
Grok	Free	2 errors: NYC penalty figure, Illinois date	7 material errors remained
ChatGPT	Paid	11 issues: audit nuance, notice workflow, penalty math, IL reporting scope, private right of action, law citation, impact assessment, Colorado rewrite, California gap, ADA, FCRA	1 new error introduced during the correction that followed
Gemini	Free	Connecticut SB 5, California opt-out right, NJ burden shift, Maryland restriction, patchwork framing; also caught the correction error from the prior pass	None recorded
Grok	Free	Final verification, 2 minor refinements	Zero remaining material errors
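One way to read the log is as a set of findings per reviewer. The Python sketch below copies the item labels from the table and checks whether any single reviewer covers the combined list. It is a rough illustration: overlapping items (for example the two California entries) are treated as distinct, Grok's two final-pass refinements are left out, and the variable and function names are hypothetical.

```python
# Findings per reviewer, using the labels from the error log table.
findings = {
    "Grok": {"NYC penalty figure", "Illinois reporting date"},
    "ChatGPT": {
        "NYC audit nuance", "NYC notice workflow", "NYC penalty math",
        "Illinois reporting scope", "Illinois private right of action",
        "Illinois law citation", "Illinois impact assessment",
        "Colorado rewrite", "California gap", "ADA", "FCRA",
    },
    "Gemini": {
        "Connecticut SB 5", "California opt-out right", "NJ burden shift",
        "Maryland restriction", "patchwork framing",
        "correction error from prior pass",
    },
}

# Everything any reviewer raised.
all_findings = set().union(*findings.values())


def missed_by(model: str) -> set[str]:
    """Findings in the combined log that this reviewer did not raise."""
    return all_findings - findings[model]


for model in findings:
    print(f"{model} alone would have missed something: {bool(missed_by(model))}")
```

Every line prints True: with this log, dropping any one reviewer leaves at least one finding uncovered.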

What This Tells You About AI Research

Confidence is not accuracy. Every model delivered its answers in the same authoritative tone. There was no signal in the output indicating whether the model was correct or uncertain. The wrong Illinois law citation read exactly like the correct one. The outdated Colorado section read exactly like current law. If you do not know what you are looking for, you will not know when the answer is wrong.

Paying more does not guarantee accuracy. The original draft with material errors came from paid Claude. The initial correction pass, run through free Grok, missed seven issues. Paid ChatGPT produced the most thorough review, but free Gemini caught things ChatGPT missed. Free Grok caught errors that a paid model had produced. Tier and price do not correlate with reliability on specific factual claims.

Models have different research strengths. ChatGPT was strongest on systematic legal analysis — going through the article clause by clause. Gemini was strongest on current events — catching legislation signed three days before publication. Grok was strongest on specific factual verification. Claude was strongest on structure, synthesis, and drafting. Using each for what it does best produces better results than using any single model for everything.

Corrections can introduce errors. One error in the revised article was not in the original draft. It appeared during the correction process. Multi-model review is not just useful for catching original errors. It is necessary for catching errors that the revision process creates.

Primary sources still matter. Several of the most important corrections were anchored to specific primary sources: the DCWP FAQ, Illinois General Assembly Public Act pages, Colorado General Assembly bill text, CFPB Circular 2024-06. AI models that cited specific sources were more likely to be accurate and more verifiable.

The Practical Workflow That Actually Works

Based on this experience, here is the research workflow for high-stakes professional content.

1. Draft with one model. Pick the model strongest at synthesis and structure for your content type.
2. Review with a second model focused on factual verification. Ask it to check every specific legal claim, citation, date, and statutory requirement against current sources. Ask it to cite sources for corrections.
3. Review with a third model focused on gaps. Ask what is missing, what has changed recently, and what the article gets wrong about the current state of the topic.
4. Run a final pass with a fourth model. Verify the corrections from prior passes and flag anything overstated or imprecise.
5. Verify the highest-stakes claims against primary sources. For legal and compliance content, AI models are research accelerators, not replacements for primary source verification. Statute text, agency guidance, and regulatory FAQs exist and are searchable.
6. Get professional review before publishing anything with compliance implications. Nothing any AI model produces about specific legal requirements should be treated as legal advice without qualified counsel review.
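The model-driven steps of this workflow can be sketched as a small Python pipeline. Everything here is an assumption made for illustration: a "model" is just a function that takes a prompt and returns text, no real vendor API is called or implied, and the names `Model`, `ReviewLog`, `REVIEW_PROMPTS`, and `run_workflow` are hypothetical. The sketch only collects review notes. Revising the draft between passes, checking primary sources, and getting counsel review remain human steps.

```python
from dataclasses import dataclass, field
from typing import Callable

# Placeholder type: any function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class ReviewLog:
    draft: str
    notes: list[tuple[str, str]] = field(default_factory=list)  # (focus, response)


# One prompt per review pass, paraphrasing the workflow steps above.
REVIEW_PROMPTS = [
    ("facts", "Check every legal claim, citation, date, and statutory "
              "requirement against current sources. Cite sources for corrections."),
    ("gaps", "What is missing, what has changed recently, and what does this "
             "get wrong about the current state of the topic?"),
    ("final", "Verify the corrections from prior passes and flag anything "
              "overstated or imprecise."),
]


def run_workflow(topic: str, drafter: Model, reviewers: list[Model]) -> ReviewLog:
    """Draft with one model, then collect one review per reviewer, in order."""
    if len(reviewers) != len(REVIEW_PROMPTS):
        raise ValueError("expected one reviewer per review pass")
    log = ReviewLog(draft=drafter(f"Draft an article on: {topic}"))
    for model, (focus, prompt) in zip(reviewers, REVIEW_PROMPTS):
        log.notes.append((focus, model(f"{prompt}\n\n{log.draft}")))
    return log


# Usage with stub functions standing in for real models.
log = run_workflow(
    "AI hiring compliance",
    drafter=lambda p: "DRAFT TEXT",
    reviewers=[lambda p: "fact notes", lambda p: "gap notes", lambda p: "final notes"],
)
for focus, note in log.notes:
    print(focus, "->", note)
```

Keeping each pass's notes labeled by focus makes it easier to trace which reviewer raised which correction, which is what the error log earlier in this article depends on.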

The Bottom Line

I am an advisor who works at the intersection of AI, HR technology, and organizational governance. I spend significant professional time evaluating AI tools and helping organizations deploy them responsibly. I used four AI models with genuine research effort, ran multiple correction cycles, and still needed all four to catch everything that needed catching.

If that is the experience of someone who does this professionally, the gap between what organizations think they are getting from AI research and what they are actually getting should concern everyone.

AI is a powerful research accelerator. It is not a reliable single source of truth on anything where accuracy has real consequences. The multi-model approach described here is not a workaround. It is the minimum viable research process for high-stakes professional content.

Use one model to draft. Use three more to challenge it. Verify the critical claims yourself. And never mistake confident prose for accurate information.

Jason Pistulka is the Founder and Principal of StratTech Talent Consulting and Advisory LLC, a Tennessee-based boutique consulting firm focused on enterprise Talent Acquisition strategy, HR technology architecture, recruiting operations, and AI governance.

Nothing in this article constitutes legal advice. Employers should consult qualified employment counsel before making compliance decisions related to AI hiring tools.

Tags: AI for legal research, AI hallucination risk, AI hiring compliance, AI research accuracy, ChatGPT vs Claude vs Gemini vs Grok, HR Technology, multi-model AI workflow, prompt engineering best practices
