The Commonplace

Generic AI assistants decide confidently even when evidence is missing: on unemployment-insurance adjudication cases with insufficient information, they score about 15% accuracy. A structured prompting checklist (SPEC) raises overall accuracy to 89% while correctly deferring decisions when facts are inadequate.

Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication
Mohamed Afane, Emily Robitschek, Derek Ouyang, Daniel E. Ho · April 21, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
In unemployment-insurance adjudication benchmarks derived from Colorado training materials, off-the-shelf RAG approaches are highly presumptuous on inconclusive cases (≈15% accuracy), while a structured prompting method (SPEC) raises overall accuracy to 89% and appropriately defers when evidence is insufficient.

A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking. This challenge is particularly acute in legal applications, where a core task for attorneys, judges, and administrators is to determine whether evidence is sufficient to reach a conclusion. We study this problem in the important setting of unemployment insurance adjudication, which has seen rapid integration of AI systems and where the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually. First, through a collaboration with the Colorado Department of Labor and Employment, we secure rare access to official training materials and guidance to design a novel benchmark that systematically varies in information completeness. Second, we evaluate four leading AI platforms and show that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient. Third, advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases. Fourth, we introduce a structured framework requiring explicit identification of missing information before any determination (SPEC, Structured Prompting for Evidence Checklists). SPEC achieves 89% overall accuracy, while appropriately deferring when evidence is insufficient -- demonstrating that presumptuousness in legal AI is systematic but addressable, and that doing so is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.

Summary

Main Finding

AI systems used for legal adjudication are systematically presumptuous: they often give confident yes/no rulings even when legally operative facts are missing. Standard RAG-based pipelines averaged only 15% accuracy on cases with insufficient information. SPEC (Structured Prompting for Evidence Checklists), a structured prompting framework that requires explicit identification of missing facts before any determination, addresses this presumptuousness and avoids the over-deferral seen with other advanced prompting methods, achieving 89% overall accuracy while appropriately deferring when evidence is insufficient.

Key Points

  • Presumptuousness problem: LLMs commonly produce confident determinations despite evidence gaps; in the unemployment insurance (UI) domain this leads to inappropriate approvals/denials with severe downstream harms (overpayments, wrongful garnishments, litigation, and real-world human costs).
  • New benchmark / dataset: The authors built the first adjudication dataset that systematically controls the presence/absence of legally operative facts, using official Colorado Department of Labor and Employment (CDLE) adjudication guidance and statutes.
  • Baseline performance: Four leading AI platforms, given identical statutory and CDLE guidance via RAG, averaged 15% accuracy on cases where information was insufficient (i.e., models should have deferred).
  • Prompting tradeoff: Advanced prompting techniques (chain-of-thought, multi-agent prompting, etc.) can increase correct deferrals on inconclusive cases but tend to over-correct—i.e., they withhold decisions even when cases are clearly decidable—creating a harmful coverage/accuracy tradeoff.
  • SPEC framework: A multi-stage, structured prompting pipeline that (a) extracts facts from the claim, (b) derives the checklist of legal requirements from retrieved statutes/guidance, (c) explicitly flags which requirements are missing or ambiguous, and only then (d) issues a determination or returns “Inconclusive” with an itemized list of missing facts. SPEC achieves 89% overall accuracy and appropriate deferral behavior.
  • Conceptual insight: Combining neural flexibility (to interpret soft guidance and contextual judgments) with symbolic-style checklists (to enforce explicit gap detection) captures the practical strengths of both approaches for legal adjudication.
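The determination/deferral tradeoff described in the key points can be made concrete with a per-class accuracy breakdown. The sketch below is illustrative only: the gold labels and the two extreme policies are invented toy data, not the paper's benchmark, and `accuracy_by_class` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy gold labels, invented for illustration; this is NOT the paper's data.
GOLD = ["eligible", "ineligible", "inconclusive", "inconclusive", "eligible"]


def accuracy_by_class(gold, predicted):
    """Return accuracy computed separately for each gold label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, predicted):
        totals[g] += 1
        hits[g] += int(g == p)
    return {label: hits[label] / totals[label] for label in totals}


# A presumptuous policy: always issues a yes/no ruling, never defers.
always_decides = ["eligible", "ineligible", "eligible", "ineligible", "eligible"]
# An over-corrected policy: defers on every case, even the clear ones.
always_defers = ["inconclusive"] * len(GOLD)

print(accuracy_by_class(GOLD, always_decides))
print(accuracy_by_class(GOLD, always_defers))
```

In this toy setup each extreme policy is perfect on one kind of case and fails entirely on the other, which is why a single pooled accuracy figure can hide the failure mode the paper targets.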

Data & Methods

  • Domain and collaboration: Unemployment insurance adjudication; close collaboration with the Colorado Department of Labor and Employment provided access to internal adjudication guides and training materials (covering voluntary quit, discharge for misconduct, availability, etc.), enabling realistic operationalization of what facts are legally required.
  • Benchmark construction: Cases were synthesized / curated to create three classes: clearly eligible, clearly ineligible, and inconclusive (missing one or more legally operative facts). The dataset explicitly controls which statutory requirements are present/absent.
  • Retrieval and grounding: All evaluated systems received identical retrieved context (Colorado Revised Statutes, administrative regs, CDLE guidance) via standard retrieval-augmented generation (RAG).
  • Systems evaluated: Four leading AI platforms (unnamed in the excerpt) were evaluated under (a) baseline RAG prompting, (b) advanced prompting techniques (CoT, multi-agent/iterative methods), and (c) SPEC.
  • Metrics and outcomes:
    • Baseline RAG: ~15% accuracy on insufficient-information (inconclusive) cases (i.e., models still made yes/no decisions when they should have deferred).
    • Advanced prompting: Improved recognition of inconclusive cases but caused over-abstention—deferred on many clear cases.
    • SPEC: 89% overall accuracy, with proper deferral on insufficient-information cases and correct determinations when facts were present.
  • SPEC procedure (high level): extract facts F from the query; derive requirement checklist R from legal corpus C; check F against R to identify missing items; if any legally required item is missing, output "Inconclusive" plus the checklist of required facts to collect; otherwise, render a determination with supporting reasons.
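The high-level SPEC procedure above can be sketched as a small gating function. This is a minimal sketch under stated assumptions, not the authors' implementation: in the paper, the facts F and the checklist R come from model calls over the claim and the retrieved legal corpus, whereas here both are plain dictionaries. The requirement names, the `Outcome` type, and the toy decision rule are all hypothetical.

```python
from dataclasses import dataclass, field

INCONCLUSIVE = "Inconclusive"

# Hypothetical checklist R for a voluntary-quit claim. The requirement names
# are illustrative assumptions and are not taken from CDLE guidance.
REQUIREMENTS = {
    "left_for_work_related_reason": "Whether the claimant left for a work-related reason",
    "employer_notified": "Whether the employer was told about the problem",
    "reasonable_alternatives_tried": "Whether the claimant tried reasonable alternatives",
}


@dataclass
class Outcome:
    decision: str
    missing: list = field(default_factory=list)
    reasons: list = field(default_factory=list)


def check_requirements(facts, requirements):
    """Return the requirement keys that the extracted facts F leave unresolved."""
    return [key for key in requirements if facts.get(key) is None]


def decide(facts, requirements=REQUIREMENTS):
    """Defer if any requirement is unresolved; otherwise render a determination."""
    missing = check_requirements(facts, requirements)
    if missing:
        # Gap detected: return the itemized list of facts still to collect.
        return Outcome(INCONCLUSIVE, missing=[requirements[k] for k in missing])
    # Toy rule (an assumption): eligible only if every requirement is satisfied.
    eligible = all(facts[k] for k in requirements)
    reasons = [f"{requirements[k]}: {facts[k]}" for k in requirements]
    return Outcome("Eligible" if eligible else "Ineligible", reasons=reasons)


partial = {"left_for_work_related_reason": True, "employer_notified": None}
print(decide(partial))
```

The point of the structure is that the determination branch is unreachable until the gap check passes, so a missing fact cannot be silently presumed.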

Implications for AI Economics

  • Value vs. risk of automation: The economic case for AI in administrative adjudication depends critically on the system’s ability to abstain appropriately. Presumptuous systems may increase throughput short-term but impose large expected costs via erroneous determinations (appeals, overpayments, litigation, reputational and human costs). SPEC-like abstention reduces those risks and improves net economic value of automation.
  • Productivity tradeoffs and staffing impacts: Proper deferral shifts work composition—fewer mistaken determinations and fewer appeals, but more targeted human fact-finding when the model flags missing items. Agencies could reallocate labor from routine determinations toward higher-value evidentiary collection and adjudicative judgment, potentially increasing throughput without sacrificing accuracy.
  • Incentives for vendors and agencies: Market demand will favor tools that can demonstrate safe abstention and transparent missing-fact signaling (reducing automation bias and enabling effective human oversight). Procurement and regulation should prioritize abstention-capable designs rather than raw accuracy on fully specified cases.
  • Measurement and benchmarking: Standard AI/economic evaluations of legal automation should include metrics for information-sufficiency recognition (coverage conditional on correctness, false-defer and false-decision rates), not only accuracy on fully specified inputs. Benchmarks that ignore insufficiency risk encouraging models that are efficient but hazardous in deployment.
  • Externalities and regulation: Given high-stakes social costs (as in historical UI automation failures), regulators and auditors should require:
    • explicit abstention mechanisms and audit logs of missing-fact checks,
    • standardized benchmarks for insufficiency recognition,
    • transparency of retrieval sources and checklists used to ground determinations.
  • Generalizability and market expansion: While developed for UI adjudication, the SPEC pattern—structured checklists + neural interpretation—applies to other high-stakes bureaucratic and regulatory decisions (social benefits, immigration, licensing). Economically, scalable adoption across domains can yield large system-wide efficiency gains only if abstention behavior is properly handled.
  • Research & investment priorities: From an AI-economics perspective, investing in selective-prediction mechanisms, human–AI workflow design, and dataset/benchmarks that model incomplete information will yield higher social returns than further marginal gains in closed-data accuracy.
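The insufficiency-recognition metrics mentioned under measurement and benchmarking could be computed along the following lines. The definitions here are one reasonable reading, labeled as assumptions: the paper summary does not give formulas, and `deferral_metrics` and its label strings are hypothetical names.

```python
def deferral_metrics(gold, predicted, defer_label="inconclusive"):
    """Compute coverage and deferral error rates for a three-way adjudicator.

    Assumed definitions (not taken from the paper):
      coverage            = share of cases where the system issued a decision
      selective_accuracy  = accuracy on the cases it decided
      false_decision_rate = share of should-defer cases it decided anyway
      false_defer_rate    = share of decidable cases it deferred on
    """
    n = len(gold)
    should_defer = [g == defer_label for g in gold]
    did_defer = [p == defer_label for p in predicted]
    decided = [i for i in range(n) if not did_defer[i]]

    n_defer_gold = sum(should_defer)
    n_decide_gold = n - n_defer_gold
    false_decisions = sum(1 for i in range(n) if should_defer[i] and not did_defer[i])
    false_defers = sum(1 for i in range(n) if not should_defer[i] and did_defer[i])
    correct_decided = sum(1 for i in decided if gold[i] == predicted[i])

    return {
        "coverage": len(decided) / n if n else 0.0,
        "selective_accuracy": correct_decided / len(decided) if decided else 0.0,
        "false_decision_rate": false_decisions / n_defer_gold if n_defer_gold else 0.0,
        "false_defer_rate": false_defers / n_decide_gold if n_decide_gold else 0.0,
    }


# Toy labels, invented for illustration only.
gold = ["eligible", "ineligible", "inconclusive", "inconclusive"]
pred = ["eligible", "inconclusive", "eligible", "inconclusive"]
print(deferral_metrics(gold, pred))
```

Reporting the false-decision and false-defer rates side by side keeps both failure modes visible: a presumptuous system shows up in the first, an over-cautious one in the second.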

Summary statement: The paper shows that the economic and social benefits of AI in legal-administrative settings depend much more on knowing when not to decide than on raw decision accuracy. SPEC provides a practical, empirically validated design pattern that restores safe abstention and improves the economics of deployment by reducing costly misdecisions while preserving beneficial automation.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The study uses a carefully constructed benchmark derived from official state training materials and compares multiple baseline and advanced prompting approaches across four leading platforms, giving credible within-sample evidence about model behavior; however, it lacks real-world deployment/field validation, external replication across jurisdictions and models, and does not link model behavior to downstream economic outcomes, limiting causal or external claims.

Methods Rigor: medium. Strengths include collaboration with the state agency, access to ground-truth training material, systematic variation of information completeness, multiple baselines (RAG) and interventions (advanced prompting, SPEC), and clear accuracy metrics; weaknesses include no reported sample size or statistical uncertainty in the summary, potential benchmark design choices that could bias results, evaluation limited to four platforms and offline cases rather than live adjudication, and unclear robustness checks.

Sample: A novel benchmark constructed from official Colorado Department of Labor and Employment training materials and guidance, containing adjudication cases that systematically vary information completeness (clear, inconclusive/missing evidence, etc.); evaluated on four leading AI platforms using standard RAG retrieval, advanced prompting variants, and the proposed SPEC structured prompting; labels and correctness judged per the state's adjudication guidance.

Themes: human_ai_collab, governance, adoption

Generalizability:

  • Single-jurisdiction (Colorado) training materials may not reflect other states' or countries' adjudication rules and procedures.
  • Benchmark cases are simulated/derived from guidance and may not capture the full heterogeneity and noise of real administrative records or hearing transcripts.
  • Only four AI platforms were evaluated; results may differ for other LLMs, different retrieval/indexing setups, or future model versions.
  • Offline benchmark evaluation may not reflect model behavior under deployment conditions (user interactions, chained reasoning, adversarial inputs).
  • Performance may depend on benchmark construction and labeler decisions; ecological validity to actual decision bottlenecks is uncertain.

Claims (8)

  • Claim: A well-known limitation of AI systems is presumptuousness: the tendency of AI systems to provide confident answers when information may be lacking.
    • Decision Quality · Direction: negative · Confidence: high
    • Outcome: tendency to provide confident answers when information is lacking (presumptuousness)
    • Details: 0.18
  • Claim: Unemployment insurance adjudication has seen rapid integration of AI systems and the question of additional fact-finding poses the most significant bottleneck for a system that affects millions of applicants annually.
    • Social Protection · Direction: negative · Confidence: high
    • Outcome: scale of impact (number of applicants affected) and fact-finding bottleneck in adjudication
    • Details: millions of applicants annually; 0.18
  • Claim: Through a collaboration with the Colorado Department of Labor and Employment, we secured access to official training materials and guidance to design a novel benchmark that systematically varies information completeness.
    • Other · Direction: positive · Confidence: high
    • Outcome: creation of a benchmark varying information completeness
    • Details: 0.18
  • Claim: Evaluation of four leading AI platforms shows that standard RAG-based approaches achieve an average of only 15% accuracy when information is insufficient.
    • Decision Quality · Direction: negative · Confidence: high
    • Outcome: accuracy on cases where information is insufficient (inconclusive cases)
    • Details: n=4; 15% accuracy; 0.18
  • Claim: Advanced prompting methods improve accuracy on inconclusive cases but over-correct, withholding decisions even on clear cases.
    • Decision Quality · Direction: mixed · Confidence: high
    • Outcome: accuracy on inconclusive cases and rate of withholding/deferral on clear cases
    • Details: 0.18
  • Claim: We introduce SPEC (Structured Prompting for Evidence Checklists), a structured framework requiring explicit identification of missing information before any determination.
    • Other · Direction: positive · Confidence: high
    • Outcome: framework implementation that forces evidence-checklist and missing-information identification
    • Details: 0.18
  • Claim: SPEC achieves 89% overall accuracy, while appropriately deferring when evidence is insufficient.
    • Decision Quality · Direction: positive · Confidence: high
    • Outcome: overall accuracy and appropriate deferral on insufficient-evidence cases
    • Details: 89% overall accuracy; 0.18
  • Claim: Presumptuousness in legal AI is systematic but addressable, and addressing it is a necessary step towards systems that reliably support, rather than supplant, human judgment wherever decisions must await sufficient evidence.
    • Decision Quality · Direction: positive · Confidence: high
    • Outcome: reliability of AI systems to support human judgment under insufficient evidence conditions
    • Details: 0.03

Notes