Evidence (6507 claims)

- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Productivity
Existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions, and do not capture forecasting over continuous quantities.
Literature/benchmark critique asserted in the paper (argument that current benchmarks focus on simple judgmental formats and miss continuous numerical forecasting capabilities).
Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Empirical observations from QuantSightBench evaluation showing model calibration performance as a function of magnitude (paper statement noting sharp degradation and overconfidence at extremes).
The top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all fall at least 10 percentage points short of the 90% coverage target.
Reported empirical coverage percentages from evaluation on QuantSightBench for the listed models (paper provides these percentage values).
None of the 11 evaluated frontier and open-weight models achieves the 90% coverage target.
Empirical evaluation on the newly introduced QuantSightBench benchmark across 11 frontier and open-weight models; models were assessed on empirical coverage of prediction intervals versus a 90% target (paper statement).
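To make the coverage metric concrete, here is a minimal sketch of how empirical coverage against a 90% target can be computed. Everything below (the data, interval widths, and function name) is illustrative and assumed, not taken from QuantSightBench:

```python
import numpy as np

def empirical_coverage(lower, upper, truth):
    """Fraction of realized values that fall inside the model's intervals."""
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    return float(np.mean((truth >= lower) & (truth <= upper)))

# Illustrative data only: realized quantities plus a model's 90% intervals.
rng = np.random.default_rng(0)
truth = rng.normal(0.0, 1.0, size=1_000)
# An overconfident model emits intervals narrower than the true 90% band
# (roughly ±1.645 standard deviations for a standard normal).
lower = np.full_like(truth, -1.2)
upper = np.full_like(truth, 1.2)

print(f"coverage: {empirical_coverage(lower, upper, truth):.1%} vs. 90% target")
```

Coverage landing below the nominal 90% is exactly the overconfidence pattern the claims above describe.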
The study identified significant implementation challenges including algorithmic bias, digital divide concerns, data privacy risks, and low technology readiness among HR teams in Tier 2 cities.
Synthesis of qualitative case study findings from 4 organizations plus survey responses (N=150) reporting barriers and risks encountered during adoption.
Current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents and compound over long interactions.
Qualitative and quantitative analysis of errors observed across the DELEGATE-52 experiments (19 LLMs) showing sparse, high-severity, and silently introduced errors that accumulate over long workflows.
Degradation severity worsens with larger documents, longer interactions, and the presence of distractor files.
Additional experiments and analyses varying document size, interaction length, and presence of distractor files reported in the paper showing increased degradation under these conditions.
Agentic tool use does not improve performance on DELEGATE-52.
Additional experiments reported in the paper that compare plain LLM delegation vs. agentic tool-using configurations on DELEGATE-52 and find no performance improvement from agentic tool use.
Even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows.
Reported results from the experiment evaluating 19 LLMs on DELEGATE-52; these named models are highlighted and an average corruption fraction (25%) is reported at the end of long workflows.
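The paper's DELEGATE-52 scoring procedure is not spelled out in this excerpt; as a hedged illustration only, one simple way to quantify a corruption fraction is a line-level diff between the reference document and the model-maintained copy (the helper name and sample strings below are invented):

```python
import difflib

def corruption_fraction(reference: str, delegated: str) -> float:
    """Share of reference lines that no longer survive unchanged.

    An illustrative metric, not the paper's: it counts reference lines
    absent from the matching blocks of the delegated document.
    """
    ref_lines = reference.splitlines()
    matcher = difflib.SequenceMatcher(None, ref_lines, delegated.splitlines())
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / max(len(ref_lines), 1)

before = "alpha\nbeta\ngamma\ndelta"
after = "alpha\nbeta\nGAMMA\ndelta"   # one silently corrupted line
print(f"{corruption_fraction(before, after):.0%} corrupted")  # prints 25%
```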
Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation.
Large-scale experiment reported in the paper evaluating 19 LLMs on DELEGATE-52 long delegated workflows; observed document degradation across models.
Underreliance on AI might deprive software developers of potential gains in productivity and quality.
Stated in the paper and motivated by themes from twenty-two developer interviews indicating missed benefits when developers underuse LLM tools.
Overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills).
Paper explicitly states this risk and grounds the discussion in findings from twenty-two developer interviews (qualitative evidence and participant-reported concerns).
Small and medium-sized practices face challenges of skill gaps and resource constraints that hinder adoption of technology and data analytics.
Consistent findings across included studies highlighting barriers in small and medium-sized practices (SMPs).
AI adoption is reinforcing existing structural disparities within the BRICS bloc, creating a two‑tier productivity hierarchy (China & India vs. Brazil, Russia & South Africa).
Observed divergence in TFP trajectories and differing links between AI indicators and TC/EC across the five BRICS economies; comparative analysis shows stronger frontier-shifting effects in China and India and weaker or negative effects in the other three economies.
Brazil, Russia, and South Africa experience stagnation or decline in both efficiency and technological advancement over 2005–2023.
Malmquist TFP decomposition (EC and TC) for each BRICS economy showing flat or negative trends in EC and TC for Brazil, Russia, and South Africa during 2005–2023.
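For reference, the standard (Färe et al.) Malmquist decomposition this evidence draws on splits TFP change into efficiency change (EC, catch-up to the frontier) and technical change (TC, movement of the frontier itself); the notation below is the textbook form, not copied from the paper:

$$
M^{t,t+1} \;=\; \underbrace{\frac{D^{t+1}\!\left(x^{t+1}, y^{t+1}\right)}{D^{t}\!\left(x^{t}, y^{t}\right)}}_{\text{EC}} \;\times\; \underbrace{\left[\frac{D^{t}\!\left(x^{t+1}, y^{t+1}\right)}{D^{t+1}\!\left(x^{t+1}, y^{t+1}\right)} \cdot \frac{D^{t}\!\left(x^{t}, y^{t}\right)}{D^{t+1}\!\left(x^{t}, y^{t}\right)}\right]^{1/2}}_{\text{TC}}
$$

where $D^{t}$ is the distance function measured against the period-$t$ frontier. Flat or negative EC and TC together are what the stagnation claim above refers to: no catch-up to the frontier and no outward movement of the frontier itself.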
Despite rapid progress, a key problem remains: none of these systems can build complex 3D assemblies with moving parts. For example, no existing system can build a piston, a pendulum, or even a pair of scissors.
Negative capability claim based on the authors' survey of prior work (asserted limitation); no systematic benchmark or exhaustive evaluation numbers provided in the excerpt.
Effective clarification remains challenging in software engineering tasks as not all missing information is equally valuable, and questions must target information users can realistically provide.
Analytic claim supported by the paper's empirical study of clarification in real software engineering tasks (methods mentioned: quantifying types of information affecting task success and simulated-user question-answering; no sample size given in the abstract).
Large language models remain confined to linguistic simulation rather than grounded understanding.
Conceptual assertion in the paper arguing limits of current models; no empirical tests or measurements reported.
Fluency is not reliability: without structures that stabilise both human and model reasoning, AI cannot be trusted or governed where it matters most.
Central thesis/claim of the paper; normative argument synthesising the paper's observations and proposals rather than an empirically tested finding provided here.
Humans often mistake fluency for reliability: when a model responds smoothly, users tend to trust it, even when both model and user are drifting together.
Behavioral/psychological assertion in the paper referencing human interaction patterns with fluent outputs; no experimental data or sample size reported in this paper excerpt.
LLMs produce fluent outputs even when their internal reasoning has drifted; a confident answer can conceal uncertainty, speculation, or inconsistency, and small changes in phrasing can lead to different conclusions.
Conceptual/observational claim presented in the paper; no original empirical test or sample size reported here.
The opacity, fluency, and low-friction interaction patterns of LLMs obscure the boundary between human and machine contribution, leading users to infer competence from outputs rather than from the processes that generate them.
Theoretical argument grounded in prior literature on automation bias and cognitive offloading; presented as explanatory mechanism in the paper rather than an empirically tested causal estimate.
The paper introduces the 'LLM fallacy,' a cognitive attribution error in which individuals misinterpret LLM-assisted outputs as evidence of their own independent competence, producing a systematic divergence between perceived and actual capability.
Conceptual/theoretical claim and formal definition offered in the paper; no empirical validation reported in the abstract.
Most Sub-Saharan African states still lack the institutional frameworks needed to turn these innovations into sustainable development.
Comparative policy analysis stated in the paper; no quantitative sample size or formal survey data reported in the excerpt.
Efficiency (e.g., minimizing time and cost with AI-only planning) does not equal effectiveness: optimizing for efficiency can erode team cognition and reduce decision quality.
Synthesis of experimental quantitative results (time/cost vs. risk capture and rework) and qualitative assessment indicating that AI-driven efficiency can come at the expense of risk awareness and planning robustness.
Human-only planning incurs substantial overhead.
Same controlled experiment reporting that human-only planning produced higher time and cost overheads relative to AI-assisted approaches.
AI-only planning increases rework due to unstated assumptions.
Experiment measured rework rates and accompanying qualitative analysis attributing increased rework in the AI-only condition to unstated assumptions made by algorithmic planning.
AI-only planning significantly degrades risk capture rates.
Same controlled three-condition experiment on a live client deliverable; paper reports measures/qualitative indicators of risk capture rates and states degradation for AI-only condition.
Two wrong-entity mutations escaped all consumer-contributed layers; only disambiguation and confirmation mechanisms intercept this class.
Empirical observation during the 25 scenario trials spanning seven failure families in the deployed multi-tenant evaluation; the paper reports two instances of wrong-entity mutations that were not blocked by consumer-contributed protections.
The unconstrained AI configuration completed only 17 of 25 tasks.
Same evaluation described above: deployed multi-tenant enterprise application, 25 scenario trials comparing unconstrained AI (safety layers disabled) against bounded autonomy and manual operation.
Infrastructure constraints, particularly in developing countries, limit AI adoption in auditing.
Thematic analysis of reviewed articles noting infrastructure limitations (e.g., ICT infrastructure) in developing-country contexts.
Limitations in auditor competencies (skills and training) hinder effective AI adoption in auditing.
Thematic findings across the sample of articles report auditor competency gaps as a challenge to AI implementation.
Ethical and data privacy concerns are persistent challenges to AI implementation in auditing.
Recurring theme in the reviewed literature identified via thematic analysis; papers cite ethics and privacy as obstacles.
Several challenges persist for AI adoption in auditing, including high technology investment costs.
Thematic analysis of barriers reported across the 15 articles highlighting cost as a recurrent challenge.
Conventional methods that use AI predictions as direct proxies for true labels can be inefficient or unreliable when the relationship between AI outputs and human labels is weak or misspecified.
The paper's motivation and critique of standard proxy-using approaches; asserted in the abstract as background rationale for the proposed method.
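As a toy illustration of this failure mode (all distributions below are invented; this is not the paper's setup or its proposed estimator), treating weakly correlated AI scores as if they were human labels biases a simple prevalence estimate, while the small human-labeled subset alone stays unbiased but noisy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
human = rng.random(n) < 0.30              # true labels, 30% prevalence
# AI output only weakly tracks the human label: it flips 35% of the time.
ai = np.where(rng.random(n) < 0.65, human, ~human)

labeled = rng.choice(n, size=500, replace=False)  # small human-labeled subset

print("truth:         0.30")
print(f"naive proxy:   {ai.mean():.2f}")              # biased (~0.44 here)
print(f"human subset:  {human[labeled].mean():.2f}")  # unbiased but noisier
```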
Human review remains necessary for maintainability and correct domain interpretation of generated scripts.
Qualitative finding from the mixed-method case study indicating limitations and the need for human oversight.
Validated test specifications accumulate faster than they are automated in many teams, limiting regression coverage and increasing manual work.
Observational claim stated in the paper as a motivating problem; likely based on industry experience and the Hacon case study context.
Existing AutoML methods partially alleviate this burden but remain limited to narrow aspects such as hyperparameter optimization and model selection within predefined search spaces, leaving the full development lifecycle largely dependent on human expertise.
Statement summarizing limitations of prior work (literature review/background in the paper).
Developing high-performing AI models remains a labor-intensive process that requires expert practitioners to iteratively design architectures, engineer representations, implement training pipelines and refine approaches through empirical evaluation.
Background statement in paper's introduction; general literature context rather than a specific empirical test within this paper.
Most AI tooling targets that fraction [the ~10% of the workday spent writing code].
Assertion made in the paper (abstract) as an observed mismatch between where AI tooling focuses and overall developer work activities.
Failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
Qualitative/quantitative failure analysis reported in abstract identifying obstacle categories (example given: cross-artifact consistency breakdowns).
Bankers rate 0% of GPT-5.4's outputs as client-ready.
Human ratings by bankers reported in abstract indicating none of the evaluated outputs from GPT-5.4 were judged client-ready.
Even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria.
Evaluation results reported in abstract: model-level rubric pass/fail aggregated to show best model failure rate approaching ~50% of criteria.
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows.
Author assertion in paper abstract arguing current benchmarks are insufficient; presented as motivation for developing BTB rather than empirically tested within the abstract.
Models fail to distinguish reliable predictions from unreliable ones, achieving only ≈20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation.
Analysis in the paper comparing model self-reported confidence / predictability judgments to actual accuracy across the 405 tasks; reports ≈20% accuracy irrespective of confidence/predictability judgments.
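A minimal sketch of the kind of check this describes: bin predictions by self-reported confidence and compare per-bin accuracy. The bin edges, function name, and simulated data are assumptions for illustration, not the paper's protocol:

```python
import numpy as np

def accuracy_by_confidence(confidence, correct, bins=(0.0, 0.5, 0.8, 1.01)):
    """Accuracy within self-reported confidence bands.

    Flat accuracy across bands means confidence carries no signal about
    which predictions are reliable -- the failure mode described above.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    result = {}
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidence >= lo) & (confidence < hi)
        if mask.any():
            result[f"[{lo:.1f}, {hi:.1f})"] = float(correct[mask].mean())
    return result

# Illustrative: 405 tasks where confidence is uncorrelated with correctness.
rng = np.random.default_rng(1)
conf = rng.uniform(0.0, 1.0, 405)
corr = rng.random(405) < 0.20
print(accuracy_by_confidence(conf, corr))   # roughly 0.2 in every band
```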
Human expert performance on the benchmark is approximately 20%.
Reported comparison between human experts and models on SciPredict tasks; the paper states human performance is ≈20% (evaluated on the benchmark tasks).
Model accuracies on SciPredict are 14–26%.
Empirical evaluation of multiple LLMs on the SciPredict benchmark (405 tasks); the paper reports aggregate model accuracy range 14–26%.
Regulatory and labor friction is scored per sector using actual compliance frameworks (Basel III, FDA AI guidance, HIPAA) and BLS union density data, and is applied as a haircut to base adoption rates via an S-curve ramp.
Paper description of friction scoring method referencing specific regulatory frameworks and BLS union density; applied in the model as a haircut and S-curve adoption ramp.
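A minimal sketch of the haircut-plus-ramp structure described here. Only that structure comes from the paper; the logistic functional form and every constant below (t_mid, steepness, the example rates) are assumptions:

```python
import math

def effective_adoption(t, base_rate, friction, t_mid=5.0, steepness=0.9):
    """Base adoption rate ramped along a logistic S-curve over time t,
    then reduced by a per-sector friction haircut in [0, 1]."""
    ramp = 1.0 / (1.0 + math.exp(-steepness * (t - t_mid)))
    return base_rate * ramp * (1.0 - friction)

# e.g. a sector with a 60% adoption ceiling and heavy regulatory/labor
# friction (haircut of 0.4):
for year in (1, 5, 10):
    print(year, round(effective_adoption(year, base_rate=0.60, friction=0.4), 3))
```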
Restricting AI productivity gains to the labor-generated portion of each sector's gross value added reduces the naive addressable base by approximately 72 percent.
Bottom-up sectoral model described in the paper that applies labor share to gross value added across 21 NAICS industries; the paper explicitly states the labor-generated restriction reduces the naive addressable base by ~72%.
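A worked toy version of that restriction (sector names, GVA figures, and labor shares are invented, and deliberately contrived to land near the paper's ~72% figure; the paper itself uses 21 NAICS industries):

```python
sectors = {
    #              gross value added ($bn), labor share of GVA
    "manufacturing": (2_000, 0.50),
    "finance":       (1_500, 0.35),
    "real_estate":   (2_500, 0.05),   # GVA dominated by capital income
}

naive_base = sum(gva for gva, _ in sectors.values())
labor_base = sum(gva * share for gva, share in sectors.values())

reduction = 1.0 - labor_base / naive_base
print(f"addressable base shrinks by {reduction:.0%}")   # ~72% here
```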
Environmental demands place an upper bound on the degree of heterogeneity required in a distributed production system.
Theoretical claim derived from the Distributed Production System framework and discussed in the paper; supported by conceptual argument and model constraints rather than empirical data; no sample size reported.