Evidence (6507 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	609	159	77	736	1615
Governance & Regulation	664	329	160	99	1273
Organizational Efficiency	624	143	105	70	949
Technology Adoption Rate	502	176	98	78	861
Research Productivity	348	109	48	322	836
Output Quality	391	120	44	40	595
Firm Productivity	385	46	85	17	539
Decision Quality	275	143	62	34	521
AI Safety & Ethics	183	241	59	30	517
Market Structure	152	154	109	20	440
Task Allocation	158	50	56	26	295
Innovation Output	178	23	38	17	257
Skill Acquisition	137	52	50	13	252
Fiscal & Macroeconomic	120	64	38	23	252
Employment Level	93	46	96	12	249
Firm Revenue	130	43	26	3	202
Consumer Welfare	99	51	40	11	201
Inequality Measures	36	105	40	6	187
Task Completion Time	134	18	6	5	163
Worker Satisfaction	79	54	16	11	160
Error Rate	64	78	8	1	151
Regulatory Compliance	69	64	14	3	150
Training Effectiveness	81	15	13	18	129
Wages & Compensation	70	25	22	6	123
Team Performance	74	16	21	9	121
Automation Exposure	41	48	19	9	120
Job Displacement	11	71	16	1	99
Developer Productivity	71	14	9	3	98
Hiring & Recruitment	49	7	8	3	67
Social Protection	26	14	8	2	50
Creative Output	26	14	6	2	49
Skill Obsolescence	5	37	5	1	48
Labor Share of Income	12	13	12	—	37
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Productivity Remove filter

Retail supply chain operations in supermarket chains involve continuous, high-volume manual workflows spanning demand forecasting, procurement, supplier coordination, and inventory replenishment.

Descriptive claim stated in the paper's introduction/abstract; no empirical data, sample, or methods reported to substantiate this characterization within the text provided.

high negative Flowr -- Scaling Up Retail Supply Chain Operations Through A... degree of manual operations / automation exposure

The two margins interact through a self-undermining feedback that can generate low-archive traps (multiple equilibria with low accumulated public archive).

Dynamic equilibrium analysis in the theoretical model showing interacting feedbacks and possible trap equilibria (model-derived result).

high negative When AI Improves Answers but Slows Knowledge Creation: Match... accumulated archive size / equilibrium archive level

Resolution margin: the probability that posted queries are resolved declines because AI raises contributors' outside options, thinning the contributor pool and creating congestion on the platform.

Mechanism and comparative-static implication produced by the paper's theoretical model; no empirical sample provided in the excerpt.

high negative When AI Improves Answers but Slows Knowledge Creation: Match... probability that posted queries are resolved (conditional resolution rate)

Flow margin: the posted volume of knowledge-enhancing queries declines as AI resolves more problems privately before they reach the platform.

Mechanism derived in the theoretical model; stated as the flow-margin channel (no empirical quantification in the provided text).

high negative When AI Improves Answers but Slows Knowledge Creation: Match... posted volume of knowledge-enhancing queries

AI reduces archive creation through two distinct margins: a flow margin and a resolution margin.

Analytical decomposition derived within the paper's theoretical model (mechanism claimed by the model).

high negative When AI Improves Answers but Slows Knowledge Creation: Match... archive creation (rate and quality of accumulated solutions)

Generative AI resolves user problems without leaving a public trace, so fewer discussions and solutions reach public platforms.

Stated as an empirical motivation in the paper; no empirical sample or quantified measurement reported in the provided text.

high negative When AI Improves Answers but Slows Knowledge Creation: Match... volume of public posts / archival content

Green AI research has largely measured the footprint of models rather than the downstream workflows in which GenAI is a tool.

Literature review / mapping of recent Green AI literature reported in the paper; descriptive claim about the focus of the field (no sample size or numerical counts reported in the abstract).

high negative On the Carbon Footprint of Economic Research in the Age of G... scope/emphasis of Green AI research (model-level vs. workflow-level measurement)

Existing benchmarks differ from real usage in programming language distribution, prompt style and codebase structure.

Paper asserts mismatch between existing benchmarks and production usage as motivation for producing a production-derived benchmark (stated differences: language distribution, prompt style, codebase structure).

high negative ProdCodeBench: A Production-Derived Benchmark for Evaluating... representativeness of benchmarks relative to real usage

Replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release.

Conceptual argument supported by the paper's incident descriptions (e.g., a detected coordinate transformation error); the statement is presented as a general risk rationale.

high negative Exploring Robust Multi-Agent Workflows for Environmental Dat... propensity for plausible-but-incorrect outputs to bypass checks and propagate to...

Occupations whose AI-exposed steps are more dispersed across the production workflow (higher fragmentation) exhibit a substantially lower share of their steps actually executed by AI, conditional on AI exposure share.

Empirical regression analysis controlling for share of AI-exposed steps; uses dataset linking O*NET tasks, human AI exposure assessments, Anthropic Economic Index execution outcomes, and GPT-generated workflow orderings (details in Sections 5.1 and 7).

high negative Chaining Tasks, Redefining Work: A Theory of AI Automation share (fraction) of steps executed by AI at the occupation/job level

Treated firms' demand for external capital investment falls by just over $220,000 relative to the control group.

RCT with 515 firms; reported dollar-change in external investment demand between treated and control firms.

high negative Mapping AI into Production: A Field Experiment on Firm Perfo... change in external capital investment demand (USD)

Despite faster growth, treated firms do not scale inputs proportionally: their demand for external capital investment falls by 39.5% relative to the control group.

RCT with 515 firms; firms reported external capital demand/investment requests; comparison of investment demand between treatment and control groups.

high negative Mapping AI into Production: A Field Experiment on Firm Perfo... demand for external capital investment

For the private business sector, if the set of automated tasks were frozen in 1950, 87% of TFP growth between 1950 and 2023 would have been eliminated.

Counterfactual growth-accounting exercise that freezes the set of automated tasks at 1950 while allowing capital, labor, and other productivity growth to follow historical rates (simulation based on calibrated accounting).

high negative Past Automation and Future A.I.: How Weak Links Tame the Gro... fraction of historical TFP growth eliminated by freezing automation

The sum of "other" TFP growth and average labor productivity growth (ˆZt + ˆψℓt) is small — for example equal to -0.1% per year for the private business sector since 1950.

Growth-accounting decomposition for the private business sector since 1950 using BEA/BLS data in the task-based framework.

high negative Past Automation and Future A.I.: How Weak Links Tame the Gro... combined growth rate of other TFP and average labor productivity (ˆZt + ˆψℓt)

Under the rapid scenario, economists forecast the share of wealth held by the wealthiest 10% of households rising to 80.0% by 2050.

Conditional forecasts in Key Findings for the economist respondent group under the rapid AI scenario (2050 horizon).

high negative Forecasting the Economic Effects of AI fraction of wealth held by top 10% of households by 2050 (rapid scenario)

Conditional on the rapid scenario, economists forecast the labor force participation rate falling from its current level of 62% to 55% by 2050.

Conditional forecasts in Key Findings for the economist respondent group under the rapid AI scenario (2050 horizon).

high negative Forecasting the Economic Effects of AI labor force participation rate (LFPR) by 2050 under rapid scenario

There are macroeconomic risks associated with AI-led unemployment.

Paper's macroeconomic analysis drawing on labor economics and technology adoption research; no quantitative estimates or sample sizes provided in the summary.

high negative A Shorter Workweek as Economic Infrastructure: Managing AI-D... macroeconomic risk indicators (e.g., unemployment, aggregate demand shortfalls)

Managerial incentives drive premature workforce contraction during AI adoption.

Analytical claim grounded in labor economics and organizational behavior review; the summary indicates examination of managerial incentives but does not report primary empirical tests or sample sizes.

high negative A Shorter Workweek as Economic Infrastructure: Managing AI-D... timing and extent of workforce contraction

Premature workforce contraction in response to AI adoption foreshadows deeper structural challenges as AI systems mature.

Forward-looking claim based on synthesis of literature and theoretical projection; no empirical quantification or sample provided in the summary.

high negative A Shorter Workweek as Economic Infrastructure: Managing AI-D... long-run structural economic challenges (e.g., systemic instability, labor marke...

This pattern of premature workforce reductions reflects longstanding corporate short-termism rather than genuine technological displacement.

The paper's interpretation drawing on labor economics and organizational behavior literature; no empirical study or sample size reported in the summary.

high negative A Shorter Workweek as Economic Infrastructure: Managing AI-D... drivers of workforce reduction (managerial incentives vs. actual automation capa...

Organizations face mounting pressure to demonstrate immediate returns on AI investments, often through workforce reductions that outpace actual automation capabilities.

Argument in paper citing accelerating AI adoption across sectors and observed managerial responses; no primary dataset or sample size reported in the text.

high negative A Shorter Workweek as Economic Infrastructure: Managing AI-D... workforce reductions / layoffs

Applying the Auditor-Corrector methodology to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors — including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth — that penalize correct agent outputs.

Audit results on ELT-Bench identifying categories of benchmark errors (rigid scripts, ambiguous specs, incorrect ground truth) and attributing many failed transformation tasks to these errors; no numeric breakdown or sample count given in the excerpt.

high negative ELT-Bench-Verified: Benchmark Quality Issues Underestimate A... proportion of failed transformation tasks attributable to benchmark errors (qual...

On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility.

Reference to initial evaluation results on ELT-Bench showing low success rates for AI agents; the provided excerpt does not give numerical success rates or sample size.

high negative ELT-Bench-Verified: Benchmark Quality Issues Underestimate A... agent success rate on ELT-Bench (agent capability / practical utility)

The way we're thinking about generative AI right now is fundamentally individual (this appears in how users interact with models, how models are built, how they're benchmarked, and how commercial and research strategies using AI are defined).

Author's observational/descriptive claim supported by argumentative examples (mentions user interaction patterns, model design and benchmarking practices, and commercial/research strategies); no empirical sample or quantitative analysis reported in the excerpt.

high negative The Future of AI is Many, Not One conceptual framing and practices around generative AI (individual-focused design...

Traditional questionnaires yielded slightly higher accuracy in risk assessment.

Result reported from the two experiments comparing traditional questionnaires to adaptive ARQuest versions; no numeric accuracy or sample size provided in the excerpt.

high negative AI in Insurance: Adaptive Questionnaires for Improved Risk P... risk assessment accuracy

Insurers must blindly trust users' responses, increasing the chances of fraud.

Stated as a motivating problem in the paper; presented as logical/empirical concern rather than supported by a reported study within the paper.

high negative AI in Insurance: Adaptive Questionnaires for Improved Risk P... fraud risk from self-reported responses

Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences.

Descriptive claim in paper introduction arguing limitations of standard questionnaires; no experiment or sample size reported for this assertion.

high negative AI in Insurance: Adaptive Questionnaires for Improved Risk P... ability of standardized questionnaires to capture individual differences

Using a stylised inpatient capacity signalling example and minimal game-theoretic reasoning, task optimisation alone is unlikely to change system outcomes when incentives are unchanged.

Theoretical analysis using a stylised inpatient capacity signalling example and game-theoretic reasoning presented in the paper (no empirical data/sample reported in the abstract).

high negative Incentives, Equilibria, and the Limits of Healthcare AI: A G... system-level outcomes in healthcare (response to task optimisation interventions...

Deployment of AI systems carries significant costs including ongoing costs of monitoring and it is unclear whether optimism of a deus ex machina solution is well-placed.

Conceptual/argumentative claim made by the authors in the paper (no empirical study or sample size reported in the abstract).

high negative Incentives, Equilibria, and the Limits of Healthcare AI: A G... costs and uncertainty associated with AI deployment (including monitoring costs)

Improvements in operational resilience (OR) effectively reduce corporate operational risk.

Further analysis reported in the paper linking higher OR to lower operational risk measures for firms in the sample.

high negative Does Artificial Intelligence Improve the Operational Resilie... corporate operational risk (reduction)

AI promotes operational resilience by reducing management agency conflicts.

Mechanism (mediation) tests reported in the paper showing AI associated with reductions in measures of agency/management conflict, which in turn relate to OR improvements.

high negative Does Artificial Intelligence Improve the Operational Resilie... management agency conflicts (reduction)

No regulatory framework requires disclosure of machine/AI labor output.

Author's assertion in the paper (policy claim; no legislative survey or quantification reported).

high negative HEWU: A Standardized Framework for Measuring Machine-Generat... presence of regulatory disclosure requirements for machine labor

No index tracks machine labor output over time.

Author's assertion in the paper (stated lack of existing indices; no systematic review/sample reported).

high negative HEWU: A Standardized Framework for Measuring Machine-Generat... existence of time-series index for machine labor output

This labor force is entirely invisible to the economic infrastructure humanity has built to measure work: no standardized unit of measurement exists.

Author's assertion/diagnosis in the paper (argumentative/observational, no empirical survey or sample reported).

high negative HEWU: A Standardized Framework for Measuring Machine-Generat... existence of standardized unit for machine labor

Agent contributions are associated with more churn over time compared to human-authored code.

Longitudinal comparison between agent-generated and human-authored contributions reported in the paper (churn/survival estimates described; association between agent contributions and higher churn asserted).

high negative Investigating Autonomous Agent Contributions in the Wild: Ac... code churn rate over time (agent-generated vs human-authored)

Unbalanced or poorly governed adoption of Big Data and AI contributes to increased systemic risk, cybersecurity vulnerability, regulatory fragmentation and third-party dependence on BigTech platforms.

Argument based on qualitative literature review and synthesis of international empirical studies and comparative sector analysis; no single-sample empirical study in this paper.

high negative Implications of Big Data Technologies for the Resilience of ... systemic risk; cybersecurity vulnerability; regulatory fragmentation; third-part...

Extreme automation (high AI intensity) causes employment decline.

Part of the U-shaped relationship reported by the paper's empirical results; described qualitatively in the abstract/summary.

high negative Impact Of Artificial Intelligence (AI) On Employment employment decline

Task orchestration is the most under-researched dimension among the five workplace-design components.

Finding from the PRISMA-guided systematic review of 120 papers, which mapped coverage across the five dimensions and identified task orchestration as having the least research attention.

high negative From Automation to Augmentation: A Framework for Designing H... volume/coverage of research on task orchestration

Decision authority allocation emerges as the binding constraint for Society 5.0 transitions.

Result synthesized from the systematic review and theoretical analysis mapping the five workplace-design dimensions; stated as the binding constraint in the paper's findings.

high negative From Automation to Augmentation: A Framework for Designing H... constraint on transitions to human-centric (Society 5.0) technology integration

The literature shows persistent gaps in empirical validation, standardized evaluation methods, and sector-specific comparative analyses of agentic AI in financial services.

Review-level assessment noting limited empirical studies, heterogeneous evaluation metrics, and few direct cross-sector comparisons up to mid-2024.

high negative A Comparative & Systematic Review of Literature on the I... availability/quality of empirical validation and evaluation standards

Significant implementation barriers persist, notably workforce transformation challenges, legacy system integration difficulties, and trust deficits.

Thematic synthesis across empirical and conceptual papers in the review reporting implementation barriers and change management issues.

high negative A Comparative & Systematic Review of Literature on the I... implementation barriers (workforce, legacy systems, trust)

Ethical concerns—including bias, lack of transparency, and regulatory compliance risks—remain critical for agentic AI in financial services and necessitate layered governance and human-AI collaboration.

Collation of ethical, legal, and governance issues reported across the reviewed multidisciplinary studies and normative discussions.

high negative A Comparative & Systematic Review of Literature on the I... prevalence/severity of ethical and regulatory risks and governance needs

Insurance is comparatively underrepresented in the literature and in reported agentic AI deployments compared with banking and investment.

Review finding (counts/themes across included studies indicating fewer studies/applications in insurance relative to banking and investment).

high negative A Comparative & Systematic Review of Literature on the I... relative representation/adoption across financial subsectors

A weak manager directing a weak worker achieves a 42% success rate, performing worse than the weak agent alone which achieves 44%.

Empirical comparison across the same 200 SWE-bench Lite instances and pipeline configurations, comparing weak-manager+weak-worker pipeline to weak single-agent baseline.

high negative Can AI Models Direct Each Other? Organizational Structure as... task success rate (percentage of tasks solved)

Task complexity shapes substitution: low-complexity tasks see high substitution, while high-complexity tasks favor limited partial automation.

Calibration of the model to O*NET tasks + expert survey + GPT-4o decompositions; implementation results reported for computer vision showing substitution varies with task complexity.

high negative Economics of Human and AI Collaboration: When is Partial Aut... degree of labor substitution as a function of task complexity

AI systems exhibit predictable but diminishing returns to data, compute, and model size (scaling-law experiments), implying the cost of higher accuracy is convex: good performance may be inexpensive, but near-perfect accuracy is disproportionately costly.

Scaling-law experiments estimating performance as a function of data, compute, and model size; described experimental estimation of production function.

high negative Economics of Human and AI Collaboration: When is Partial Aut... marginal returns to inputs (data, compute, model size) and marginal cost of accu...

The common claim that generative AI simply amplifies the Dunning–Kruger effect is too coarse to capture the available evidence.

Paper's synthesis of heterogenous empirical findings from human–AI interaction, learning research, and model evaluation used to critique the uniform-amplification interpretation; no single empirical countertest reported.

high negative Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupli... validity of the 'amplified Dunning–Kruger' interpretation

LLM use degrades metacognitive accuracy and flattens the classic competence–confidence gradient across skill groups (i.e., reduces calibration and narrows differences in self-assessed confidence by skill level).

Synthesis of studies from human–AI interaction and learning research reported in the paper that document worsened calibration and a reduction in the competence–confidence gradient when users rely on LLM outputs; the paper does not report a single combined sample size.

high negative Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupli... metacognitive accuracy / calibration and competence–confidence gradient

The agent team topology exhibits higher operational fragility due to multi-author code generation.

Reported empirical observation from experiments comparing architectures, attributing increased fragility/errors to multi-author code generation in the agent team setup (stated qualitatively; no quantitative failure rates provided in the abstract).

high negative An Empirical Study of Multi-Agent Collaboration for Automate... operational fragility / error-proneness associated with multi-author code genera...

Azar et al. (2023) show that monopsonistic employers have stronger incentives to automate and document that US commuting zones with higher labor market concentration experienced more robot adoption.

Citation reported in the paper summarizing Azar et al. (2023); empirical analysis across US commuting zones (no sample size provided here).

high negative NBER WORKING PAPER SERIES robot adoption correlated with labor market concentration; incentives to automat...

« Prev 1 2 3 … 12 13 14 … 130 131 Next »