Evidence (14055 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
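
The counts above can be turned into direction shares per outcome. The following is a minimal sketch, not part of the source page: the three rows are copied from the matrix, while the names `parse_row` and `negative_share` are illustrative assumptions. In some rows the listed Total exceeds the sum of the four direction columns (AI Safety & Ethics sums to 593 against a listed 599), so the sketch divides by the listed Total rather than recomputing it.

```python
# Sketch only: helper names are hypothetical; the rows are copied from the matrix.

MISSING = "\u2014"  # the matrix marks an empty cell with a dash character

ROWS = """\
AI Safety & Ethics\t218\t277\t65\t33\t599
Job Displacement\t12\t80\t20\t1\t113
Labor Share of Income\t17\t19\t17\t\u2014\t53
"""

FIELDS = ("positive", "negative", "mixed", "null", "total")


def parse_row(line):
    """Split one tab-separated row into an outcome label and integer counts."""
    outcome, *cells = line.split("\t")
    counts = [0 if cell == MISSING else int(cell) for cell in cells]
    row = dict(zip(FIELDS, counts))
    row["outcome"] = outcome
    return row


def negative_share(row):
    """Fraction of an outcome's listed Total that is tagged negative."""
    return row["negative"] / row["total"]


rows = [parse_row(line) for line in ROWS.splitlines()]

for row in rows:
    print(f"{row['outcome']}: {negative_share(row):.1%} negative")
```

Treating a missing cell as zero is a simplifying assumption. If the dash means "not coded" rather than "no claims", those cells should be excluded instead.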

Existing approaches to AI explainability, grounding, and hallucination detection do not address input fidelity because they focus on output quality.

Argument in the paper contrasting prior work on explainability and hallucination detection with the problem of input fidelity; based on literature review and conceptual analysis.

medium negative Participatory provenance as representational auditing for AI... scope of existing explainability/grounding/hallucination detection methods with ...

Human advisors suppressed warnings under pressure at two to four times the AI rate.

Comparison between human benchmark (1,201 participants) and LLM outputs (3,360 conversations) in the preregistered experiment; reported suppression rates for humans were 2–4x those for AIs.

medium negative Large Language Models Outperform Humans in Fraud Detection a... suppression rate of fraud warnings under pressure

Because experienced workers are aging out of the workforce, simultaneous curtailment of formative occupational layers by platforms may create a shortage of workers able to manage complex systems.

Argument combining demographic observation (aging workforce) with the paper's theoretical claim about erosion of entry-level apprenticeship layers; no empirical test or quantified projection provided.

medium negative When Platforms Replace the Pipeline: AI, Labor Erosion, and ... availability of skilled workers for supervisory/complex management roles

Microsoft's realized routing bias has been voluntarily constrained by a March 2026 multi-model pivot.

Paper's descriptive assessment based on observable product/strategy events (March 2026 pivot) and how that affects routing bias in the comparative mapping.

medium negative The Inference Bottleneck: A Formal Model of Vertical Foreclo... routing bias (degree realized/constrained)

Other models fail more severely (i.e., worse than the frontier models mentioned).

Comparative results across the 19 evaluated LLMs reported in the experiment indicate worse corruption rates for models not classified as 'frontier'.

medium negative LLMs Corrupt Your Documents When You Delegate document corruption / output quality

Because aggressive compression shifts interpretive burden to the model's reasoning phase, aggressive token compression can paradoxically increase overall cost.

Interpretation/explanation of the experimental result (causal mechanism proposed by authors) linking compression to increased reasoning burden; supported by the reported experiment, but the mechanism is inferential rather than directly measured in the abstract.

medium negative Beyond Human-Readable: Rethinking Software Engineering Conve... distribution of computational/interpretive workload between input processing and...

There are universal bottlenecks requiring architectural innovations beyond parameter scaling.

Paper interpretation of results and analysis arguing that the observed limitations and asymmetries point to architectural bottlenecks that cannot be resolved solely by increasing model parameters.

medium negative ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... model capability limitations / architectural requirements

Model performance on ImplicitMemBench is far below human baselines.

Paper asserts model scores are 'far below human baselines' after reporting model percentages; the excerpt does not provide the numeric human baseline value.

medium negative ImplicitMemBench: Measuring Unconscious Behavioral Adaptatio... benchmark accuracy compared to human performance

Models are beginning to be deployed to generate revenue for the companies that created them through advertisements, creating potential conflicts of interest between company incentives and users' best interests.

Conceptual/observational claim advanced in the paper motivated by industry deployment trends and the authors' framework; not a quantified experimental result in the abstract.

medium negative Ads in AI Chatbots? An Analysis of How Large Language Models... corporate monetization of LLMs via advertisements and resulting incentive confli...

Scaling intelligence alone will not solve coordination problems in multi-agent systems; solving them will require deliberate cooperative design, even when helping others costs nothing.

Conclusion drawn from the paper's experimental findings (comparative performance across models and responses to targeted interventions); presented as a general implication in the abstract.

medium negative More Capable, Less Cooperative? When LLMs Fail At Zero-Cost ... ability of scaling model capability alone to resolve coordination failures

Existing energy-focused guidelines and metrics have seen limited adoption among practitioners, leaving a gap between research and everyday coding practice.

Claim made in paper's background/motivation; no adoption-rate data included in the excerpt.

medium negative EcoAssist: Embedding Sustainability into AI-Assisted Fronten... adoption of energy-focused guidelines and metrics by practitioners

Unstructured physical trades and high-stakes caretaking roles exhibit absolute resilience to LLM-driven automation (i.e., very low OAI), quantifying a 'Cognitive Risk Asymmetry.'

Empirical classification from computed OAIs showing low exposure for unstructured physical trades and high-stakes caretaking roles; the excerpt does not provide specific OAI values or counts.

medium negative Bounded by Risk, Not Capability: Quantifying AI Occupational... Relative Occupational Automation Index (OAI) for unstructured physical trades an...

Variance-based Human-in-the-Loop (HITL) validation with an expert panel demonstrates a profound cognitive gap: isolated algorithmic probabilities fail to encapsulate the "institutional premium" imposed by experts bounded by professional liability.

Empirical validation procedure reported: variance-based HITL validation involving an expert panel that compared algorithmic scores and expert adjustments, concluding a systematic difference attributed to institutional liability considerations. The excerpt does not give panel size or quantitative variance statistics.

medium negative Bounded by Risk, Not Capability: Quantifying AI Occupational... difference between algorithmic probabilities and expert-assessed risk (instituti...

Industry self-regulation has demonstrably failed, motivating the need for IASCA.

Proposal asserts a 'demonstrated failure of industry self-regulation' as rationale for IASCA; no specific empirical studies, incidents, or metrics are cited in the provided text.

medium negative IASCA: The International AI Safety Certification Authority —... effectiveness of industry self-regulation

Roughly half of the projected LFPR decline to 55% by 2050 is attributable to AI—equivalent to around 10 million lost jobs.

Authors' decomposition/interpretation of conditional forecast results under the rapid scenario reported in the abstract (ties LFPR decline to job-count equivalents).

medium negative Forecasting the Economic Effects of AI job losses attributable to AI (by 2050, rapid scenario)

Our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation.

Comparative claim referencing prior observations in text-to-SQL literature and the authors' audit results on ELT-Bench; no new cross-benchmark quantitative analysis reported in the excerpt.

medium negative ELT-Bench-Verified: Benchmark Quality Issues Underestimate A... presence of systemic annotation/benchmark quality issues across data engineering...

That measured machine-equivalent work appeared on no financial statement, workforce report, or government statistical return.

Claim about absence of reporting for the deployment's measured work (asserted in the paper for the deployment case).

medium negative HEWU: A Standardized Framework for Measuring Machine-Generat... reporting/disclosure of machine labor in formal records

The AI-as-advisor approach has limitations: people frequently ignore accurate advice, rely too much on inaccurate advice, and their decision-making skills may deteriorate over time.

Paper asserts these limitations in motivation/background and/or derives them from observed behavior in experiments (stated in abstract as known problems with AI-as-advisor).

medium negative Beyond AI advice -- independent aggregation boosts human-AI ... skill deterioration / susceptibility to incorrect advice

When asked to choose which information source to give to an AI agent, a large portion of subjects fail to select the more informative one.

Experimental condition where subjects chose which source (prompt vs revealed-preference data) to provide to an AI agent; reported result that a large portion did not choose the more informative source.

medium negative Should I State or Should I Show? Aligning AI with Human Pref... choice by subjects of which information source to provide to the AI (rate of sel...

The gap in predictive accuracy is driven by subjects' difficulty in translating their own preferences into written instructions.

Further analysis reported in the experiment attributing the observed accuracy gap to subjects' difficulty converting their preferences into prompts (presumably via analysis comparing content of prompts to revealed choices).

medium negative Should I State or Should I Show? Aligning AI with Human Pref... degree to which prompt quality explains predictive accuracy gap (i.e., translati...

The emergence and diffusion of these technologies create an era of labor displacement.

Framed in the paper as a premise motivating policy proposals; presented as a conceptual claim rather than supported by original empirical estimates in the text provided.

medium negative IoT, artificial intelligence, cloud computing and robotics a... labor displacement (job loss/occupational displacement)

Many automotive firms, especially those developing new energy and intelligent vehicles, have suffered financial distress and even exited the market.

Descriptive statement in the paper's introduction/motivation citing observed industry outcomes (financial distress and market exit) among automotive firms focused on NEV and intelligent vehicles.

medium negative The 'Intelligent Trap' in Corporate Finance—A Study Based on... financial distress / market exit

The dominant mechanism behind the performance drop is a collapse of Type2_Contextual issue detection at config_B, consistent with attention dilution in long contexts.

Analysis of issue-type specific detection rates shows Type2_Contextual detection collapses at config_B; interpretation ties this to attention dilution in longer contexts.

medium negative SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... Type2_Contextual issue detection rate

Technological transformation in agentic finance is economically inevitable, and proactive intervention is critically urgent.

Author claim synthesizing the paper's argument and modeling results (normative conclusion based on earlier analysis and assertions, not a validated empirical finding).

medium negative STRENGTHENING FINANCIAL WORKFORCE COMPETITIVENESS: A CURRICU... likelihood of technology-driven structural change in the finance workforce

Surveillance intensity is associated with hyper-vigilance (reported effect = -4.213).

One of the six propositions from the paper's trilevel framework; the abstract reports an effect value of '-4.213' associated with surveillance intensity → hyper-vigilance.

medium negative Algorithmic Control and Psychological Risk in Digitally Mana... hyper-vigilance (psychological arousal/state)

Platform workers receive 36.3% more third-party ratings than traditional workers.

Quantitative synthesis/summary reported in the paper (no primary sample size in abstract); likely aggregated from included studies.

medium negative Algorithmic Control and Psychological Risk in Digitally Mana... number of third-party ratings received

Platform workers experience 59.6% higher digital speed determination than traditional workers.

Quantitative synthesis/summary reported in the paper (no primary sample size given in the abstract); presumably aggregated from included studies comparing platform and traditional workers.

medium negative Algorithmic Control and Psychological Risk in Digitally Mana... digital speed determination

Our findings surface practical limits on the complexity people can manage in human-AI negotiation.

Synthesis claim based on the empirical study varying number of issues and observed decline in performance beyond three issues; presented as a conceptual/practical implication of the results.

medium negative From Overload to Convergence: Supporting Multi-Issue Human-A... maximum manageable negotiation complexity (number of issues before performance d...

Multiple competing arbitrageurs drive down consumer prices, reducing the marginal revenue of model providers.

Analytic argument and empirical/simulation results reported in the paper showing that competition among arbitrageurs lowers prices faced by consumers and decreases marginal revenue for model providers.

medium negative Computational Arbitrage in AI Model Markets consumer prices and marginal revenue of model providers

Distillation further creates strong arbitrage opportunities, potentially at the expense of the teacher model's revenue.

Experiments or analyses involving model distillation reported in the paper showing that distilled/student models enable profitable arbitrage and may reduce revenue captured by the original teacher model.

medium negative Computational Arbitrage in AI Model Markets arbitrage profitability enabled by distilled models and impact on teacher model ...

The pre-existing AI community dissolved as the tools went mainstream, and the new vocabulary was absorbed into existing careers rather than binding a new occupation.

Interpretation of resume-data patterns: observed dispersion of previously coherent AI practitioners and spread of AI-related vocabulary into other occupational records rather than consolidation into a new occupational cluster.

medium negative NLP Occupational Emergence Analysis: How Occupations Form an... population cohesion / absorption into existing careers (dissolution of standalon...

Beyond an environment-specific optimum, scaling further degrades institutional fitness because trust erosion and cost penalties outweigh marginal capability gains.

Analytical argument from the Institutional Scaling Law together with illustrative examples and discussion of mechanisms (trust erosion, cost penalties) in the paper.

medium negative Punctuated Equilibria in Artificial Intelligence: The Instit... institutional fitness (net effect of capability, trust, cost, compliance)

Bias effects vary by vulnerability type, with injection flaws being more susceptible to framing bias than memory corruption bugs.

Subgroup analysis in Study 1 comparing framing sensitivity across vulnerability classes (injection vs memory corruption) within the experiment dataset.

medium negative Measuring and Exploiting Confirmation Bias in LLM-Assisted S... change in vulnerability detection rate by vulnerability type

Model convergence in DRL can lead to crowded trades, which has implications for market stability and motivates a robust regulatory framework balancing innovation with market stability.

Analytical argument in the paper linking convergence/crowding to systemic effects; the excerpt does not include empirical market-impact studies, simulations, or measured incidence rates of crowding.

medium negative Deep Reinforcement Learning for Dynamic Portfolio Optimizati... market stability / systemic risk (incidence or severity of crowded trades result...

Deploying DRL at scale requires socio-technical infrastructure considerations including algorithmic governance, systemic risk management, and accounting for the environmental cost of large-scale computational finance.

Conceptual and system-level analysis presented in the paper; no empirical auditing data, carbon-footprint measurements, or governance case studies are provided in the excerpt.

medium negative Deep Reinforcement Learning for Dynamic Portfolio Optimizati... governance readiness, systemic risk exposure, and environmental/resource cost me...

The paper addresses two sources of spurious performance: memorization bias from ticker-specific pre-training and survivorship bias from flawed backtesting.

Problem identification and methodological focus: the paper names memorization bias and survivorship bias as primary confounders it aims to mitigate. The excerpt does not detail experiments that quantify the magnitude of those biases or the degree to which they were reduced.

medium negative Can Blindfolded LLMs Still Trade? An Anonymization-First Fra... reduction/mitigation of spurious performance attributable to memorization and su...

Traditional ex ante regulatory approaches struggle to keep pace with AI development, exacerbating the 'pacing problem' and the Collingridge dilemma.

Theoretical/legal literature review and conceptual argument presented in the paper (no empirical sample or quantitative data reported in the abstract).

medium negative Experimentalism beyond ex ante regulation: A law and economi... regulatory responsiveness/effectiveness in relation to AI technological change

Low internal conflict or unanimity can be diagnostic of variance depletion (i.e., exclusion) rather than healthy integration, so governance systems should treat low conflict as a potential red flag until heterogeneity integration is verified.

Interpretive policy implication derived from the model's demonstration that exclusionary processes can produce deceptively low observed disagreement while increasing fragility; this recommendation is based on theoretical reasoning without empirical validation in the paper.

medium negative Cohesion as Concentration: Exclusion-Driven Fragility in Fin... internal conflict levels (observed dissent/unanimity) as indicator of variance d...

Most existing candidate matching systems act as keyword filters, failing to handle skill synonyms and nonlinear careers, resulting in missed candidates and opaque match scores.

Paper's introductory assertion about limitations of most current systems. The excerpt does not cite empirical studies, statistics, or systematic reviews to substantiate this claim.

medium negative JobMatchAI An Intelligent Job Matching Platform Using Knowle... limitations of extant systems: keyword-filter behavior, failure on skill synonym...

TDD (test-driven development) prompting alone increased regressions to 9.94%.

Empirical result reported in the paper comparing a TDD prompting intervention against other workflows on the benchmark (values given in the excerpt).

medium negative TDAD: Test-Driven Agentic Development - Reducing Code Regres... regression rate (percentage of tests that regressed) under TDD prompting

Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied.

Paper's critique of existing benchmark literature and practices (asserted by authors in background; no specific benchmark survey details in the excerpt).

medium negative TDAD: Test-Driven Agentic Development - Reducing Code Regres... coverage of regression measurement in existing benchmarks

The paper identifies five structural challenges arising from the memory governance gap: memory silos across agent workflows; governance fragmentation across teams and tools; unstructured memories unusable by downstream systems; redundant context delivery in autonomous multi-step executions; and silent quality degradation without feedback loops.

Qualitative analysis and problem framing presented in the paper (authors' identification of five specific challenges).

medium negative Governed Memory: A Production Architecture for Multi-Agent W... presence/identification of five structural governance challenges

AI raises managerial cognitive complexity and creates recurring tensions between algorithmic optimisation and systemic, ethical reasoning.

Theoretical synthesis highlighting emergent tensions from integrating computational optimisation with systems thinking and ethical considerations; conceptual, no empirical tests.

medium negative Comparative analysis of strategic vs. computational thinking... managerial cognitive complexity and frequency/severity of optimisation vs ethica...

Underprovision of verification is likely if left to market forces because information quality has positive externalities and misinformation imposes negative externalities, justifying public funding, subsidies, or regulation.

Economic reasoning and policy implications drawn from the study's findings and the literature on public goods/externalities.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... level of provision of verification services relative to social optimum

Censorship, restricted data flows, and government interference fragment markets, limit economies of scale, and favor well-resourced, internationally connected actors—widening capacity gaps.

Interpretive economic analysis grounded in observed access constraints and comparative case material across the three platforms.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... market fragmentation and distribution of capacity among actors

Limited data access and censorship reduce the efficacy of AI tools by creating training and validation gaps; legal risks complicate use of proprietary platforms and cloud services.

Interviews describing constraints on data availability and legal/operational barriers to using some platforms and cloud services; interpretive analysis of implications for AI training/validation.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... AI tool effectiveness (training/validation quality) and deployability

Generative AI increases the volume and sophistication of misinformation (deepfakes, fabricated documents), raises false-positive risks, and can be weaponized by state or nonstate actors.

Interview accounts and qualitative analysis noting observed or anticipated misuse of generative models and associated verification challenges.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... misinformation volume/sophistication and verification error risk

Resource constraints—limited staff time, funding, and technical capacity—are recurring operational challenges for these platforms.

Staff and stakeholder interviews plus analysis of organizational reports indicating staffing, funding, and technical limitations.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... staffing levels, funding availability, technical capacity

Platforms experience difficulty building and retaining audience trust and engagement, especially in contexts of high public skepticism or polarization.

Interview data from platform staff describing audience engagement challenges, supported by analysis of audience-focused platform formats and community-reporting strategies.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... audience trust and engagement levels

Platforms face limited or asymmetric access to primary data sources such as platform APIs, state data, and archives.

Interview accounts and document analysis noting restricted API access and barriers to state-held data and archives across the three cases.

medium negative Fact-Checking Platforms in the Middle East: A Comparative St... access to primary data sources
