Evidence (8570 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Adoption

Using a frontier model's system prompt to supply the procedure exposes proprietary procedures to third-party providers.

Author statement describing privacy/proprietary risk as a cost of the system-prompt approach (qualitative claim).

high negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... exposure of proprietary procedures to third-party providers (privacy/intellectua...

Using a frontier model's system prompt to supply the procedure requires a frontier model for every conversation.

Author statement describing operational/cost trade-offs associated with the system-prompt approach (qualitative claim).

high negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... requirement to use frontier model per conversation (operational/deployment cost)

Using a frontier model's system prompt to supply the procedure has costs: it consumes the context window.

Author statement referencing trade-offs identified alongside the Dennis et al. result; cost described qualitatively (context window consumption).

high negative Compiling Agentic Workflows into LLM Weights: Near-Frontier ... context-window usage

Emerging evidence indicates that algorithms often inherit and amplify the historical biases present in training data.

Literature claim in paper referencing 'emerging evidence' and empirical studies (2024–2026) — specific studies, methods, and sample sizes not included in excerpt.

high negative The Algorithmic Mirror: Can Artificial Intelligence Truly Mi... presence and amplification of historical bias in algorithmic outputs

Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability–accuracy relationship on the same outputs.

Empirical comparison in the paper between single-threshold scoring and tail-inclusive (continuous/unbounded) scoring on identical forecast outputs, showing sign reversal of the capability–accuracy relationship (numerical details not provided in excerpt).

high negative Is Capability a Liability? More Capable Language Models Make... capability–accuracy relationship under tail-inclusive scoring (impact of model c...

A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect.

Within-family empirical comparisons using Llama-3.1 variants examining effects of model scale and post-training (fine-tuning) on forecasting calibration (details and sample sizes not provided in excerpt).

high negative Is Capability a Liability? More Capable Language Models Make... relationship between model scale / post-training and forecasting calibration (di...

A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put.

Per-quantile decomposition analyses of model predictive distributions reported in the paper, showing quantile-specific changes (specific quantitative results not given in excerpt).

high negative Is Capability a Liability? More Capable Language Models Make... upper-tail forecast calibration / shift in predictive quantiles
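
Illustrative note on the per-quantile decomposition above: the excerpt does not state which scoring rule the paper uses, so the sketch below assumes the standard pinball (quantile) loss as a stand-in. All forecast values and the names `pinball_loss`, `conservative`, and `aggressive` are hypothetical. It shows how a forecast that moves only its upper quantiles upward can score worse at the upper tail while its lower-tail score is unchanged.

```python
def pinball_loss(quantile_forecast: float, outcome: float, tau: float) -> float:
    """Standard pinball loss for a single quantile level tau in (0, 1)."""
    diff = outcome - quantile_forecast
    return max(tau * diff, (tau - 1.0) * diff)


def per_quantile_losses(forecast: dict, outcome: float) -> dict:
    """Score each forecast quantile separately instead of averaging them."""
    return {tau: pinball_loss(q, outcome, tau) for tau, q in forecast.items()}


# Hypothetical predictive quantiles from two models for the same target.
# The "aggressive" forecast keeps the lower tail but pushes the upper tail up.
conservative = {0.1: 80.0, 0.5: 100.0, 0.9: 130.0}
aggressive = {0.1: 80.0, 0.5: 110.0, 0.9: 200.0}
outcome = 100.0  # hypothetical realized value

loss_conservative = per_quantile_losses(conservative, outcome)
loss_aggressive = per_quantile_losses(aggressive, outcome)

for tau in sorted(conservative):
    print(tau, round(loss_conservative[tau], 3), round(loss_aggressive[tau], 3))
```

In this toy case the 0.1-quantile loss is identical for both forecasts, while the 0.9-quantile loss is larger for the aggressive one, which is the shape of result the claim describes.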

The pattern replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation.

Empirical replication reported on multiple real-world datasets (COVID-19, measles, housing markets, hyperinflation) presented in the paper (dataset sizes not provided in excerpt).

high negative Is Capability a Liability? More Capable Language Models Make... forecast performance on real-world time series (distributional forecasts / calib...

The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control.

Results on the authors' released simulated benchmark (ForecastBench-Sim) using synthetic SIR epidemic simulations and a matched linear-control experiment reported in the paper (specific number of simulations or runs not stated in excerpt).

high negative Is Capability a Liability? More Capable Language Models Make... forecast performance on simulated SIR epidemics (distributional forecasts)

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change ... more capable models produce worse distributional forecasts.

Empirical experiments reported in the paper comparing LLMs of varying capability on forecasting tasks with superlinear growth and regime-change tail risk; uses distributional forecast evaluation across models (no sample size reported in excerpt).

high negative Is Capability a Liability? More Capable Language Models Make... distributional forecast quality / calibration

The lack of prediction stability and predictability can lead to advertiser-perceivable problems such as repeatability issues, cold start, and under-exploration.

Stated as an intuitive/motivational claim in the paper linking instability to advertiser-facing problems; no empirical quantification provided in the excerpt.

high negative LLM Retrieval for Stable and Predictable Ad Recommendations repeatability, cold start, under-exploration (advertiser-perceived issues)

Traditional ads recommendation systems have primarily focused on optimizing for prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG).

Background/contextual claim about prior work and standard practice; stated in the paper as motivation (no empirical evidence provided in the excerpt).

high negative LLM Retrieval for Stable and Predictable Ad Recommendations optimization focus on click/conversion prediction accuracy (recall, NDCG)

AIO is negatively associated with the carbon emission intensity of upstream suppliers.

Authors report a negative association between firms' AIO and the carbon emission intensity of their upstream suppliers in the empirical results using Chinese listed firms (2010–2023).

high negative Artificial intelligence orientation and decarbonization spil... carbon emission intensity (upstream suppliers)

AIO is negatively associated with the carbon emission intensity of industry peers.

Authors report a negative association between a firm's AIO and the carbon emission intensity of its industry peers based on their empirical analyses of Chinese listed companies over 2010–2023.

high negative Artificial intelligence orientation and decarbonization spil... carbon emission intensity (industry peers)

Stronger AIO is associated with lower carbon emission intensity within the focal firm.

Empirical association reported between firm-level AIO (measured via LLMs) and firm carbon emission intensity in the authors' analysis of Chinese listed firms (2010–2023); result described as a negative relationship.

high negative Artificial intelligence orientation and decarbonization spil... carbon emission intensity (focal firm)

Commercial demand drivers systematically distort finished-goods inventory targets and require integration with sales-and-operations planning for accurate calibration.

Narrative synthesis of studies addressing demand-driver effects on finished-goods targets and recommendations for S&OP integration.

high negative Equitable railway corridor investment under demand uncertain... accuracy/calibration of finished-goods inventory targets

Science-to-technology knowledge flow in AI has been insufficiently examined in a systematic and structural way.

Literature-gap claim in the paper motivating the study.

high negative Knowledge flows from science to AI technology: Identifying c... extent of systematic/structural study of science-to-technology knowledge flow in...

Highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation.

Synthesis of quantitative (coverage/reuse statistics) and qualitative analyses (narrative framing, taxonomy mapping) from the Benchmarking-Cultures-25 project; interpretive conclusion drawn by the authors.

high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... primary function of highlighted benchmarks (standardized measurement vs narrativ...

Authors of many 'general knowledge application' benchmarks claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math).

Content analysis of the benchmarks in the dataset showing topical focus (counts/observations indicating predominance of STEM/math topics) versus broader claimed measurement scope.

high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... topical focus of benchmark content (STEM/math prevalence) versus stated measurem...

Qualitative analysis shows many 'general knowledge application' benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI.

Qualitative content analysis of benchmark descriptions and builder narratives in the dataset; authors report themes where construct validity is downplayed and AGI progress is emphasized.

high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... degree of attention to construct validity vs AGI-framing in benchmark narratives

38.5% of highlighted benchmarks appear in just one release.

Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks); the paper reports the share (38.5%) of benchmarks that appear in only a single model release.

high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... durability/reuse of benchmarks across releases

The evaluation landscape is fragmented with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder.

Quantitative analysis of the Benchmarking-Cultures-25 dataset (231 benchmarks). The paper reports the share (63.2%) based on counts of builders per highlighted benchmark.

high negative Unsteady Metrics and Benchmarking Cultures of AI Model Build... degree of cross-model benchmark reuse (benchmarks per builder)

Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback, resulting in inefficient exploration and elevated financial risk for advertising platforms.

Argument in the paper contrasting generative-model-based approaches with the authors' proposed solution (conceptual claim; no quantitative backing given in the excerpt).

high negative Generative Auto-Bidding with Unified Modeling and Exploratio... exploration efficiency and financial risk in generative-model-based auto-bidding

Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies.

Statement in the paper summarizing limitations of prior RL-based bidding work (qualitative claim; no experimental details or sample size provided in the excerpt).

high negative Generative Auto-Bidding with Unified Modeling and Exploratio... ability of RL approaches to handle long-term dependencies

Early rule-based methods lacked adaptability.

Literature/contextual statement in the paper's introduction summarizing prior approaches to automated bidding (no empirical data or sample size reported).

high negative Generative Auto-Bidding with Unified Modeling and Exploratio... adaptability of early rule-based bidding methods

Twin agents dissolve that boundary, raising a class of trust calibration challenge these frameworks were not designed to handle.

Argument and design observations from the authors' ongoing project presented in the paper; conceptual claim explaining why existing frameworks may be insufficient for twin agents.

high negative From Role to Person: Trust Calibration Challenges in Twin Ag... framework_applicability_for_trust_calibration

When a human colleague doubts a twin agent's output, they face three failure modes (a schema gap, an epistemic gap, and a model artifact) with no reliable attribution path between them.

Conceptual taxonomy derived from the authors' early design observations; presented as an identified set of failure modes in the paper (qualitative, no numeric sample reported in abstract).

high negative From Role to Person: Trust Calibration Challenges in Twin Ag... failure_mode_attribution

Drawing on early design work in an ongoing project, we identify a trust calibration problem specific to this approach.

Based on the authors' early design work (qualitative/design research) described in the paper; no sample size or quantitative metrics reported in the abstract.

high negative From Role to Person: Trust Calibration Challenges in Twin Ag... trust_calibration

Major open challenges for responsible adoption include reliability, bias, privacy, automation bias, transparency, and evaluation.

Authors' identification of risks and open research challenges based on their review/analysis (conceptual synthesis).

high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... list of key risks and challenges for AI adoption in code review

Current AI support for code review remains fragmented, with tools focusing on isolated tasks such as reviewer recommendation, PR description generation, or comment suggestion rather than the end-to-end PR review workflow.

Authors' survey/overview of existing AI tooling for code review described in the paper (conceptual / review-based evidence). No quantitative counts provided in the abstract.

high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... completeness / fragmentation of AI tool coverage across PR review tasks

AI coding assistants expand the volume of code requiring review, turning code review into a growing bottleneck.

Authors' analytical claim linking increased code production from AI assistants to increased review workload; presented as an observed/trend claim in the paper rather than supported by a quantified study in the abstract.

high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... volume of code requiring review / code review bottleneck

Code review has evolved for decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual, uneven, and cognitively demanding process.

Authors' literature review and historical synthesis of code review practices presented in the paper (conceptual / review-based evidence). No empirical sample or experiment reported in the abstract.

high negative Rethinking Code Review in the Age of AI: A Vision for Agenti... manualness and cognitive demand of code review process

Challenges including algorithmic bias, data privacy concerns, high costs, and skill gaps persist across contexts.

Cross-study synthesis of barriers and challenges reported in the 21 included studies spanning multiple contexts.

high negative Application of Artificial Intelligence in Human Resource Man... prevalence of adoption barriers (bias, privacy, cost, skills)

SMEs face unique resource constraints yet lag in AI-HRM adoption.

Synthesis conclusion from the systematic review of 21 included studies (published 2019–2026) comparing adoption patterns and barriers for SMEs.

high negative Application of Artificial Intelligence in Human Resource Man... AI-HRM adoption (lag) and resource constraints

Greater automation can obscure rather than eliminate failure modes.

Analytical claim in paper arguing that increased automation hides failures; presented as an interpretive finding rather than a quantified experimental result in the excerpt.

high negative AI for Auto-Research: Roadmap & User Guide visibility or obscuration of failure modes under automation

End-to-end autonomous systems have not yet consistently reached major-venue acceptance standards.

Paper's statement based on review of acceptance/peer-review outcomes and standards as of April 2026; no numeric acceptance-rate data presented in the excerpt.

high negative AI for Auto-Research: Roadmap & User Guide consistency of meeting major-venue acceptance standards

Research code lags far behind pattern-matching benchmarks.

Paper's evaluative claim from its experiments/coding analysis indicating code produced for research tasks is weaker than benchmark performance on pattern-matching tasks; excerpt contains no numerical comparison.

high negative AI for Auto-Research: Roadmap & User Guide quality/performance of research code relative to pattern-matching benchmarks

Generated ideas often degrade after implementation.

Paper statement about the gap between idea generation and implemented results reported in the Creation-phase analysis; no quantified follow-up study reported in the excerpt.

high negative AI for Auto-Research: Roadmap & User Guide quality change of generated ideas after implementation

AI remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment.

Summary claim from the paper's end-to-end lifecycle analysis indicating limitations on novelty and experimental rigor; no numeric performance metrics provided in excerpt.

high negative AI for Auto-Research: Roadmap & User Guide robustness on novel ideas, research-level experiments, and scientific judgment

Frontier LLMs fail to judge novelty reliably.

Paper's claim from its Validation-phase analysis that models do not reliably assess novelty; excerpt contains no underlying experimental sample or validation metrics.

high negative AI for Auto-Research: Roadmap & User Guide reliability of novelty judgments

Frontier LLMs miss hidden errors.

Qualitative statement from paper indicating models fail to detect some latent or subtle errors in research artifacts; no numeric evaluation provided in excerpt.

high negative AI for Auto-Research: Roadmap & User Guide ability to detect hidden errors

Under scientific pressure, even frontier LLMs still fabricate results.

Reported observation in paper about model behavior under scientific-use conditions; no specific quantitative experiments or sample sizes given in the excerpt.

high negative AI for Auto-Research: Roadmap & User Guide incidence of fabricated results by LLMs

Diagnostics also reveal a small tail of extreme errors for the Random Forest model.

Model diagnostic analyses reported in the paper indicating error distribution and presence of extreme prediction errors (tail).

high negative Determinants of Successful IoT and AI Initiatives in the SMA... distribution of prediction errors (presence of extreme errors)

Unrestricted frontier-scale checkpoint synthesis remains open (i.e., not yet solved).

Authors' assessment in the abstract noting current limits; asserts that unrestricted synthesis at frontier/model-scale has not been achieved.

high negative Position: Weight Space Should Be a First-Class Generative AI... feasibility/status of unrestricted frontier-scale checkpoint synthesis

Both major deployed systems and designed mechanisms concentrate on the most observable and easiest-to-govern tier, while the forms of commercial influence most consequential for user autonomy remain poorly understood and lack frameworks for detection, measurement, or disclosure.

Critical review of deployed system design choices and governance mechanisms; authors assert that attention and tooling focus on observable product-mention-level interventions while higher-tier influences lack measurement and disclosure frameworks.

high negative Generative AI Advertising as a Problem of Trustworthy Commer... coverage of governance/mechanisms across influence tiers and the existence of fr...

These tiers instantiate across modalities and system architectures, including retrieval-augmented generation and agentic pipelines where upstream decisions can sharply constrain downstream outcomes.

Analytical claim supported by examples and discussion of system architectures (e.g., RAG, agentic pipelines) showing how interventions at different stages map to the taxonomy; no quantitative evaluation reported in excerpt.

high negative Generative AI Advertising as a Problem of Trustworthy Commer... presence of influence tiers across different system modalities and architectures

Generative AI fundamentally changes advertising: rather than placing products into discrete slots, it enables interventions on the generative process itself, which induce commercial influence through less observable channels.

Conceptual argument backed by analysis of how generative models produce outputs and how interventions can operate on latent variables of generation; illustrated via taxonomy in the paper rather than quantified empirical tests.

high negative Generative AI Advertising as a Problem of Trustworthy Commer... modes/channels of commercial influence in advertising systems

Empirical research shows that ads woven directly into large language model (LLM) outputs often go undetected by users.

Reference to prior empirical studies (unspecified in the excerpt) showing user failure to detect embedded ads in LLM outputs; presented as an empirical finding rather than new experimental data in this paper.

high negative Generative AI Advertising as a Problem of Trustworthy Commer... user detection/recognition of ads embedded in LLM outputs

Management shareholding and analyst attention amplify the debt-cost penalty faced by AI washing firms.

Heterogeneity/interaction analyses showing larger post-shock financing-cost increases for AI washing firms with higher management shareholding and greater analyst attention (descriptive of moderator effects; no sample sizes in abstract).

high negative Dissipation of Debt Financing Privilege on Corporate AI Wash... magnitude of debt financing cost penalty

Difference-in-differences estimations reveal that AI washing firms experience a 12.5 basis point relative increase in debt financing cost afterward.

Difference-in-differences estimations comparing AI washing firms to others before and after the FYP shock; effect reported as 12.5 basis points increase in debt financing cost (sample size not stated in abstract).

high negative Dissipation of Debt Financing Privilege on Corporate AI Wash... debt financing cost
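
Illustrative note on the difference-in-differences estimate above: the abstract gives neither the sample nor the regression specification, so the sketch below shows only the basic two-group, two-period calculation on made-up numbers. The function name `did_estimate` and every observation are hypothetical; the paper's actual model presumably includes controls and fixed effects that are not reproduced here.

```python
from statistics import mean


def did_estimate(rows):
    """Two-by-two difference-in-differences from (treated, post, outcome) rows.

    Returns (treated post - treated pre) - (control post - control pre).
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, outcome in rows:
        cells[(treated, post)].append(outcome)
    m = {key: mean(values) for key, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change


# Hypothetical debt financing costs in basis points (not from the paper).
rows = [
    (True, False, 500.0), (True, False, 510.0),   # treated firms, before the shock
    (True, True, 520.0), (True, True, 525.0),     # treated firms, after the shock
    (False, False, 480.0), (False, False, 490.0), # comparison firms, before
    (False, True, 485.0), (False, True, 490.0),   # comparison firms, after
]

effect = did_estimate(rows)
print(f"DiD estimate: {effect:.1f} basis points")
```

The estimate nets out the change that comparison firms also experienced, so only the additional change among treated firms is attributed to the shock, under the usual parallel-trends assumption.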
