Evidence (8570 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Adoption

A-insensitivity acts as a cognitive barrier between beliefs and trust (i.e., it reduces the extent to which beliefs about forecast accuracy are translated into trust).

Interpretation based on experimental findings showing that higher a-insensitivity weakens the predictive relationship between beliefs about accuracy and expressed trust in analysts (derived from measures and analyses in the lab experiment; sample size not reported in abstract).

high negative Trusting human versus machine predictions as a decision unde... belief-to-trust translation (strength of relationship between beliefs and trust)

Decision-makers who are more a-insensitive are less likely to incorporate their beliefs about forecast accuracy into their trust judgments.

Experimental data where participants' a-insensitivity was measured and used to predict the extent to which their beliefs (optimism about accuracy) translate into trust for analysts (moderation/interaction analysis implied; sample size not reported in abstract).

high negative Trusting human versus machine predictions as a decision unde... degree to which beliefs predict trust (belief–trust linkage)

People exhibit a 'speedup illusion': they forecast their independent completion times accurately but significantly underestimate AI-assisted completion times.

Empirical pattern reported in the abstract: comparison of predicted vs. actual times shows accurate independent forecasts but underestimation of AI-assisted completion times (preregistered study, N = 1237).

high negative Cognitive offloading and the speedup illusion in human-AI in... calibration of predicted vs actual completion time

A conventional two-arm test understates the algorithmic channel by a factor of two.

Empirical comparison reported in the paper between the three-arm design estimates and conventional two-arm test estimates from the live campaign.

high negative Algorithm or Creative? A Three-Arm Experimental Design for D... bias/understatement factor in estimated algorithmic effect from two-arm test

In the live Meta campaign, the creative channel moves female impression share by -0.68 percentage points.

Empirical result from the live Meta campaign reported in the paper; measured effect size (-0.68 percentage points).

high negative Algorithm or Creative? A Three-Arm Experimental Design for D... female impression share (change attributable to creative channel)

Adjusting for the realized audience is biased because audience is a post-treatment mediator.

Causal inference argument in paper explaining why conditioning on realized audience induces bias (audience as post-treatment mediator).

high negative Algorithm or Creative? A Three-Arm Experimental Design for D... bias from post-treatment adjustment

Every two-arm test conflates the creative's effect with the algorithm's targeting response.

Theoretical/causal argument presented in the paper about confounding in standard two-arm experiments when algorithmic delivery is endogenous.

high negative Algorithm or Creative? A Three-Arm Experimental Design for D... confounding/bias in estimated creative effect

Uzbekistan faces a structural shortage of qualified personnel and a gap between its education system and the needs of its economy.

Synthesis of statistical data, industry reviews, and regulatory/legal document analysis presented in the paper (no primary survey/sample size reported).

high negative The Impact of Artificial Intelligence During the Transformat... shortage of qualified personnel and education–economy skills gap

As these systems scale, the bottleneck shifts away from raw model capability toward coordination.

Analytical/argumentative claim in the paper framing a shift in primary constraint; no empirical study or quantified benchmark reported.

high negative Foundation Protocol: A Coordination Layer for Agentic Societ... primary system bottleneck (model capability versus coordination capacity)

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up.

Statement in paper's motivation/background; no empirical method or sample size reported in the abstract.

high negative XWind: A Cross-site Router for Large Language Model Inferenc... strain on power grids relative to AI power demand

A reported limitation is that at this privacy level the released valuations remain noise-dominated; the system's utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.

Authors' limitation/analysis section and experimental observations.

high negative CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolv... utility source (valuation signal vs. public index routing/adaptive scheduling)

Static temporal knowledge-graph data marketplace designs suffer three coupled failures: (i) stale hybrid index shortcuts reduce recall as edges evolve, (ii) stationary Shapley pricing misattributes value after distribution shifts, and (iii) uncoordinated agents over-consume a shared differential-privacy budget.

Authors' problem statement / conceptual diagnosis presented in the paper (no numeric sample size reported).

high negative CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolv... marketplace failures (recall reduction, pricing misattribution, privacy budget o...

Monotonic baselines collapse when extrapolating beyond the training regime (e.g., predicting a 12B model up to 307B tokens) whereas the Shannon Scaling Law remains predictive.

Empirical comparison on the held-out 12B extrapolation: authors report collapse/failure of monotonic baseline scaling laws in that regime contrasted with Shannon law's successful prediction (pooled R^2 reported).

high negative LLMs as Noisy Channels: A Shannon Perspective on Model Capac... extrapolative predictive failure/success of baseline vs proposed scaling laws

This Shannon perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation.

Theoretical argument derived from the Shannon-Hartley based formulation plus supporting empirical examples claimed in the paper showing non-monotonic (U-shaped) loss/accuracy behavior when SNR is insufficient.

high negative LLMs as Noisy Channels: A Shannon Perspective on Model Capac... performance vs. scale behavior (transition from monotonic improvement to U-shape...
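
For reference, the classical Shannon-Hartley theorem that the evidence note above mentions can be written as follows. This is the textbook statement only; how the paper maps bandwidth, signal, and noise onto model size, data, and training noise is not reproduced in this listing, so no such mapping is assumed here.

```latex
% Shannon-Hartley theorem (textbook form):
%   C   = channel capacity (bits per second)
%   B   = bandwidth (Hz)
%   S/N = signal-to-noise ratio (linear, not dB)
C = B \log_2\!\left(1 + \frac{S}{N}\right)

% Low-SNR regime: capacity grows only linearly in S/N,
% so adding bandwidth at fixed signal power yields diminishing returns.
C \approx \frac{B}{\ln 2}\,\frac{S}{N} \qquad \text{for } \frac{S}{N} \ll 1
```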

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute.

Author assertion based on literature/contextual observation and motivating examples (catastrophic overtraining, quantization-induced degradation) referenced in the paper; no specific numeric sample provided in the excerpt.

high negative LLMs as Noisy Channels: A Shannon Perspective on Model Capac... ability of prior scaling laws to explain non-monotonic performance phenomena (e....

Commercial or dual-use AI models and semiconductors do not meet the security exception criteria under GATT Article XXI(b), so security interests should be interpreted narrowly.

Legal argument and interpretive analysis in the paper contending that the GATT Article XXI(b) security exception does not encompass routine commercial or dual-use AI models and semiconductors; doctrinal legal reasoning rather than empirical measurement.

high negative Strategic Stalemates: The Paradox of Export Controls in the ... applicability of GATT Article XXI(b) security exception to dual-use/commercial A...

Overusing export controls can complicate dispute resolution and hinder AI progress.

Normative and legal-political argument in the paper: overuse raises legal disputes (e.g., WTO litigation) and may slow cross-border AI development and diffusion (qualitative reasoning).

high negative Strategic Stalemates: The Paradox of Export Controls in the ... frequency/complexity of trade disputes and pace of AI progress/development

Overly strict or arbitrary controls may violate WTO obligations.

Legal analysis in the paper arguing that some export controls could conflict with WTO law (GATT) depending on scope and justification; interpretive legal reasoning cited.

high negative Strategic Stalemates: The Paradox of Export Controls in the ... compatibility of export controls with WTO obligations

The long-term effectiveness of export controls is questionable.

Paper's argumentative assessment drawing on historical examples and theoretical considerations (qualitative reasoning rather than quantitative causal inference).

high negative Strategic Stalemates: The Paradox of Export Controls in the ... effectiveness of export controls over the long term

China responded with export curbs on critical minerals and filed a WTO complaint against the U.S. under GATT.

Factual claim citing China's counter-measures (export curbs) and legal action (WTO complaint under GATT) as described in the paper.

high negative Strategic Stalemates: The Paradox of Export Controls in the ... China's retaliatory trade measures and litigation

Large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks but their deployment in high-throughput, latency-sensitive environments remains impractical.

Statement about model performance on public benchmarks (upper bounds) and practical deployment constraints (throughput and latency), asserted by authors; no numerical deployment analysis provided in excerpt.

high negative HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... deployability / practicality in latency-sensitive, high-throughput environments

Two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength.

Comparison of jaggedness/local volatility measures and overall scores from the tournament (top-three leaderboard).

high negative GENSTRAT: Toward a Science of Strategic Reasoning in Large L... local volatility / jaggedness

Existing strategic-reasoning benchmarks evaluate models on fixed canonical games; they may saturate as the frontier improves and fail to generalize to varied real-world strategic environments.

Conceptual critique stated in the paper's motivation/background; no empirical test reported in abstract.

high negative GENSTRAT: Toward a Science of Strategic Reasoning in Large L... benchmark generalizability / benchmark saturation

When state-of-the-art kernel agents are evaluated on FastKernels, the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.

Empirical evaluation of multiple state-of-the-art kernel-generation agents on the FastKernels benchmark; aggregate speedup factors reported in abstract. The number of benchmark tasks is likely the FastKernels task set (46), though the abstract does not explicitly state the evaluation sample size for this measurement.

high negative FastKernels: Benchmarking GPU Kernel Generation in Productio... aggregate runtime speedup relative to production baselines

Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones.

Stated as motivating observation in the paper (conceptual/empirical critique of existing benchmark design and incentives). No numerical sample size given in the abstract.

high negative FastKernels: Benchmarking GPU Kernel Generation in Productio... benchmark-production alignment

Other changes are more nuanced and put typical career-growth opportunities at risk, such as receiving feedback from professional networks and developing leadership and mentorship.

Qualitative reports from interview participants (n=24) expressing concerns that AI-driven changes may reduce feedback, leadership development, and mentoring opportunities.

high negative Beyond the Org Chart: AI and the Transformation of Invisible... access to feedback, leadership development, mentorship (career growth opportunit...

Notable challenges to AI implementation include concerns about algorithmic bias, privacy, transparency, job displacement, organizational culture, and issues related to ethical and legal oversight.

Synthesis of reported challenges across the 29 empirical studies included in the scoping review.

high negative The influence of AI-Driven Employee Performance Management (... implementation barriers and risks (bias, privacy, transparency, displacement, cu...

Zero-shot evaluation shows the best positive-query mask success rate at IoU@0.75 remains below 0.17.

Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported mask success rate at IoU@0.75.

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... positive-query mask success rate at IoU@0.75
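
The metric in this entry, mask success rate at IoU@0.75, can be illustrated with a minimal sketch. It assumes binary masks stored as nested lists of 0/1 and counts a prediction as a success when its intersection-over-union with the ground-truth mask is at least 0.75. The function names, the `>=` comparison, and the handling of two empty masks are illustrative assumptions, not the benchmark's actual implementation.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two same-shaped binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def success_rate(pairs, threshold=0.75):
    """Share of (prediction, ground truth) pairs whose IoU meets the threshold."""
    hits = sum(1 for pred, truth in pairs if mask_iou(pred, truth) >= threshold)
    return hits / len(pairs)


# Toy 2x2 masks, made up for illustration only.
truth = [[1, 1], [1, 1]]
pred_close = [[1, 1], [1, 0]]   # overlaps 3 of 4 pixels
pred_far = [[1, 0], [0, 0]]     # overlaps 1 of 4 pixels

rate = success_rate([(pred_close, truth), (pred_far, truth)])
print(rate)
```

With a threshold this strict, a prediction that covers the right object loosely still counts as a failure, which is one reason success rates at IoU@0.75 tend to be well below those at looser thresholds.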

Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35.

Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported Set-F1 metric.

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... multi-target Set-F1

Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image.

Problem characterization / motivation described in the paper (qualitative reasoning about dataset and task properties).

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... difficulty of reliable evaluation for agricultural visual grounding

Technical bottlenecks (cross-border data compliance, algorithm interpretability) and ethical challenges (algorithmic bias, privacy infringement, cultural conflicts) are intertwined impediments to intelligent international marketing.

Synthesis of challenges identified across the reviewed literature (systematic review and content analysis, 2010–2025) as reported in the paper.

high negative Research on International Marketing in the Context of Intell... presence and interrelation of technical and ethical barriers

Traditional international marketing theories, constrained by static assumptions and linear logic, struggle to explain intelligent contexts.

Conclusion from the paper's systematic review and content analysis of core literature (2010–2025); no quantitative test or sample size reported in the summary.

high negative Research on International Marketing in the Context of Intell... theoretical explanatory adequacy of traditional international marketing theories

Cost and the lack of an applicable use case are the most-cited barriers to AI adoption, followed by lack of expertise.

Survey question(s) on barriers to adoption in the Census Bureau survey in which respondents reported reasons for not adopting AI; ranking provided in the paper (cost, lack of use case, then expertise).

high negative The Adoption of Industrial AI in America reported barriers to AI adoption (cost, applicability, expertise)

Intensity-weighted adoption is far lower than the 22.8 percent headline rate.

Survey-derived intensity-weighted measure of AI adoption constructed from the same Census Bureau survey (no numeric value reported in the excerpt).

high negative The Adoption of Industrial AI in America intensity-weighted AI adoption
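
One way to see why an intensity-weighted measure falls below a headline "any use" rate is the sketch below. The listing does not describe how the paper constructs its measure, so this assumes each plant has a hypothetical `intensity` score between 0 and 1 (for example, the share of its processes that use AI). The plants and values are invented for illustration and are not survey data.

```python
# Hypothetical plants; "intensity" is an assumed 0-1 score, not a survey field.
plants = [
    {"uses_ai": True, "intensity": 0.10},
    {"uses_ai": True, "intensity": 0.50},
    {"uses_ai": False, "intensity": 0.0},
    {"uses_ai": False, "intensity": 0.0},
]


def headline_rate(plants):
    """Share of plants reporting any AI use (each adopter counts fully)."""
    return sum(1 for p in plants if p["uses_ai"]) / len(plants)


def intensity_weighted_rate(plants):
    """Mean intensity across all plants (adopters count only as much as they use AI)."""
    return sum(p["intensity"] for p in plants) / len(plants)


print(headline_rate(plants), intensity_weighted_rate(plants))
```

Because each plant's intensity can never exceed its 0/1 adoption indicator, the weighted rate is at most the headline rate, and it falls further below it the more adopters use AI only lightly.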

Only 22.8 percent of plants report any AI use as of 2021.

Direct descriptive estimate from the Census Bureau survey of manufacturing establishments; year reported as 2021.

high negative The Adoption of Industrial AI in America share of plants reporting any AI use

ID-centric ranking models fail to generalize in livestreaming recommendation due to the short-lived nature of live rooms and poorly learned item IDs.

Authors' assertion linking the cold-start item ID problem to poor generalization of ID-centric rankers (motivating claim). No specific experimental metrics or sample sizes cited in the excerpt.

high negative FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... generalization performance of ID-centric ranking models

A live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state.

Authors' observational/operational claim about livestream characteristics stated in the paper (motivating problem statement). No sample size or quantitative backing provided in the excerpt.

high negative FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... cold-start state of item IDs (poorly learned embeddings)

The de-coring and skill-demand changes are concentrated among low entry-threshold, small firms.

Abstract statement reporting heterogeneity: concentration of observed patterns among firms characterized as small and with low entry thresholds.

high negative Toward Sustainable Workforce Development: How AI Reshapes Sk... heterogeneity of skill-demand changes by firm size and entry-threshold (concentr...

Both displacement and augmentation exposure are associated with a de-coring pattern: a shallower and more dispersed skill portfolio with within-category importance diverging from share movements.

Empirical description in abstract that both forms of exposure correlate with changes in portfolio depth and dispersion, and with divergence between within-category importance and category shares.

high negative Toward Sustainable Workforce Development: How AI Reshapes Sk... skill portfolio depth and dispersion; divergence between within-category importa...

Displacement exposure is negatively associated with the routine cognitive skill share.

Empirical result stated in abstract: negative association between displacement exposure and routine cognitive share, identified using within-firm variation and the constructed exposure measures.

high negative Toward Sustainable Workforce Development: How AI Reshapes Sk... routine cognitive skill share (share of demand for routine cognitive tasks/skill...

In deployed settings, the effects of AI systems on human agency, creativity, and institutional well-being emerge over time, shaped by repeated interaction, reuse, and integration into real-world workflows, and these dynamics are rarely visible through pre-deployment evaluation or isolated prompt–response analysis.

Argumentative observation based on conceptual reasoning; no empirical data or sample size reported.

high negative Post-Deployment Observability as a Foundation for Well-Being... emergent effects on human agency and creativity arising from extended AI use

The most significant barriers to AI adoption reported by entrepreneurs are human-centred—talent scarcity, organisational resistance, and change management—rather than technology or cost alone.

Theme 'Barriers and the Adoption Journey' from thematic analysis of interviews (n=16); interviewees repeatedly cited human-centred barriers (talent scarcity, resistance, change management) over purely technical/cost barriers.

high negative Navigating the Intelligence Frontier: AI Adoption as a Succe... adoption barriers (human-centred constraints)

Because contracts are negotiated by legal departments alone, many apparent legal disputes are incentive misalignment problems that only scientists at the table can correctly diagnose.

Argumentative claim presented in the paper (normative/diagnostic); no empirical study or sample provided in the excerpt.

high negative Position: The Pre/Post-Training Boundary Should Govern IP in... quality of contract negotiations / correct diagnosis of incentives in disputes

Industry-academia ML collaborations fail not for scientific reasons but because academics must publish while companies must protect models trained on proprietary data, and no standard contract framework resolves this tension.

The paper presents this as the causal explanation (analytical/argumentative claim); no empirical testing or sample reported in the provided text.

high negative Position: The Pre/Post-Training Boundary Should Govern IP in... incentive alignment between academic publication requirements and company IP pro...

Industry-academia ML collaborations routinely fail to launch.

Asserted in the paper as an empirical observation/statement; no empirical methods, data, or sample size reported in the provided text (argument/anecdote).

high negative Position: The Pre/Post-Training Boundary Should Govern IP in... success rate of launching industry-academia ML collaborations

People exhibit self-estimate miscalibration: on average they believe they are using AI less than they actually are.

Three pre-registered user studies (combined N = 2691) comparing participants' self-reported AI use against observed/recorded AI use during tasks.

high negative The efficiency-gain illusion: People underestimate the rate ... discrepancy between self-reported and actual AI use

The measurement bias understates substitution effects more than it understates augmentation effects.

Analytical argument and empirical evidence showing directional bias from measurement error that causes estimated substitution (labor displacement) effects to be more severely understated than augmentation (complementarity) effects.

high negative Who Uses AI? Platforms, Workforce, and AI Exposure relative bias in estimated substitution versus augmentation effects on employmen...

Reweighting platform-based exposure measures to Bureau of Labor Statistics workforce shares attenuates estimates by 42 to 93 percent.

Reweighting exercise where exposure scores built from platform logs are reweighted to match BLS workforce shares and resulting employment estimates are compared; reported attenuation range of 42–93%.

high negative Who Uses AI? Platforms, Workforce, and AI Exposure magnitude of employment estimates (attenuation after reweighting)
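
The reweighting idea in this entry can be sketched as follows. The sketch only shows how an aggregate exposure score changes when occupation weights move from platform user shares to workforce shares. The paper's 42 to 93 percent figure refers to attenuation of employment estimates, which this toy calculation does not reproduce. All occupations, scores, and shares below are invented for illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean of per-occupation values; weights are normalised internally."""
    total = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total


# Hypothetical per-occupation exposure scores (0-1).
exposure = {"software": 0.8, "office_admin": 0.4, "food_service": 0.1}

# Hypothetical shares: platform users skew toward software occupations,
# while the workforce as a whole does not.
platform_share = {"software": 0.6, "office_admin": 0.3, "food_service": 0.1}
workforce_share = {"software": 0.1, "office_admin": 0.3, "food_service": 0.6}

platform_weighted = weighted_mean(exposure, platform_share)
workforce_weighted = weighted_mean(exposure, workforce_share)

# Proportional drop when moving from platform weights to workforce weights.
attenuation = 1 - workforce_weighted / platform_weighted
print(platform_weighted, workforce_weighted, attenuation)
```

When platform users are concentrated in high-exposure occupations, platform-weighted aggregates overstate exposure relative to the workforce as a whole, which is the direction of the bias the entry describes.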

Current regulatory frameworks—designed for human-intermediated payments—are ill-equipped to address the dynamic and decentralised nature of agent-led transactions.

Regulatory and legal analysis asserted in the abstract (argument that existing frameworks are mismatched to agent-led payments).

high negative AI Agents in Payments: Applications, Risks and Regulations adequacy of existing regulatory frameworks for agent-led transactions

The article identifies and categorises a range of technical, legal and societal risks, including cybersecurity vulnerabilities, liability gaps, regulatory non-compliance, and potential economic disruption.

Risk identification and categorisation presented in the paper (qualitative analysis and case studies referenced in the abstract). No quantitative risk measurement reported in the abstract.

high negative AI Agents in Payments: Applications, Risks and Regulations technical, legal and societal risks (cybersecurity, liability, regulatory non-co...
