Evidence (16496 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1

Adjusting for the realized audience is biased because audience is a post-treatment mediator.

Causal inference argument in paper explaining why conditioning on realized audience induces bias (audience as post-treatment mediator).

high negative Algorithm or Creative? A Three-Arm Experimental Design for D... bias from post-treatment adjustment

Every two-arm test conflates the creative's effect with the algorithm's targeting response.

Theoretical/causal argument presented in the paper about confounding in standard two-arm experiments when algorithmic delivery is endogenous.

high negative Algorithm or Creative? A Three-Arm Experimental Design for D... confounding/bias in estimated creative effect

Simultaneously, there is a structural shortage of qualified personnel and a gap between the education system and the needs of the economy in Uzbekistan.

Synthesis of statistical data, industry reviews, and regulatory/legal document analysis presented in the paper (no primary survey/sample size reported).

high negative The Impact of Artificial Intelligence During the Transformat... shortage of qualified personnel and education–economy skills gap

As these systems scale, the bottleneck shifts away from raw model capability toward coordination.

Analytical/argumentative claim in the paper framing a shift in primary constraint; no empirical study or quantified benchmark reported.

high negative Foundation Protocol: A Coordination Layer for Agentic Societ... primary system bottleneck (model capability versus coordination capacity)

More persuasive narratives may have had a detrimental effect on the ability to discriminate between a correct and incorrect AI prediction.

Exploratory analyses in the paper reporting reduced discrimination between correct and incorrect AI predictions when narratives were more persuasive.

high negative Human Decision-Making with Persuasive and Narrative LLM Expl... ability to discriminate correct vs. incorrect AI predictions

More persuasive narratives may have had a detrimental effect on decision response times.

Exploratory analyses reported in the paper indicating persuasive narratives were associated with longer decision response times.

high negative Human Decision-Making with Persuasive and Narrative LLM Expl... decision response time

Higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings.

Argument based on review of current knowledge-work evaluation and benchmark design literature; paper motivates with conceptual analysis and references to empirical work showing mismatch between benchmark tasks and deployed work settings.

high negative Design and Report Benchmarks for Knowledge Work ability of a system to carry out knowledge work in real-world deployment setting...

AI systems intended to simulate companionship or emotional responsiveness raise risks such as emotional manipulation, addictive interaction patterns, and potential impact of prolonged AI interaction on users’ mental well-being, particularly for vulnerable users.

Asserted risk statement in policy recommendations; no empirical study, prevalence data, or sample provided in the text.

high negative Governing Relational AI: China’s Regulation of Anthropomorph... psychological safety (emotional manipulation, addiction, mental well-being impac...

Current systems still struggle with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure.

Survey-identified recurring failure modes and limitations reported in literature and system descriptions; qualitative synthesis.

high negative AutoResearch AI: Towards AI-Powered Research Automation for ... capabilities related to evidence preservation, reproducibility, rejection of wea...

Current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight.

Survey of existing systems and categorization across the listed dimensions; descriptive synthesis rather than an empirical meta-analysis.

high negative AutoResearch AI: Towards AI-Powered Research Automation for ... heterogeneity/fragmentation across AI research systems along autonomy, domain sc...

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up.

Statement in paper's motivation/background; no empirical method or sample size reported in the abstract.

high negative XWind: A Cross-site Router for Large Language Model Inferenc... strain on power grids relative to AI power demand

The potential widening of the gender wage gap would operate through existing patterns of gender-based occupational sorting (i.e., because women are concentrated in occupations more exposed to generative AI).

Mechanistic interpretation supported by the combination of descriptive occupational sorting evidence from Swedish administrative data and results from the partial-equilibrium simulations incorporating predicted AI exposure and task complementarity.

high negative Pre‐AI Sorting, ... mechanism linking occupational sorting to changes in gender wage gap

Mechanical partial-equilibrium simulations indicate that generative AI may widen the gender wage gap.

Counterfactual simulations (mechanical partial-equilibrium) based on hypothesized deviations from the 2021 occupational and wage distribution, incorporating predicted AI exposure and task complementarity; applied to Swedish context.

high negative Pre‐AI Sorting, ... gender wage gap (changes in wages by gender)

Women are overrepresented in occupations predicted to be more affected by generative AI (using pre-ChatGPT occupational sorting).

Descriptive analysis of Swedish administrative data characterizing occupational gender composition before the release of ChatGPT and mapping occupations to predicted exposure to generative AI.

high negative Pre‐AI Sorting, ... predicted exposure to generative AI by occupation / gender representation in hig...

A reported limitation is that at this privacy level the released valuations remain noise-dominated; the system's utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.

Authors' limitation/analysis section and experimental observations.

high negative CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolv... utility source (valuation signal vs. public index routing/adaptive scheduling)

Static temporal knowledge-graph data marketplace designs suffer three coupled failures: (i) stale hybrid index shortcuts reduce recall as edges evolve, (ii) stationary Shapley pricing misattributes value after distribution shifts, and (iii) uncoordinated agents over-consume a shared differential-privacy budget.

Authors' problem statement / conceptual diagnosis presented in the paper (no numeric sample size reported).

high negative CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolv... marketplace failures (recall reduction, pricing misattribution, privacy budget o...

Monotonic baselines collapse when extrapolating beyond the training regime (e.g., predicting a 12B model up to 307B tokens) whereas the Shannon Scaling Law remains predictive.

Empirical comparison on the held-out 12B extrapolation: authors report collapse/failure of monotonic baseline scaling laws in that regime contrasted with Shannon law's successful prediction (pooled R^2 reported).

high negative LLMs as Noisy Channels: A Shannon Perspective on Model Capac... extrapolative predictive failure/success of baseline vs proposed scaling laws

This Shannon perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation.

Theoretical argument derived from the Shannon-Hartley based formulation plus supporting empirical examples claimed in the paper showing non-monotonic (U-shaped) loss/accuracy behavior when SNR is insufficient.

high negative LLMs as Noisy Channels: A Shannon Perspective on Model Capac... performance vs. scale behavior (transition from monotonic improvement to U-shape...

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute.

Author assertion based on literature/contextual observation and motivating examples (catastrophic overtraining, quantization-induced degradation) referenced in the paper; no specific numeric sample provided in the excerpt.

high negative LLMs as Noisy Channels: A Shannon Perspective on Model Capac... ability of prior scaling laws to explain non-monotonic performance phenomena (e....

Commercial or dual-use AI models and semiconductors do not meet the security exception criteria under GATT Article XXI(b), so security interests should be interpreted narrowly.

Legal argument and interpretive analysis in the paper contending that the GATT Article XXI(b) security exception does not encompass routine commercial or dual-use AI models and semiconductors; doctrinal legal reasoning rather than empirical measurement.

high negative Strategic Stalemates: The Paradox of Export Controls in the ... applicability of GATT Article XXI(b) security exception to dual-use/commercial A...

Overusing export controls can complicate dispute resolution and hinder AI progress.

Normative and legal-political argument in the paper: overuse raises legal disputes (e.g., WTO litigation) and may slow cross-border AI development and diffusion (qualitative reasoning).

high negative Strategic Stalemates: The Paradox of Export Controls in the ... frequency/complexity of trade disputes and pace of AI progress/development

Overly strict or arbitrary controls may violate WTO obligations.

Legal analysis in the paper arguing that some export controls could conflict with WTO law (GATT) depending on scope and justification; interpretive legal reasoning cited.

high negative Strategic Stalemates: The Paradox of Export Controls in the ... compatibility of export controls with WTO obligations

The long-term effectiveness of export controls is questionable.

Paper's argumentative assessment drawing on historical examples and theoretical considerations (qualitative reasoning rather than quantitative causal inference).

high negative Strategic Stalemates: The Paradox of Export Controls in the ... effectiveness of export controls over the long term

China responded with export curbs on critical minerals and filed a WTO complaint against the U.S. under GATT.

Factual claim citing China's counter-measures (export curbs) and legal action (WTO complaint under GATT) as described in the paper.

high negative Strategic Stalemates: The Paradox of Export Controls in the ... China's retaliatory trade measures and litigation

Even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications.

Empirical evaluation reported in the paper comparing two SOTA coding agents on a suite of 7 distributed key-value-store specifications; success counted as meeting the specification.

high negative Inductive Deductive Synthesis: Enabling AI to Generate Forma... output quality

Large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks but their deployment in high-throughput, latency-sensitive environments remains impractical.

Statement about model performance on public benchmarks (upper bounds) and practical deployment constraints (throughput and latency), asserted by authors; no numerical deployment analysis provided in excerpt.

high negative HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... deployability / practicality in latency-sensitive, high-throughput environments

Rule debt is a governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments.

Conceptual construct introduced and defined in the paper; supported by illustrative examples, no empirical measurement reported.

high negative Redrawing the AI Map: A Theory of Accountability Boundaries ... governance burden (rule debt)

AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular.

Theoretical claim supported by conceptual analysis and domain illustrations; no empirical sample or formal measurement reported.

high negative Redrawing the AI Map: A Theory of Accountability Boundaries ... placement of accountability boundaries (integration vs modularization)

A complementary Oaxaca–Blinder decomposition shows that shifts in occupational composition account for about 90% of the exposure change attributable to observable job characteristics.

Oaxaca–Blinder decomposition reported in the paper attributing ~90% of exposure change (among the portion explained by observable job characteristics) to occupational composition shifts.

high negative Generative AI and the Reorganization of Labor Demand fraction of exposure change (attributable to observable job characteristics) exp...

Within-job redesign accounts for 39.5% of the aggregate decline in generative-AI exposure and becomes increasingly important over time.

Decomposition of changes in aggregate exposure into two margins (reallocation across jobs and within-job redesign) reported in the paper (result: within-job redesign = 39.5% of aggregate decline; authors note its increasing importance).

high negative Generative AI and the Reorganization of Labor Demand share of aggregate decline in generative-AI exposure explained by within-job red...

Hiring reallocation explains the largest share of the aggregate decline in generative-AI exposure, accounting for 52% on average.

Decomposition of changes in aggregate exposure into two margins (reallocation across jobs and within-job redesign) reported in the paper (result: hiring reallocation = 52% of aggregate decline).

high negative Generative AI and the Reorganization of Labor Demand share of aggregate decline in generative-AI exposure explained by hiring realloc...
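For readers unfamiliar with two-margin decompositions of this kind, the sketch below shows one common way to split a change in aggregate exposure into a reallocation term and a within-job term. It is a generic shift-share identity, not the paper's confirmed method. The `JobCell` class, the `decompose` function, and the toy numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class JobCell:
    # Hypothetical container: one job category observed in two periods.
    share_before: float     # hiring share in the base period
    share_after: float      # hiring share in the later period
    exposure_before: float  # exposure score in the base period
    exposure_after: float   # exposure score in the later period


def decompose(cells):
    """Split the change in share-weighted exposure into two margins.

    Uses midpoint weights, so that for every cell
    s1*e1 - s0*e0 == (s1 - s0) * mean(e) + mean(s) * (e1 - e0).
    Returns (total_change, reallocation_term, within_job_term).
    """
    total = reallocation = within = 0.0
    for c in cells:
        d_share = c.share_after - c.share_before
        d_exposure = c.exposure_after - c.exposure_before
        mean_share = (c.share_after + c.share_before) / 2
        mean_exposure = (c.exposure_after + c.exposure_before) / 2
        reallocation += d_share * mean_exposure   # hiring shifts across jobs
        within += mean_share * d_exposure         # jobs themselves change
        total += (c.share_after * c.exposure_after
                  - c.share_before * c.exposure_before)
    return total, reallocation, within


# Toy data (assumed): hiring moves away from a high-exposure job whose
# own exposure also falls; the second job's exposure is unchanged.
cells = [
    JobCell(share_before=0.6, share_after=0.5,
            exposure_before=0.8, exposure_after=0.7),
    JobCell(share_before=0.4, share_after=0.5,
            exposure_before=0.2, exposure_after=0.2),
]
total, reallocation, within = decompose(cells)
print(f"total change: {total:.3f}")
print(f"reallocation share: {reallocation / total:.1%}")
print(f"within-job share: {within / total:.1%}")
```

With midpoint weights the two terms sum exactly to the total change. Decompositions that use base-period weights instead leave an interaction term, which is one reason reported shares need not add up to 100%.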

We argue that regions are unlikely to maximize all three [Progress, Sustainability, Equity] simultaneously under current technological, institutional, and resource conditions.

Argument based on synthesis of prior literature on limits of AI development and illustrative evidence (regional cases and stakeholder comment analysis); explicitly stated in the abstract.

high negative The AI Infrastructure Triad in Regional Governance: How Regi... ability of regions to simultaneously maximize Progress, Sustainability, and Equi...

The rapid expansion of artificial intelligence infrastructure, including data centers and the energy, land, water, and labor systems that support them, presents regional policymakers with trade-offs that are poorly captured by the prevailing "innovation versus regulation" frame.

Conceptual argument drawing on prior literature and illustrative regional examples presented in the paper; stated explicitly in the abstract.

high negative The AI Infrastructure Triad in Regional Governance: How Regi... degree to which regional policy trade-offs are captured by the 'innovation vs re...

Two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength.

Comparison of jaggedness/local volatility measures and overall scores from the tournament (top-three leaderboard).

high negative GENSTRAT: Toward a Science of Strategic Reasoning in Large L... local volatility / jaggedness

Existing strategic-reasoning benchmarks evaluate models on fixed canonical games and may saturate as the frontier improves and fail to generalize to varied real-world strategic environments.

Conceptual critique stated in the paper's motivation/background; no empirical test reported in abstract.

high negative GENSTRAT: Toward a Science of Strategic Reasoning in Large L... benchmark generalizability / benchmark saturation

Evaluating state-of-the-art kernel agents on FastKernels, the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.

Empirical evaluation of multiple state-of-the-art kernel-generation agents on the FastKernels benchmark; aggregate speedup factors reported in abstract. The number of benchmark tasks is likely the FastKernels task set (46), though the abstract does not explicitly state the evaluation sample size for this measurement.

high negative FastKernels: Benchmarking GPU Kernel Generation in Productio... aggregate runtime speedup relative to production baselines

Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones.

Stated as motivating observation in the paper (conceptual/empirical critique of existing benchmark design and incentives). No numerical sample size given in the abstract.

high negative FastKernels: Benchmarking GPU Kernel Generation in Productio... benchmark-production alignment

Other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk.

Qualitative reports from interview participants (n=24) expressing concerns that AI-driven changes may reduce feedback, leadership development, and mentoring opportunities.

high negative Beyond the Org Chart: AI and the Transformation of Invisible... access to feedback, leadership development, mentorship (career growth opportunit...

Integrations of AI that neglect human factors are associated with increased anxiety, burnout, and disengagement among users.

Aggregate findings from the systematic review reporting associations in the literature between non-human-centered AI integration and negative psychological/work outcomes.

high negative Yapay Zeka Sistemleri ve İnsan İşbirliğinin Psikolojik, Sosy... anxiety, burnout, disengagement

Notable challenges to AI implementation include concerns about algorithmic bias, privacy, transparency, job displacement, organizational culture, and issues related to ethical and legal oversight.

Synthesis of reported challenges across the 29 empirical studies included in the scoping review.

high negative The influence of AI-Driven Employee Performance Management (... implementation barriers and risks (bias, privacy, transparency, displacement, cu...

Fragmented, uncoordinated approaches in the absence of national strategy constitute a structural barrier to technological development in Georgia.

Method: logical inference and country assessment presented in the paper documenting fragmentation across policy and institutional actors; qualitative evidence rather than quantitative causal estimation.

high negative Economic Impact of Artificial Intelligence and Policy Framew... barriers to technological development / policy fragmentation

In Georgia, the total absence of a national AI strategy and legal definition produces fragmented approaches, creating a structural barrier to technological development.

Method: country-level assessment of policy and legal framework for AI in Georgia; descriptive analysis identifying lack of a national strategy and definition. (No sample size reported.)

high negative Economic Impact of Artificial Intelligence and Policy Framew... technological development / policy coherence

This transition proceeds without tools to forecast how individual employees will respond psychologically and behaviorally.

Asserted by the authors as a gap/need; no empirical inventory or systematic review presented in the excerpt to substantiate completeness of tool absence.

high negative Toward an AI-Powered Computational Testbed for Workforce Pol... availability of forecasting tools for individual employees' psychological and be...

Workforce transformations are difficult to forecast and costly to mismanage.

Stated as a general assertion in the paper's introduction; no empirical data, sample, or formal analysis reported in the excerpt.

high negative Toward an AI-Powered Computational Testbed for Workforce Pol... forecastability of workforce transformations and costs of mismanagement

Student-designed tasks reveal hidden failures in current deep research systems: fluent, source-backed answers can still miss the right query, source, term, or evidence standard.

Qualitative analysis of failure modes from student-designed tasks and system evaluations reported in the paper (examples and discussion of how answers can be fluent and sourced yet incorrect on key criteria).

high negative Teaching AI Through Benchmark Construction: QuestBench as a ... types of model failure (mismatch on query, source selection, terminology, eviden...

Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%.

Empirical evaluation reported in the paper: 13 systems evaluated on QuestBench; aggregated mean question-level pass rate reported as 16.85%.

high negative Teaching AI Through Benchmark Construction: QuestBench as a ... question-level pass rate (model performance on benchmark)

Zero-shot evaluation shows the best positive-query mask success rate at IoU@0.75 remains below 0.17.

Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported mask success rate at IoU@0.75.

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... positive-query mask success rate at IoU@0.75
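As a reading aid for the metric above, here is a minimal sketch of how a mask success rate at IoU@0.75 could be computed. It assumes the rate is the fraction of positive queries whose predicted mask overlaps the ground-truth mask with intersection-over-union of at least 0.75. The benchmark's exact definition may differ, and the function names and toy masks are illustrative rather than taken from the paper.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks.

    Masks are equally sized nested lists of 0/1 values.
    Returns 0.0 when both masks are empty (an assumed convention).
    """
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 0.0


def success_rate(pairs, threshold=0.75):
    """Fraction of (pred, truth) mask pairs with IoU >= threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for pred, truth in pairs
               if mask_iou(pred, truth) >= threshold)
    return hits / len(pairs)


# Toy masks (assumed): one prediction just reaches the threshold,
# the other covers only a quarter of the target.
truth = [[1, 1],
         [1, 1]]
pred_close = [[1, 1],
              [1, 0]]   # IoU = 3/4
pred_poor = [[1, 0],
             [0, 0]]    # IoU = 1/4

rate = success_rate([(pred_close, truth), (pred_poor, truth)])
print(rate)
```

A threshold of 0.75 is strict: a prediction that captures most of a small or irregular target can still count as a failure.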

Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35.

Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported Set-F1 metric.

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... multi-target Set-F1

Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image.

Problem characterization / motivation described in the paper (qualitative reasoning about dataset and task properties).

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... difficulty of reliable evaluation for agricultural visual grounding

Technical bottlenecks (cross-border data compliance, algorithm interpretability) and ethical challenges (algorithmic bias, privacy infringement, cultural conflicts) are intertwined impediments to intelligent international marketing.

Synthesis of challenges identified across the reviewed literature (systematic review and content analysis, 2010–2025) as reported in the paper.

high negative Research on International Marketing in the Context of Intell... presence and interrelation of technical and ethical barriers
