Evidence (15198 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding (for example, "what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other, use the Evidence Explorer instead.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	806	212	105	975	2164
Governance & Regulation	898	417	197	128	1671
Organizational Efficiency	865	210	132	88	1306
Technology Adoption Rate	703	265	130	115	1224
Research Productivity	474	140	65	350	1041
Output Quality	507	197	61	53	818
Decision Quality	358	181	86	52	684
AI Safety & Ethics	245	294	71	34	650
Firm Productivity	465	60	93	22	646
Market Structure	188	173	126	25	517
Task Allocation	225	72	78	34	414
Innovation Output	246	30	48	18	344
Skill Acquisition	182	67	62	18	329
Employment Level	112	57	110	13	294
Fiscal & Macroeconomic	137	72	45	28	289
Firm Revenue	175	50	28	5	259
Consumer Welfare	122	71	46	13	252
Task Completion Time	187	34	10	14	246
Inequality Measures	45	127	50	6	228
Worker Satisfaction	95	75	23	12	205
Error Rate	77	98	11	4	190
Regulatory Compliance	84	73	17	7	181
Automation Exposure	61	63	27	14	168
Training Effectiveness	98	21	14	19	154
Team Performance	93	18	28	11	151
Wages & Compensation	79	39	25	7	150
Developer Productivity	105	18	14	6	144
Job Displacement	12	84	23	1	120
Hiring & Recruitment	53	8	8	3	72
Skill Obsolescence	6	51	9	1	67
Social Protection	40	17	8	2	67
Creative Output	32	20	8	3	64
Labor Share of Income	17	20	17	1	55
Worker Turnover	15	15	—	3	33
Industry	—	—	—	1	1

When firms rationally substitute AI for labor, aggregate labor income can fall and depress demand, which accelerates further AI substitution. The net feedback of this 'displacement spiral' is either self-limiting (convergent) or explosive (runaway adoption plus demand collapse), depending on the AI capability growth rate, the diffusion speed across firms and sectors, and the reinstatement rate (the rate at which new paid human roles or demand reappear).

Formal model derivations that identify key parameters and inequalities separating convergent vs explosive regimes; calibrated simulations that vary capability growth, diffusivity, and reinstatement elasticity to produce different phase outcomes.

medium negative Abundant Intelligence and Deficient Demand: A Macro-Financia... aggregate labor income; AI adoption rate; regime outcome (convergent vs explosiv...

Rapid AI adoption can create a macro-financial stress scenario, not primarily through productivity collapse or existential risk but through a distribution-and-contract mismatch. AI-generated abundance reduces the need for human cognitive labor while institutions (wage contracts, credit, consumption patterns, financial intermediation) remain anchored to the scarcity of human cognition. The result is a self-reinforcing downward spiral in labor income, demand, and intermediary margins that can tip into an explosive crisis unless it is offset by sufficiently fast reinstatement of human-paid demand or by deliberate policy or market responses.

Analytical macro-financial model coupling firm-level substitution decisions, aggregate demand mapping, and financial-sector balance-sheet propagation; calibrated numerical simulations using U.S. macro time series (FRED), BLS occupation-level employment and wages, and published occupation-level AI-exposure indices; phase diagrams and scenario time-paths reported in the paper.

medium negative Abundant Intelligence and Deficient Demand: A Macro-Financia... macro-financial stress (aggregate labor income, demand, intermediary margins, an...

Distributional shifts and regime changes require periodic revalidation or updates to the time-series foundation model (TSFM) to maintain reliable performance.

Paper discussion of limitations and recommended operational procedures (revalidation and periodic TSFM updates) to handle non-stationarity and regime shifts; rationale based on time-series modeling risks.

medium negative Regression Models Meet Foundation Models: A Hybrid-AI Approa... Robustness of forecasting performance under distributional/regime shifts

If the TSFM produces biased or poor forecasts in certain regimes, those errors can propagate into the downstream regression and harm performance.

Stated caveat in the paper (theoretical/empirical rationale); a logical consequence of using TSFM-generated features as inputs. The error-propagation risk is discussed in the analysis/limitations section.

medium negative Regression Models Meet Foundation Models: A Hybrid-AI Approa... Downstream forecast error sensitivity to TSFM forecast quality

Manual qualitative coding does not scale to massive social datasets, and frequency-based topic models suffer from 'semantic thinning' and lack domain awareness.

Conceptual statement presented as motivation; based on conventional critiques of hand-coding and bag-of-words topic models rather than new empirical evidence in this paper's summary.

medium negative THETA: A Textual Hybrid Embedding-based Topic Analysis Frame... scalability of manual coding; semantic fidelity of frequency-based topic models

Rapid coherence decay with thread depth suggests collective problem solving or consensus formation among these agents will be shallow and brittle.

Embedding-based coherence metrics demonstrating fast decline in similarity with increasing thread depth across the dataset; inferential claim about effects on deliberation and consensus processes.

medium negative What Do AI Agents Talk About? Emergent Communication Structu... coherence as a function of thread depth and inferred effect on multi-turn delibe...

Low emotional alignment and frequent affective redirection indicate human emotional contagion models may not apply to AI-agent interaction, which could produce unstable or counterintuitive coordination dynamics.

Emotion-classification results showing 32.7% mean self-alignment and 33% fear→joy response rate; theoretical interpretation comparing these patterns to human emotional contagion expectations.

medium negative What Do AI Agents Talk About? Emergent Communication Structu... emotional self-alignment and emotion transition rates; implication for coordinat...

Ritualized signaling could create apparent activity (volume, buzz) without substantive informational content, opening avenues for manipulation or mispriced assets.

Observed high rates of patterned/formulaic replies and concentrated non-informational activity patterns in Moltbook; inferential reasoning about how signal amplification without content could affect market perception and asset pricing.

medium negative What Do AI Agents Talk About? Emergent Communication Structu... volume of formulaic/ritualized activity and potential effect on perceived market...

The high prevalence of formulaic comments (more than 56%) implies large volumes of low-information signaling that can degrade the signal-to-noise ratio in information environments, harming price discovery and liquidity forecasting.

Empirical observation of >56% formulaic comments via lexical-pattern analysis, combined with theoretical inference about information quality and market microstructure (argument linking high low-information reply volume to degraded signal-to-noise).

medium negative What Do AI Agents Talk About? Emergent Communication Structu... percentage of formulaic replies and inferred effect on information quality metri...

These methodological adaptations reduce but do not eliminate validity threats; they often increase complexity and cost while leaving unresolved issues of generalizability and time-dependence.

Practitioner accounts (n=16) describing limits/tradeoffs of adaptations; authors' synthesis concluding residual threats remain despite adaptations.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... effectiveness and tradeoffs of mitigation strategies for validity threats

External validity is limited: results from a given trial may not generalize across model versions, populations, tasks, or to temporally distant deployments.

Interview-derived themes (16 practitioners) and authors' analytic mapping to external validity concerns; supported by examples of model/version dependence discussed in interviews.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... generalizability/external validity of trial results across versions, populations...

Construct validity is threatened because commonly used outcome measures can misrepresent the constructs of interest when AI changes task structure or human strategies.

Practitioners' reports in semi-structured interviews (n=16) and authors' synthesis illustrating cases where metrics no longer capture intended constructs after AI introduction.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... construct validity of outcome measures (accuracy of metrics in capturing intende...

Common internal validity threats in uplift studies of frontier AI include violations of treatment fidelity and of the stable unit treatment value assumption (SUTVA), for example contamination and time-varying treatments.

The paper's validity-consequences section, based on thematic analysis of 16 interviews and mapping practitioner-reported problems to internal validity constructs.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... treatment fidelity and SUTVA adherence in RCTs measuring uplift

Porous real-world settings cause spillovers and contamination across experimental arms, violating SUTVA and threatening internal validity.

Multiple practitioners (n=16) reported examples of spillovers and contamination during deployment-like studies; thematic analysis mapped these to SUTVA/treatment-fidelity concerns.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... internal validity (SUTVA, treatment contamination) of uplift trials

Shifting baselines (changes in tools, protocols, or knowledge during and across studies) complicate defining an appropriate control or status quo.

Interview data (16 practitioners) and thematic analysis identifying shifting baselines as a recurring challenge reported by participants.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... construct validity of the control/status-quo definition in uplift studies

Rapidly evolving models (nonstationarity) make any single trial a moving target, undermining the temporal stability of measured uplift.

Practitioner reports from semi-structured interviews (n=16) describing model updates and performance changes during/after trials; thematic coding indicating nonstationarity as a common concern.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... temporal stability/generalizability of measured uplift across model versions

Properties of frontier AI — rapid model evolution, shifting baselines, heterogeneous and changing users, and porous real-world settings — regularly strain internal, construct, and external validity of human uplift studies.

Recurring themes identified via qualitative analysis of 16 practitioner interviews; mapped to internal/construct/external validity dimensions in the paper's results.

medium negative RCTs & Human Uplift Studies: Methodological Challenges and P... internal, construct, and external validity of human uplift RCTs

Instability of agent rankings across configurations makes procurement and deployment decisions based on narrow benchmarks risky; firms should evaluate agents under their own scaffolds, datasets, and workflows before committing.

Empirical finding of ranking instability across models, scaffolds, and datasets; methodological recommendation derived from that instability.

medium negative Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... robustness_of_benchmark_based_procurement (risk_of_misleading_benchmarks)

Claims that AI will imminently replace human auditors are overstated; real-world economic benefits are more likely to come from complementary automation (breadth + triage) rather than full substitution.

Interpretation based on empirical failures in end-to-end exploitation, instability across configurations, and scaffold sensitivity observed in this study.

medium negative Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... economic_value_of_automation (qualitative_assessment_of_substitution_vs_compleme...

Detection and exploitation rankings are unstable: rankings shift across model configurations, tasks, and datasets, so results are not robust to evaluation choices.

Observed variability in detection/exploitation rankings across the expanded matrix of models, scaffolds, and datasets in the study's experiments.

medium negative Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... ranking_stability (consistency_of_model_rankings_across_configs_and_datasets)

High within-person variability and statement-dependent ambiguity imply noisy sentiment labels that can attenuate estimated effects in econometric analyses (measurement error / attenuation bias).

Empirical findings of moderate within-person stability and strong statement dependence in a sample of 81 students labeling decontextualized statements; combined with standard measurement-error theory (paper’s implication for applied analyses).

medium negative Exploring Indicators of Developers' Sentiment Perceptions in... expected bias (attenuation) in estimated associations when using noisy sentiment...
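The attenuation mechanism named in this claim can be illustrated with a minimal Python sketch. It is not taken from the paper: the sample size, noise levels, and function names (ols_slope, simulate, reliability_ratio) are assumptions chosen for illustration, and it uses the textbook classical measurement-error setup, in which independent noise is added to the regressor (here, a hypothetical sentiment score).

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def reliability_ratio(sd_true, sd_label_noise):
    """Classical attenuation factor: var(signal) / (var(signal) + var(noise))."""
    return sd_true ** 2 / (sd_true ** 2 + sd_label_noise ** 2)


def simulate(n=20000, beta=1.0, sd_true=1.0, sd_label_noise=1.0,
             sd_outcome_noise=0.5, seed=0):
    """Return (slope using true sentiment, slope using noisy labels).

    All parameter values are illustrative assumptions, not estimates
    from the study.
    """
    rng = random.Random(seed)
    true_sentiment = [rng.gauss(0.0, sd_true) for _ in range(n)]
    # Noisy labels: the true score plus independent labeling noise.
    noisy_label = [s + rng.gauss(0.0, sd_label_noise) for s in true_sentiment]
    # The outcome depends on the true sentiment, not on the label.
    outcome = [beta * s + rng.gauss(0.0, sd_outcome_noise) for s in true_sentiment]
    return ols_slope(true_sentiment, outcome), ols_slope(noisy_label, outcome)


if __name__ == "__main__":
    clean, noisy = simulate()
    print("slope with true sentiment:", round(clean, 3))
    print("slope with noisy labels:  ", round(noisy, 3))
    print("predicted attenuation:    ", reliability_ratio(1.0, 1.0))
```

With equal signal and noise variance, as assumed here, the reliability ratio is 0.5, so the slope estimated from noisy labels should come out at roughly half the true coefficient.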

Standardized platforms and benchmarks may create network effects and lock-in around dominant hardware–software stacks; antitrust and standards policy will matter to preserve competition.

Workshop participants' market-structure analysis and policy discussion included in the summary recommendations (NSF workshop, Sept 26–27, 2024).

medium negative Report for NSF Workshop on Algorithm-Hardware Co-design for ... market concentration metrics, prevalence of platform lock-in, and competition in...

The sphere + dislodgement-threshold material approximation may not capture all real-world mechanical and adhesive properties, limiting generalization.

Modeling limitation noted by the authors: the summary explicitly states that the material physics are approximated and may not capture all real-world properties. This is presented as a limitation rather than an empirical result.

medium negative Learning Adaptive Force Control for Contact-Rich Sample Scra... generalization/physical fidelity of the simulation model (limitation)

Key technical and organizational risks include model brittleness, privacy and IP concerns in code generation (training-data provenance), and increased governance and QA burdens.

Literature review highlighting known risks and survey responses reporting practitioner concerns; no quantified incident rates provided.

medium negative Artificial Intelligence as a Catalyst for Innovation in Soft... reported incidence or concern levels about risks (qualitative)

Practitioners report barriers to adoption including integration costs, lack of trust/explainability, poor data quality, and skills gaps.

Thematic analysis / coding of open-ended survey responses and literature review identifying common adoption barriers; survey sample size not specified.

medium negative Artificial Intelligence as a Catalyst for Innovation in Soft... prevalence of reported barriers in survey responses

Signals may be gamed by providers or agents; incentive-compatible design and auditability are crucial.

Risk/limitations noted by the authors as a foreseeable strategic behavior problem; presented as a caution rather than empirically observed gaming in the current dataset.

medium negative Task-Aware Delegation Cues for LLM Agents vulnerability to strategic manipulation of signals (qualitative risk)

GDP and productivity metrics that ignore interpretive labor risk understating the inputs to creative and knowledge work; RATs offer a means to measure previously invisible inputs.

Policy argument in the measurement/productivity subsection; no empirical re-estimation of GDP/productivity presented.

medium negative Chasing RATs: Tracing Reading for and as Creative Activity completeness of productivity/GDP measurement with respect to interpretive labor

Algorithmic feeds and AI summarizers tend to compress or automate interpretive traces, potentially erasing signals of reasoning, context, and tacit knowledge.

Conceptual claim supported by argumentation and examples in the paper; no empirical comparison between RATs and existing summarizers is presented.

medium negative Chasing RATs: Tracing Reading for and as Creative Activity loss of interpretive trace signals (reasoning/context/tacit knowledge) when usin...

Human ratings and preference-trained metrics reward visually vivid but exaggerated color and contrast, which leads to outputs that are less photorealistic when photorealism is the intended objective.

Reported experiments in the paper comparing human preference ratings and preference-trained evaluators against a color-fidelity-focused ground truth (CFD). The authors state these existing evaluators favor high saturation/contrast and qualitatively and quantitatively select images that are 'too vivid' relative to photographic realism (paper reports qualitative examples and quantitative comparisons; exact sample sizes and statistical values are described in paper but not provided in the summary).

medium negative Too Vivid to Be Real? Benchmarking and Calibrating Generativ... perceived photorealism / alignment with color realism (human preference and pref...

Prior work often conflates feedback source and feedback model; this study isolates them through controlled experiments.

Authors' literature review and the paper's experimental design explicitly constructed to disentangle source and model effects.

medium negative A Systematic Study of Pseudo-Relevance Feedback with LLMs Degree to which prior studies separate PRF design dimensions (methodological ass...

Quantum-centric supercomputing (QCSC) systems are capital- and skill-intensive, favoring well-resourced incumbents (large tech firms, national labs, major pharma/materials companies), potentially increasing concentration in compute-enabled domains.

Economic and industry-structure reasoning based on anticipated capital costs, specialized skills required, and comparison to existing capital-intensive compute infrastructures; no empirical market-share data.

medium negative Reference Architecture of a Quantum-Centric Supercomputer market concentration and firm advantage in compute-enabled R&D domains

Recent quantum advantage demonstrations for quantum-system simulation show utility, but practical applied research requires hybrid workflows that neither QPUs nor classical HPC can efficiently execute alone.

Review and synthesis of published quantum-simulation demonstrations and known performance/scaling limits of classical HPC; qualitative analysis of hybrid algorithm requirements; no new experiments.

medium negative Reference Architecture of a Quantum-Centric Supercomputer ability of standalone QPUs or classical HPC to execute full applied-research hyb...

Under realistic limitations (distribution shift, very large prompt inventories, or severe cold starts), the realized rollout savings and performance gains of Dynamics-Predictive Sampling (DPS) may be reduced.

Authors list these scenarios as potential limitations and caveats in the Discussion/Limitations section; no quantification provided in the summary.

medium negative Dynamics-Predictive Sampling for Active RL Finetuning of Lar... magnitude of rollout savings and performance gains under adverse conditions

Contracts and incentives based on expected performance can incentivize strategies that deliver high expected returns but poor or unreliable time-average outcomes; incentive design should account for path-dependent risks.

Theoretical/incentive argument and examples in the paper linking objective mismatch to adverse incentives; illustrative reasoning rather than empirical contract studies.

medium negative Ergodicity in reinforcement learning alignment/misalignment of incentives with reliable long-run (time-average) perfo...

Economic evaluations and deployment decisions that rely on ensemble expectations can misstate economic value and risk because firms and users experience single time-averaged trajectories; regulators and decision-makers should therefore prefer objectives reflecting single-run guarantees when relevant.

Conceptual mapping of the theoretical results to economic decision-making and deployment risk; policy and incentive discussion in the paper (argumentative, not empirical).

medium negative Ergodicity in reinforcement learning accuracy of economic valuation and risk assessment when using ensemble expectati...

The paper's illustrative example shows that a policy that maximizes expected reward can produce trajectories that lock into high- or low-reward regimes, so an agent's long-term realized reward is highly uncertain and is not captured by the expectation.

Constructed example provided in the paper; demonstration of divergent single-trajectory outcomes under a single policy; no empirical sample size (example-based).

medium negative Ergodicity in reinforcement learning distribution (uncertainty) of long-term realized reward across individual trajec...
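The gap between an ensemble expectation and what a single trajectory experiences can be sketched in a few lines of Python. This is a standard multiplicative coin-flip gamble swapped in for illustration, not the paper's constructed example; the multipliers (1.5 and 0.6), the win probability, and all function names are assumptions.

```python
import math


def ensemble_growth_factor(up, down, p_up):
    """Expected per-round wealth multiplier, averaged over many parallel agents."""
    return p_up * up + (1 - p_up) * down


def time_average_growth_factor(up, down, p_up):
    """Per-round multiplier one agent experiences over a long run (geometric mean)."""
    return math.exp(p_up * math.log(up) + (1 - p_up) * math.log(down))


def share_of_paths_below_start(up, down, p_up, rounds):
    """Exact probability that a single trajectory ends below its starting wealth of 1."""
    total = 0.0
    for wins in range(rounds + 1):
        final_wealth = up ** wins * down ** (rounds - wins)
        if final_wealth < 1.0:
            # Binomial probability of exactly `wins` favorable rounds.
            total += (math.comb(rounds, wins)
                      * p_up ** wins * (1 - p_up) ** (rounds - wins))
    return total


if __name__ == "__main__":
    up, down, p_up = 1.5, 0.6, 0.5  # illustrative values only
    print("ensemble factor:    ", ensemble_growth_factor(up, down, p_up))
    print("time-average factor:", time_average_growth_factor(up, down, p_up))
    print("share losing after 10 rounds:",
          share_of_paths_below_start(up, down, p_up, 10))
```

Under these assumed values the expected multiplier per round is above 1 (1.05), while the multiplier a single agent experiences over time is below 1 (the square root of 0.9, about 0.949), and most individual ten-round paths end below their starting wealth. An objective based only on the expectation would rate the gamble as favorable even though the typical trajectory loses.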

In contexts analogous to AI markets, a firm at a network/geographic disadvantage would need exponentially greater scale (users/data/compute) to match the probability of early discovery achieved by a better-positioned rival.

Interpretation/translation of the model's analytic scaling result to market-relevant quantities; this is a theoretical implication rather than an empirically tested claim.

medium negative Macroscopic Dominance from Microscopic Extremes: Symmetry Br... required scale (users, data, compute) to match probability of early discovery fo...

Expect diminishing returns from AI investments if parallel investments in organizational change and data governance are not made.

Synthesis of case evidence and theoretical argument: instances where additional AI investment produced limited marginal benefit absent organizational complements.

medium negative Optimizing integrated supply planning in logistics: Bridging... marginal returns to AI (performance per unit AI investment)

Legacy systems and siloed organizational structures produce persistent forecasting inaccuracies, operational disconnects, and constrained responsiveness.

Cross-case interview narratives documenting continued forecasting issues and operational misalignment in firms with legacy IT and functional silos.

medium negative Optimizing integrated supply planning in logistics: Bridging... forecasting accuracy, operational alignment, responsiveness (lead times)

MLOps and governance provisions shift costs from one-off implementation to ongoing maintenance, implying recurring costs that should be captured in economic evaluations.

Analytical/economic argument presented in the paper as an implication of including an MLOps layer (conceptual; no empirical cost accounting provided).

medium negative ALGORITHM FOR IMPLEMENTING AI IN THE MANAGEMENT LOOP OF SMES... cost structure (recurring maintenance costs vs one-off implementation costs)

Adoption complementarities (AI tools + developer skill + organizational processes) favor larger incumbents and well‑funded firms, possibly increasing concentration in tech sectors.

Theoretical argument about complementarities and returns to scale; illustrative examples; lacks firm‑level empirical testing.

medium negative How AI Will Transform the Daily Life of a Techie within 5 Ye... market concentration measures (market share, concentration ratios) and different...

In the near term, displacement risks concentrate on junior or highly routine roles; mobility and retraining will determine realized unemployment impacts.

Task automatability mapping indicating routine tasks more automatable and qualitative reasoning on labor mobility; no empirical unemployment projections.

medium negative How AI Will Transform the Daily Life of a Techie within 5 Ye... employment outcomes for junior/highly routine roles (displacement rates, unemplo...

Adoption will be heterogeneous: larger firms and well‑resourced teams will capture more gains earlier, producing competitive advantages.

Theoretical argument about adoption complementarities (AI tools + developer skill + organizational processes) and illustrative examples; no cross‑firm empirical analysis.

medium negative How AI Will Transform the Daily Life of a Techie within 5 Ye... heterogeneity in productivity gains and market advantage by firm size/resource l...

Differential adoption across firms (due to modular, scalable designs and data advantages) may create winner‑takes‑most effects and increase market concentration, benefiting early adopters with rich data/integration capabilities.

Market-structure claim supported by economic reasoning about scale and data advantages; no cross-firm empirical adoption study or market concentration time‑series is provided.

medium negative Next-Generation Financial Analytics Frameworks for AI-Enable... market concentration metrics (e.g., HHI), firm market shares, adoption timing di...

Initial investment, integration, and ongoing maintenance/compliance costs can be substantial and affect short-term ROI.

Interviewed administrators and implementation reports citing upfront and recurring costs (integration, model maintenance, compliance); quantitative budget figures not standardized across sites in the paper.

medium negative The Role of Artificial Intelligence in Healthcare Complaint ... implementation and maintenance costs; short-term return on investment (ROI)

Over-automating human roles risks deskilling staff and reducing empathy.

Thematic analysis of staff interviews and surveys reporting concerns about loss of practice, reduced patient contact, and potential diminishment of empathetic skills; no longitudinal measures of skill loss presented.

medium negative The Role of Artificial Intelligence in Healthcare Complaint ... staff-reported empathy/skill levels and qualitative indicators of deskilling

Technical and organizational integration with legacy hospital IT systems is nontrivial.

Implementation reports and interviews describing integration work, time, and resource needs; descriptive accounts of technical and organizational barriers (no universal timelines/costs reported).

medium negative The Role of Artificial Intelligence in Healthcare Complaint ... integration difficulty/time/cost (implementation burden)

Algorithmic bias in NLP models can misclassify complaints from underrepresented groups.

Observations from system classification error analyses (disparities reported by demographic group) and corroborating qualitative concerns from staff and administrators; specific subgroup sample sizes and effect magnitudes not provided.

medium negative The Role of Artificial Intelligence in Healthcare Complaint ... differential misclassification rates by demographic group (bias in NLP classific...

Data privacy and security risks arise from centralizing complaint text and metadata.

Stakeholder interviews, thematic coding of concerns, and risk assessment commentary based on centralized logs and metadata aggregation; no measured breach incidents reported here.

medium negative The Role of Artificial Intelligence in Healthcare Complaint ... privacy/security risk (qualitative risk indicators; potential exposure of compla...

Organizations will incur additional governance and procurement costs (diversity audits, recalibration of reward models, multi-model infrastructures) to mitigate homogenization, shifting some economic benefits of AI toward governance spending.

Cost implication argued from the need for auditing and multi-model procurement described in recommendations; not supported by quantified cost analyses in the paper.

medium negative The Artificial Hivemind: Rethinking Work Design and Leadersh... governance and procurement costs associated with LLM deployment
