Evidence (8807 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	870	233	116	1066	2363
Governance & Regulation	976	451	218	133	1809
Organizational Efficiency	949	224	144	88	1416
Technology Adoption Rate	764	287	141	122	1325
Research Productivity	501	152	74	362	1101
Output Quality	542	216	69	69	896
Decision Quality	387	198	94	54	740
Firm Productivity	513	67	101	27	714
AI Safety & Ethics	249	303	73	36	667
Market Structure	190	192	134	27	548
Task Allocation	243	77	91	36	452
Innovation Output	291	33	55	20	401
Skill Acquisition	206	72	65	21	364
Employment Level	133	63	115	22	335
Fiscal & Macroeconomic	153	79	52	32	323
Task Completion Time	206	37	12	15	272
Firm Revenue	179	52	29	5	266
Consumer Welfare	130	76	47	13	266
Inequality Measures	48	137	51	6	242
Worker Satisfaction	101	81	25	13	220
Error Rate	84	110	11	5	210
Wages & Compensation	98	47	30	10	185
Regulatory Compliance	88	73	17	7	185
Automation Exposure	66	64	33	16	182
Team Performance	105	29	30	11	176
Training Effectiveness	109	22	14	21	168
Developer Productivity	114	21	14	8	158
Job Displacement	12	90	24	1	127
Hiring & Recruitment	57	9	9	5	80
Skill Obsolescence	6	56	9	1	72
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	21	17	1	57
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1
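When reading the counts above, it can help to turn a row into shares and to check whether the four direction columns account for the whole Total. The sketch below does this for three rows copied from the table. It rests on two assumptions that the page does not confirm: a "—" cell is treated as zero, and any gap between Total and the sum of the four columns is reported under the hypothetical label `unclassified`. The function names are illustrative as well.

```python
# Three sample rows copied from the table above (tab-separated).
# Column order: Outcome, Positive, Negative, Mixed, Null, Total.
ROWS = """\
Other\t870\t233\t116\t1066\t2363
Job Displacement\t12\t90\t24\t1\t127
Worker Turnover\t15\t16\t—\t4\t35
"""

def parse_row(line):
    """Split one tab-separated row into a dict of integer counts."""
    name, *cells = line.split("\t")
    # Assumption: a dash cell means "no claims", so it is read as 0.
    pos, neg, mixed, null, total = (0 if c == "—" else int(c) for c in cells)
    return {"outcome": name, "positive": pos, "negative": neg,
            "mixed": mixed, "null": null, "total": total}

def summarize(row):
    """Compute the negative share and the part of Total not covered
    by the four direction columns ("unclassified" is a made-up label)."""
    directional = row["positive"] + row["negative"] + row["mixed"] + row["null"]
    return {
        "outcome": row["outcome"],
        "negative_share": round(row["negative"] / row["total"], 3),
        "unclassified": row["total"] - directional,
    }

summaries = [summarize(parse_row(line)) for line in ROWS.splitlines()]
for s in summaries:
    print(s)
```

For these rows, the four direction columns of "Other" sum to 2285 against a Total of 2363, so Total appears to include claims with no recorded direction. Shares computed against Total and against the directional sum will therefore differ for some outcomes.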

Filter: Productivity

Pure behavioural teams (N=8) failed to scale beyond 74.1%.

Reported team performance metric for 'pure behavioural' teams with sample size N=8; maximum reported performance 74.1%.

high negative The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... team accuracy/performance ceiling

Fast AI induced instant, blind compliance; human accuracy under deception collapsed to 50.2%.

Reported experimental result comparing Fast/Less-Accurate AI condition to baseline conditions; numeric accuracy reported as 50.2% for humans under deception.

high negative The Timing Dependencies of Trust: Speed, Accuracy, and cBCI ... human accuracy under AI deception

There is an urgent question of how humans can effectively supervise and control an economy operated by AI agents when this system may expand beyond the capacity of traditional governance.

Framed as a central research/policy concern in the paper's abstract; conceptual argument rather than empirical finding.

high negative Regulatory Policy for the Agent Economy in the Digital Age: ... capacity of traditional governance to supervise/control AI-operated economy

The Agent Economy raises new regulatory challenges concerning data privacy, security, ethics, and the risk of job displacement.

Stated in paper abstract as identified risks; based on literature synthesis and comparative policy analysis approach (method described), but no empirical incidence metrics reported.

high negative Regulatory Policy for the Agent Economy in the Digital Age: ... regulatory challenges related to privacy, security, ethics, and job displacement...

The requirement that review + expected rework attention be lower than manual completion attention is substantially more stringent than the requirement that AI merely generate faster drafts.

Comparative analytical argument based on the model's derived stability conditions (theoretical/model-based reasoning; no empirical sample reported).

high negative Queue & AI: When Faster Tasks Slow Down the Workflow developer_productivity

Under congestion, reviewers rationally raise the risk threshold for checking AI outputs, reducing scrutiny precisely when it would matter the most.

Analytical implication derived from the queueing model presented in the paper (theoretical/model-based inference; no empirical validation reported).

high negative Queue & AI: When Faster Tasks Slow Down the Workflow decision_quality

Mean-based metrics (e.g., tasks completed per worker-hour or mean handle time) can misrepresent AI's effects in workflows where tasks accumulate and compete for scarce human attention.

Argument and analysis presented in the paper; theoretical reasoning and illustrative queueing model (no empirical sample reported).

high negative Queue & AI: When Faster Tasks Slow Down the Workflow task_completion_time

Regardless of apparent performance advances in AI technology, human and environmental factors of the organization may substantially attenuate — or even negate — the effective productivity benefits.

Conceptual argument in the paper; theoretical reasoning and literature synthesis (no primary empirical data reported in the abstract).

high negative Position: Adopting AI in Practice Does Not Guarantee the Pro... realized productivity benefits from AI deployment

Adopting AI in organizational practice does not guarantee productivity gains, because human and environmental factors critically moderate the relationship between AI deployment and realized productivity improvements.

Position paper's conceptual argument presented in the abstract; no empirical sample or quantitative study reported.

high negative Position: Adopting AI in Practice Does Not Guarantee the Pro... productivity gains (realized productivity improvements)

AI adoption presents workforce adaptation challenges.

Reported in the study's literature synthesis and thematic analysis of secondary sources (qualitative review). No sample size reported.

high negative Human–AI Collaboration in the Indian IT Industry: A Qualitat... workforce adaptation / need for retraining

AI adoption raises ethical considerations.

Authors' thematic evaluation of secondary literature identifying ethical issues associated with human-AI collaboration (qualitative synthesis). No sample size reported.

high negative Human–AI Collaboration in the Indian IT Industry: A Qualitat... ethical risks and considerations

AI adoption presents challenges related to skill gaps.

Thematic findings from peer-reviewed literature and secondary data (qualitative review). No sample size reported.

high negative Human–AI Collaboration in the Indian IT Industry: A Qualitat... skill gaps / workforce skill mismatch

There is a 'speedup illusion': people forecast independent completion times accurately but significantly underestimate AI-assisted completion times.

Empirical pattern reported in the abstract: comparison of predicted vs. actual times shows accurate independent forecasts but underestimation of AI-assisted completion times (preregistered study, N = 1237).

high negative Cognitive offloading and the speedup illusion in human-AI in... calibration of predicted vs actual completion time

As these systems scale, the bottleneck shifts away from raw model capability toward coordination.

Analytical/argumentative claim in the paper framing a shift in primary constraint; no empirical study or quantified benchmark reported.

high negative Foundation Protocol: A Coordination Layer for Agentic Societ... primary system bottleneck (model capability versus coordination capacity)

More persuasive narratives may have had a detrimental effect on the ability to discriminate between a correct and incorrect AI prediction.

Exploratory analyses in the paper reporting reduced discrimination between correct and incorrect AI predictions when narratives were more persuasive.

high negative Human Decision-Making with Persuasive and Narrative LLM Expl... ability to discriminate correct vs. incorrect AI predictions

More persuasive narratives may have had a detrimental effect on decision response times.

Exploratory analyses reported in the paper indicating persuasive narratives were associated with longer decision response times.

high negative Human Decision-Making with Persuasive and Narrative LLM Expl... decision response time

Higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings.

Argument based on review of current knowledge-work evaluation and benchmark design literature; paper motivates with conceptual analysis and references to empirical work showing mismatch between benchmark tasks and deployed work settings.

high negative Design and Report Benchmarks for Knowledge Work ability of a system to carry out knowledge work in real-world deployment setting...

Current systems still struggle with evidence preservation, reproducibility, weak-direction rejection, provenance tracking, cross-domain robustness, and accountable scientific closure.

Survey-identified recurring failure modes and limitations reported in literature and system descriptions; qualitative synthesis.

high negative AutoResearch AI: Towards AI-Powered Research Automation for ... capabilities related to evidence preservation, reproducibility, rejection of wea...

Current systems remain fragmented, differing in autonomy, domain scope, execution environment, validation mechanism, and human oversight.

Survey of existing systems and categorization across the listed dimensions; descriptive synthesis rather than an empirical meta-analysis.

high negative AutoResearch AI: Towards AI-Powered Research Automation for ... heterogeneity/fragmentation across AI research systems along autonomy, domain sc...

Even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications.

Empirical evaluation reported in the paper comparing two SOTA coding agents on a suite of 7 distributed key-value-store specifications; success counted as meeting the specification.

high negative Inductive Deductive Synthesis: Enabling AI to Generate Forma... output_quality

Large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks but their deployment in high-throughput, latency-sensitive environments remains impractical.

Statement about model performance on public benchmarks (upper bounds) and practical deployment constraints (throughput and latency), asserted by authors; no numerical deployment analysis provided in excerpt.

high negative HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLM... deployability / practicality in latency-sensitive, high-throughput environments

Evaluating state-of-the-art kernel agents on FastKernels, the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53×.

Empirical evaluation of multiple state-of-the-art kernel-generation agents on the FastKernels benchmark; aggregate speedup factors reported in abstract. The number of benchmark tasks is likely the FastKernels task set (46), though the abstract does not explicitly state the evaluation sample size for this measurement.

high negative FastKernels: Benchmarking GPU Kernel Generation in Productio... aggregate runtime speedup relative to production baselines

Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones.

Stated as motivating observation in the paper (conceptual/empirical critique of existing benchmark design and incentives). No numerical sample size given in the abstract.

high negative FastKernels: Benchmarking GPU Kernel Generation in Productio... benchmark-production alignment

Integrations of AI that neglect human factors are associated with increased anxiety, burnout, and disengagement among users.

Aggregate findings from the systematic review reporting associations in the literature between non-human-centered AI integration and negative psychological/work outcomes.

high negative Yapay Zeka Sistemleri ve İnsan İşbirliğinin Psikolojik, Sosy... anxiety, burnout, disengagement

Notable challenges to AI implementation include concerns about algorithmic bias, privacy, transparency, job displacement, organizational culture, and issues related to ethical and legal oversight.

Synthesis of reported challenges across the 29 empirical studies included in the scoping review.

high negative The influence of AI-Driven Employee Performance Management (... implementation barriers and risks (bias, privacy, transparency, displacement, cu...

Fragmented, uncoordinated approaches in the absence of national strategy constitute a structural barrier to technological development in Georgia.

Method: logical inference and country assessment presented in the paper documenting fragmentation across policy and institutional actors; qualitative evidence rather than quantitative causal estimation.

high negative Economic Impact of Artificial Intelligence and Policy Framew... barriers to technological development / policy fragmentation

In Georgia, the total absence of a national AI strategy and legal definition produces fragmented approaches, creating a structural barrier to technological development.

Method: country-level assessment of policy and legal framework for AI in Georgia; descriptive analysis identifying lack of a national strategy and definition. (No sample size reported.)

high negative Economic Impact of Artificial Intelligence and Policy Framew... technological development / policy coherence

Zero-shot evaluation shows the best positive-query mask success rate at IoU@0.75 remains below 0.17.

Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported mask success rate at IoU@0.75.

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... positive-query mask success rate at IoU@0.75

Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35.

Empirical evaluation reported in the paper: zero-shot tests across 26 model configurations with reported Set-F1 metric.

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... multi-target Set-F1

Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image.

Problem characterization / motivation described in the paper (qualitative reasoning about dataset and task properties).

high negative AgroVG: A Large-Scale Multi-Source Benchmark for Agricultura... difficulty of reliable evaluation for agricultural visual grounding

Technical bottlenecks (cross-border data compliance, algorithm interpretability) and ethical challenges (algorithmic bias, privacy infringement, cultural conflicts) are intertwined impediments to intelligent international marketing.

Synthesis of challenges identified across the reviewed literature (systematic review and content analysis, 2010–2025) as reported in the paper.

high negative Research on International Marketing in the Context of Intell... presence and interrelation of technical and ethical barriers

Traditional international marketing theories, constrained by static assumptions and linear logic, struggle to explain intelligent contexts.

Conclusion from the paper's systematic review and content analysis of core literature (2010–2025); no quantitative test or sample size reported in the summary.

high negative Research on International Marketing in the Context of Intell... theoretical explanatory adequacy of traditional international marketing theories

Cost and the lack of an applicable use case are the most cited barriers to AI adoption, followed by expertise.

Survey question(s) on barriers to adoption in the Census Bureau survey in which respondents reported reasons for not adopting AI; ranking provided in the paper (cost, lack of use case, then expertise).

high negative The Adoption of Industrial AI in America reported barriers to AI adoption (cost, applicability, expertise)

Intensity-weighted adoption is far lower than the 22.8 percent headline rate.

Survey-derived intensity-weighted measure of AI adoption constructed from the same Census Bureau survey (no numeric value reported in the excerpt).

high negative The Adoption of Industrial AI in America intensity-weighted AI adoption

Only 22.8 percent of plants report any AI use as of 2021.

Direct descriptive estimate from the Census Bureau survey of manufacturing establishments; year reported as 2021.

high negative The Adoption of Industrial AI in America share of plants reporting any AI use

ID-centric ranking models fail to generalize in livestreaming recommendation due to the short-lived nature of live rooms and poorly learned item IDs.

Authors' assertion linking the cold-start item ID problem to poor generalization of ID-centric rankers (motivating claim). No specific experimental metrics or sample sizes cited in the excerpt.

high negative FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... generalization performance of ID-centric ranking models

A live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state.

Authors' observational/operational claim about livestream characteristics stated in the paper (motivating problem statement). No sample size or quantitative backing provided in the excerpt.

high negative FLUID: From Ephemeral IDs to Multimodal Semantic Codes for I... cold-start state of item IDs (poorly learned embeddings)

Static benchmarks capture only part of how large language models behave in practice.

Argument supported by the paper's experimental design comparing static evaluations with a timed multi-phase Risk environment that includes repeated planning/execution loops and real-system constraints.

high negative Evaluating Large Language Models as Live Strategic Agents: P... coverage_of_model_behavior_by_static_benchmarks

In deployed settings, the effects of AI systems on human agency, creativity, and institutional well-being emerge over time, shaped by repeated interaction, reuse, and integration into real-world workflows, and these dynamics are rarely visible through pre-deployment evaluation or isolated prompt–response analysis.

Argumentative observation based on conceptual reasoning; no empirical data or sample size reported.

high negative Post-Deployment Observability as a Foundation for Well-Being... emergent effects on human agency and creativity arising from extended AI use

The most significant barriers to AI adoption reported by entrepreneurs are human-centred—talent scarcity, organisational resistance, and change management—rather than technology or cost alone.

Theme 'Barriers and the Adoption Journey' from thematic analysis of interviews (n=16); interviewees repeatedly cited human-centred barriers (talent scarcity, resistance, change management) over purely technical/cost barriers.

high negative Navigating the Intelligence Frontier: AI Adoption as a Succe... adoption barriers (human-centred constraints)

Raw interaction logs are inherently noisy, full of trial and error, low in information density, and inefficient for direct model training.

Author assertion describing properties of raw interaction logs; no empirical quantification provided in the excerpt.

high negative Echo: Learning from Experience Data via User-Driven Refineme... information density and training-efficiency of raw interaction logs

Static 'human data' is expensive to scale and bounded by the knowledge of its creators.

Author claim/argument in the paper's introduction; no empirical sample or quantitative test reported in the provided text.

high negative Echo: Learning from Experience Data via User-Driven Refineme... scalability and knowledge coverage of human-generated training data

People exhibit self-estimate miscalibration: on average they believe they are using AI less than they actually are.

Same three pre-registered user studies (combined N = 2691) comparing participants' self-reported AI use against observed/recorded AI use during tasks.

high negative The efficiency-gain illusion: People underestimate the rate ... discrepancy_between_self_reported_and_actual_AI_use

Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall.

Within-study comparison of low-information AI assistance versus other conditions in the controlled logical reasoning task; immediate and post-AI performance measured (sample size not reported in abstract).

high negative The Impact of AI Usage and Informativeness on Skill Developm... immediate performance and post-AI performance (skill retention/learning)

Greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI.

Controlled experiment using a logical reasoning task with on-demand AI assistance; comparison between heavy users, light users, and matched non-users reported in the study (sample size not stated in abstract).

high negative The Impact of AI Usage and Informativeness on Skill Developm... skill development / performance after AI assistance removed

Regulatory uncertainty and the absence of explicit legislation on digital data and artificial intelligence may leave the economic potential of these technologies unexplored while increasing market concentration, inequality, and the risk of personal information misuse.

Argued implications from the paper's theoretical model and comparative legal discussion; no empirical testing or quantified analysis provided.

high negative ECONOMIC SYSTEMS IN THE CONTEXT OF DIGITALISATION AND AI: TH... risk of unexploited economic potential, market concentration, inequality, and da...

Studies finding true synergy are scarce.

Authors' literature synthesis / meta-analytic overview claiming that few studies report combined human-AI performance exceeding both parties alone (no numerical count provided).

high negative Addressing the Synergy Gap: The Six Elements of the Design S... number/prevalence of studies reporting genuine synergy

Genuine human-AI synergy—combined performance that exceeds what either party achieves alone—is uncommon.

Authors' synthesis of the literature and meta-analytic findings referenced in the paper indicating scarcity of studies showing combined performance > either alone (no specific counts or sample sizes given in the excerpt).

high negative Addressing the Synergy Gap: The Six Elements of the Design S... frequency/prevalence of human-AI combinations achieving superior combined perfor...

Agentic systems show persistent failures in repository setup, dependency handling, permission gating, and hardware verification.

Issue-resolution benchmarks and hardware/RTL verification research synthesized in the paper (specific failure rates or sample sizes not provided in abstract).

high negative Agentic Agile-V: From Vibe Coding to Verified Engineering in... failure modes/errors in repository and hardware-related tasks

Controlled studies report slowdowns in mature open-source work when using agentic/code-generation systems.

Controlled studies and trials cited in the paper (no sample sizes given in abstract).

high negative Agentic Agile-V: From Vibe Coding to Verified Engineering in... productivity/performance in mature open-source development
