Evidence (14922 claims)

Search and filter individual claims pulled from the papers. Looking for a specific finding ("what's the effect on wages?"), you're in the right place. Want to compare whole outcome categories against each other instead? Use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	795	210	105	955	2131
Governance & Regulation	886	414	197	126	1654
Organizational Efficiency	826	204	129	87	1257
Technology Adoption Rate	681	259	128	110	1189
Research Productivity	464	138	65	349	1028
Output Quality	503	196	61	53	813
Decision Quality	351	180	84	51	673
AI Safety & Ethics	238	288	71	34	637
Firm Productivity	455	58	92	20	631
Market Structure	186	172	123	25	511
Task Allocation	222	70	76	34	407
Innovation Output	238	28	48	18	334
Skill Acquisition	177	62	62	17	318
Employment Level	107	57	108	13	287
Fiscal & Macroeconomic	135	72	44	26	284
Firm Revenue	172	50	28	5	256
Consumer Welfare	121	68	45	12	246
Task Completion Time	183	33	10	13	240
Inequality Measures	45	126	50	6	227
Worker Satisfaction	95	74	23	12	204
Error Rate	77	98	11	4	190
Regulatory Compliance	84	73	17	7	181
Automation Exposure	61	61	27	14	166
Training Effectiveness	98	21	14	19	154
Wages & Compensation	78	37	25	6	146
Developer Productivity	105	18	14	6	144
Team Performance	87	17	28	10	143
Job Displacement	12	83	23	1	119
Hiring & Recruitment	53	8	8	3	72
Social Protection	39	17	8	2	66
Creative Output	32	20	8	3	64
Skill Obsolescence	5	50	6	1	62
Labor Share of Income	17	20	17	—	54
Worker Turnover	15	15	—	3	33
Industry	—	—	—	1	1

Autonomous penetration capability continues to improve alongside advances in overall model capability.

Observed monotonic/positive relationship reported between model capability (presumably model size or general capability metrics) and penetration success across evaluated models.

medium positive The Emergence of Autonomous Penetration Capabilities in Larg... relationship between overall model capability and autonomous penetration success

CloudCons and the authors' analyses provide actionable guidelines and vital insights for real-world deployment decisions of forecasting-driven consolidation.

Authors' synthesis and recommended calibration rules based on their empirical experiments.

medium positive CloudCons: A Comprehensive End-to-End Benchmark for Cloud Re... practical guidance for deployment decisions

Hosting the precomputed KV provider-side (removing egress) enables reuse without the egress cost, analogous to production prompt-caching.

Architectural argument and analogy to existing provider-side prompt-caching practices described by authors.

medium positive Can I Buy Your KV Cache? egress_cost_elimination (by provider-side hosting)

Structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes.

Authors' synthesis of findings across the two controlled experiments (short-term self-reported measures and behavioral/metric improvements).

medium positive Automated Mediator for Human Negotiation: Pre-Mediation via ... scalability and effort required for pre-mediation support; comparability on shor...

The study triangulates concept, practice and market evidence in a single crosswalk, clarifying where Aviation 4.0 potential has materialised and where principal barriers persist, providing an evidence-based roadmap for MRO executives and policymakers.

Synthesis of the paper's multiple methods (literature review, survey, interviews, case studies) leading to an asserted contribution; method = cross-method triangulation within the study.

medium positive Aviation 4.0: the impacts of digital transformation on the a... clarity of mapping between conceptual potential and realised implementations (pr...

The geometry replicates under an encoder swap to BGE: 'LLM-class OAI lead' replicates at 3.37x.

Encoder swap stress-test described by authors (embedding encoder changed to BGE), with reported replication factor 3.37x for LLM-class OAI lead.

medium positive Stable Geometry, Reversing Poles: The Bipolar Structure of A... replication of LLM-derived OAI lead when using alternate embedding encoder (BGE)

Computer changes the scope of work that users attempt: queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks bundling interdependent subtasks, and unlock work activities that are essentially absent from Search usage among the same users.

Analysis of query content and categories in Perplexity production data comparing the types of work attempted with Computer versus Search within the same user base (content classification of occupational domain, cognitive level, expertise breadth, and task composition).

medium positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... scope and cognitive/occupational breadth of attempted tasks

Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement.

Qualitative and quantitative analysis of product logs showing Computer performing decomposition and end-to-end execution steps that correspond to manual orchestration by Search users in matched-session comparisons.

medium positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... task allocation between human and AI (automation of subtasks)

Self-evolution (rewriting adapter contents from prior trajectories) further improves SIGA, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration.

Experimental comparison in the paper showing performance improvements after applying self-evolution to SIGA; claims of highest held-out mean and parity/outperformance vs hand-designed configuration.

medium positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... held-out GEOS mean performance (e.g., TreeSim)

Participatory AI systems substantially improve on each contributor's original priorities.

Experiments described in the paper comparing the participatory/compositional system's outputs to individual contributors' models, showing improvement relative to contributors' stated priorities (no numerical details in excerpt).

medium positive Scaling Participation in Modular AI Systems alignment / performance on contributors' priority objectives

Survey responses and interviews indicate a broader range of emerging competencies, suggesting the spectrum of required advanced digital skills is likely to expand in the near future.

Paper synthesizes survey and interview findings to infer an expanding set of competencies; this is a forward-looking interpretation rather than a strictly observed quantitative trend; no forecast model or time-series data reported.

medium positive Advanced digital skills demands and priorities in wind energ... anticipated expansion in range of required skills

A competent human paired with a frontier model can outperform current peer review.

Author argues—based on the experimental results and comparative performance—that human+frontier-model collaboration can exceed existing peer-review processes in finding/correcting errors.

medium positive Can AI Refute Economic Theory? Evidence from Beyond the Know... effectiveness_of_error_detection_relative_to_peer_review

These findings and institutional lessons extend beyond programming to credentialing systems (medical and legal boards, professional certification) that certify skill in a workforce increasingly shaped by AI.

Generalization / policy claim offered by authors (normative extrapolation from programming contest evidence to other credentialing systems).

medium positive When the Scaffold Stays On: AI, Practice Style, and Screenin... applicability of findings to credentialing systems' design and certification out...

Two levers follow from the contrast: (1) how AI is integrated into training, since within the screened pool AI-style practice coincides with stronger non-AI-aided performance; and (2) the design of AI-prohibited evaluation gates as a type-separating institution.

Interpretation and policy implication drawn from empirical results (conceptual recommendation; not a directly tested intervention in the paper).

medium positive When the Scaffold Stays On: AI, Practice Style, and Screenin... policy levers affecting skill certification and training outcomes

Inside the AI-prohibited ICPC environment, a shift toward AI-style practice predicts higher non-AI-aided scores for AI-era entrants.

Within-ICPC empirical analysis comparing entrants across eras (pre/post AI) and relating practice signature to ICPC non-AI-aided scores; specific sample size and estimates not provided in abstract.

medium positive When the Scaffold Stays On: AI, Practice Style, and Screenin... non-AI-aided ICPC scores

Archi enables fully private management of sensitive data by using locally-hosted, open-weight models.

Paper statement tying local hosting of open-weight models to the ability to manage sensitive data privately; no technical privacy audit or measurements reported in the quoted text.

medium positive Archi: Agentic Operations at the CMS Experiment privacy / data management capability

Locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

Paper's comparative claim about model performance based on the same evaluation (human and automated grading of production question set); asserts that locally-hosted open-weight models are competitive and support private data management.

medium positive Archi: Agentic Operations at the CMS Experiment model performance (competitiveness) and capability for private data management

The system proves effective at operational tasks, resolving real-world queries posed by CMS operators.

Results reported from the evaluation using operator feedback and the production question set graded by human and automated panels (no numerical success rates provided in the text quoted).

medium positive Archi: Agentic Operations at the CMS Experiment resolution of real-world queries / task completion

Subgroup analysis reveals AACT can be particularly beneficial for some decision-makers such as those very familiar with AI technologies.

Subgroup analysis reported in the house price prediction case study indicating heterogenous effects by familiarity with AI (no subgroup sample sizes provided in abstract).

medium positive Understanding the Effects of AI-Assisted Critical Thinking o... decision improvement for users familiar with AI (reduced over-reliance / improve...

Existing insurance products are adapting to address agentic-AI exposures.

Market and product analysis discussed in the paper evaluating how cyber, professional liability, product liability and other products are being modified; descriptive review rather than systematic empirical measurement.

medium positive Insurance of Agentic AI adaptation of existing insurance products to agentic-AI risks

The composition pattern suggests AI-consistent drafting includes a modest, suggestive increase in name-inferred female plaintiffs.

Analysis of name-inferred gender among AI-flagged complaints compared to baseline; authors describe the increase as modest and suggestive.

medium positive The New Pro Se: Generative AI and the Surge in Federal Civil... share of name-inferred female plaintiffs among AI-flagged complaints

These findings can guide AI risk prioritization and clarify expert expectations about who should bear responsibility for mitigation.

Author interpretation of study results; paper asserts applicability of findings to policy/prioritization.

medium positive Prioritization of Risks from Artificial Intelligence: A Delp... utility of study findings for risk prioritization and responsibility assignment

AI assistance shows promise for increasing discretionary but beneficial work (tasks users intend but often skip) while preserving human control over final outcomes.

Synthesis/generalization based on randomized field experiment results (increased feedback provision and length; no negative effects on usefulness or time per character) and supporting qualitative interview findings. Empirical data from a 300-level ML course with 11 TAs and 88 students.

medium positive AI Assistance for Discretionary Work: Increasing Feedback Pr... participation in discretionary beneficial tasks (feedback provision) and preserv...

Tool-augmented AI can transform behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning by learning from experimental data and generating improved domain-relevant interventions.

Authors' synthesis and interpretation based on the two-stage field experiments and performance of AI-generated interventions reported in the paper.

medium positive Beyond One-shot: AI Agents for Learning in Field Experiments scalability and cumulative learning capability of behavioral experimentation sys...

Information asymmetry positively influences industrial robot use, which in turn impacts MVCR.

Empirical analysis reported linking measures of information asymmetry to higher industrial robot application, and subsequent effects on MVCR (mediation/causal pathway analysis).

medium positive Industrial Robot Application and the Manufacturing Value Cha... industrial robot application (primary) and MVCR (secondary/mediated)

Industrial robots affect MVCR through mechanisms including cost reduction, fostering innovation, and enhancing productivity.

Mechanism/mediating analysis (as reported): variables or channels related to costs, innovation indicators, and productivity used to interpret how robot adoption affects MVCR.

medium positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) via mediators: costs, innovation, pr...

Persona responsiveness grows as models lean more on training-data priors and richer context integration.

Interpretive conclusion linking observed greater persona sensitivity to models' reliance on priors/context; presented as consistent with audit patterns (and with retrieval-attribution differences).

medium positive Persona Conditioning of Brand Recommendations in Retrieval-A... relationship between model reliance on priors/context and persona responsiveness

A strategic labor division emerged: the LLM serves as a generative engine to mitigate teacher burnout.

Claim in the abstract describing the role allocation observed in the system; implies LLMs reduced teacher workload/burnout based on the system's deployment and analysis. No numeric measure of burnout provided in the abstract.

medium positive Double-Edged Sword or Sharp Tool? Designing and Evaluating T... teacher burnout / workload

Embodied AI shapes collaboration in complex ways, and social cues critically guide teamwork dynamics.

Synthesis and interpretation of experimental findings (performance variability, completion rates, time, errors, conversational analyses) presented in the paper; this is a theoretical/concluding claim derived from reported results rather than a single empirical estimate.

medium positive Teaming Up with Artificial Agents in Non-routine Analytical ... influence of social cues/embodiment on teamwork dynamics

Beyond replacing repetitive manual labor, AI has penetrated into complex cognitive labor fields once deemed hard to automate, reshaping industry work paradigms, blurring traditional occupational boundaries, and triggering an unprecedented structural transformation in the labor market.

Framing/background claim in the paper describing observed trends and technological developments; the excerpt does not cite specific empirical tests or data for this broad statement.

medium positive Impact of artificial intelligence innovation on labor struct... penetration of AI into complex cognitive tasks / automation exposure of cognitiv...

The utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.

Stated generalization claim in the paper (conceptual / engineering claim about plug-in compatibility; may be supported by implementation details or experiments but the abstract states it as a general advantage).

medium positive Utility-Aware Multimodal Contrastive Learning for Product Im... ability to embed component into generative models to improve commercial outcomes

The framework closes scheduling inefficiencies of up to 28%.

Paper claims the constructs close documented gaps including scheduling inefficiencies of up to 28%; the abstract does not specify the empirical study, dataset, or sample size supporting this percentage.

medium positive Workforce Unit Abstraction for Governing Hybrid Human and Ar... scheduling inefficiency (presumably measured as percent inefficiency in scheduli...

Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

Interpretive/mechanistic claim presented by the authors, likely supported by qualitative analysis or ablations in the paper (mechanism explanation).

medium positive SIA: Self Improving AI with Harness & Weight Updates mechanistic roles of harness updates vs weight updates

Human-generated translation data has acquired a premium status in the era of model collapse, increasing its value to model developers.

Argumentative synthesis comparing open vs proprietary models, discussions of 'model collapse' and industry preferences for human-generated data; the paper draws on contemporary discourse and examples rather than presenting new quantitative estimates. No numerical sample reported.

medium positive Translators as Invisible Teachers of AI: Copyright, Translat... market valuation/premium of human-generated data for models

The cumulative-languages effect grows with time since adoption, consistent with a Bayesian-learning model in which AI provides free signals about unfamiliar technologies and lowers the switching barrier.

Dynamic analysis of treatment effects over time since adoption in the same panel; authors compare empirical dynamic pattern to predictions from a Bayesian-learning theoretical model.

medium positive Coding Beyond Your Training: Claude Code and the Technologic... growth in cumulative lifetime languages over time since adoption

In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable.

Live Postgres panel experiment with three Azure-hosted models; reported outcomes: no realized loss at low budget and differences in underwriting persistence by model identity.

medium positive Insuring Every Action: An Authority Frontier Framework for R... realized loss prevention and underwriting persistence under denial across models

AI feedback may provide the greatest benefit where access to timely critique is otherwise limited (implied by stronger effects in non-English regions, less-embedded manuscripts, lower-h-index teams, and earlier career stages).

Interpretation of heterogeneous treatment effects from the randomized experiment; subgroup patterns indicating larger effects where conventional access to critique is plausibly limited.

medium positive Human-AI Collaboration in Science at Scale: A Global Large-s... relative benefit of AI feedback across contexts (inferred from heterogeneous eff...

The results inform industrial policies focused on workforce adaptation and managing the digital transition in manufacturing.

Policy implication drawn by the authors from the empirical results (positive association between digital transformation and labor demand, plus heterogeneous effects).

medium positive How Does Digital Transformation Reshape Manufacturing Firms'... policy relevance for workforce adaptation and digital transition management

Rising employee digital literacy (from digital transformation) promotes both the amount of labor demanded and the intensity of factor input.

Mechanism/mediation analysis reported in the paper linking digital transformation → employee digital literacy → labor demand and factor-input intensity (Chinese A-share manufacturing firms, 2011–2024). (Sample size not stated in provided text.)

medium positive How Does Digital Transformation Reshape Manufacturing Firms'... labor demand and intensity of factor input

FastKernels substantially exceeds upstream references on under-served architectures.

Comparative performance reported versus upstream reference implementations on certain 'under-served' architectures (asserted in abstract; specific architectures and numeric gains not given there).

medium positive FastKernels: Benchmarking GPU Kernel Generation in Productio... inference performance relative to upstream references on under-served architectu...

FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving.

Benchmarking experiments comparing FastKernels' inference performance to vLLM and SGLang on mainstream LLM serving workloads (stated in abstract; specifics of benchmarks/tasks not given there).

medium positive FastKernels: Benchmarking GPU Kernel Generation in Productio... inference throughput / runtime performance (parity vs. vLLM and SGLang)

The simplest practical fix for evaluation pipelines is to use a fresh context per item; when batching is unavoidable, balancing the history helps reduce bias.

Empirical recommendation based on experiments showing batch-history-induced bias and mitigation via fresh contexts and balanced histories (reported as practical guidance).

medium positive AMEL: Accumulated Message Effects on LLM Judgments effectiveness of mitigation strategies (fresh context per item; balancing histor...

Gemini converts more turns into deep conquest chains, even though it is not the cleanest runtime.

Trace analysis from the provider championship indicating a higher rate of turns leading to deep conquest sequences for Gemini, along with observations about runtime cleanliness/reliability.

medium positive Evaluating Large Language Models as Live Strategic Agents: P... conversion_rate_of_turns_into_deep_conquest_chains

Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches.

Analysis of saved planning traces from the provider championship showing higher frequency of terminal-objective references by Gemini and an increasing trend near victory.

medium positive Evaluating Large Language Models as Live Strategic Agents: P... frequency_of_terminal_objective_references (objective tracking)

A regional integration strategy is critical to achieving coordinated development of digital talent agglomeration and industrial digitalization and thereby promoting regional economic growth.

Policy implication offered by the authors, motivated by regional heterogeneity in empirical results (e.g., positive interaction in Yangtze River Delta versus deviations elsewhere). This is presented as a recommendation rather than a directly tested causal claim.

medium positive Emerging Technology-Driven Development: The Interactive Rela... coordination of digital talent and industrial digitalization / regional economic...

People are increasingly turning to AI assistance for simple tasks (e.g., arithmetic, spell-check, answering simple questions).

Background/contextual statement in the paper's abstract; not tied to the paper's reported empirical studies within the abstract itself.

medium positive The efficiency-gain illusion: People underestimate the rate ... trend_in_AI_adoption_for_simple_tasks

Depending on context, AI can either complement human skill development by amplifying independent reasoning or act as a substitute that undermines such reasoning; therefore regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.

Interpretation and policy implication drawn from the controlled experiment's observed variation by AI usage intensity and informativeness (experimental details and sample size not provided in abstract).

medium positive The Impact of AI Usage and Informativeness on Skill Developm... policy relevance for skill development (recommendation to regulate AI access/usa...

Engagement rises to 1.35 baseline.

Reported engagement metric in paper based on telemetry; phrasing in paper is ambiguous ('rises to 1.35 baseline').

medium positive Privacy-by-Design Adaptive Group Assignment for Digital Life... engagement (as reported in paper)

Prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique [compiling procedures into model weights / subterranean agents] works.

Citation/listing of six prior systems (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) asserted to demonstrate the approach; empirical/experimental results in those prior works are invoked as support.

medium positive Compiling Agentic Workflows into LLM Weights: Near-Frontier ... feasibility/effectiveness of compiling procedures into model weights (approach s...

Recent work has shown this [orchestration] architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a].

Citation to Dennis et al., 2026a; claim refers to experimental results in that prior work comparing orchestration vs. providing procedures in frontier model system prompts on procedural tasks.

medium positive Compiling Agentic Workflows into LLM Weights: Near-Frontier ... performance on procedural tasks (dominance of system-prompted frontier model app...

« Prev 1 2 3 … 240 241 242 … 298 299 Next »