Evidence (14922 claims)
Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.
The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).
Browse by theme
Nine broad, paper-level topics. Click one to filter the claims below.
Adoption: 9047 claims
Productivity: 8066 claims
Governance: 7278 claims
Human-AI Collaboration: 6912 claims
Org Design: 4439 claims
Innovation: 4359 claims
Labor Markets: 3652 claims
Skills & Training: 3018 claims
Inequality: 2160 claims
Claims by outcome category
Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 795 | 210 | 105 | 955 | 2131 |
| Governance & Regulation | 886 | 414 | 197 | 126 | 1654 |
| Organizational Efficiency | 826 | 204 | 129 | 87 | 1257 |
| Technology Adoption Rate | 681 | 259 | 128 | 110 | 1189 |
| Research Productivity | 464 | 138 | 65 | 349 | 1028 |
| Output Quality | 503 | 196 | 61 | 53 | 813 |
| Decision Quality | 351 | 180 | 84 | 51 | 673 |
| AI Safety & Ethics | 238 | 288 | 71 | 34 | 637 |
| Firm Productivity | 455 | 58 | 92 | 20 | 631 |
| Market Structure | 186 | 172 | 123 | 25 | 511 |
| Task Allocation | 222 | 70 | 76 | 34 | 407 |
| Innovation Output | 238 | 28 | 48 | 18 | 334 |
| Skill Acquisition | 177 | 62 | 62 | 17 | 318 |
| Employment Level | 107 | 57 | 108 | 13 | 287 |
| Fiscal & Macroeconomic | 135 | 72 | 44 | 26 | 284 |
| Firm Revenue | 172 | 50 | 28 | 5 | 256 |
| Consumer Welfare | 121 | 68 | 45 | 12 | 246 |
| Task Completion Time | 183 | 33 | 10 | 13 | 240 |
| Inequality Measures | 45 | 126 | 50 | 6 | 227 |
| Worker Satisfaction | 95 | 74 | 23 | 12 | 204 |
| Error Rate | 77 | 98 | 11 | 4 | 190 |
| Regulatory Compliance | 84 | 73 | 17 | 7 | 181 |
| Automation Exposure | 61 | 61 | 27 | 14 | 166 |
| Training Effectiveness | 98 | 21 | 14 | 19 | 154 |
| Wages & Compensation | 78 | 37 | 25 | 6 | 146 |
| Developer Productivity | 105 | 18 | 14 | 6 | 144 |
| Team Performance | 87 | 17 | 28 | 10 | 143 |
| Job Displacement | 12 | 83 | 23 | 1 | 119 |
| Hiring & Recruitment | 53 | 8 | 8 | 3 | 72 |
| Social Protection | 39 | 17 | 8 | 2 | 66 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 50 | 6 | 1 | 62 |
| Labor Share of Income | 17 | 20 | 17 | — | 54 |
| Worker Turnover | 15 | 15 | — | 3 | 33 |
| Industry | — | — | — | 1 | 1 |
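The counts above can be turned into direction shares with a few lines of arithmetic. The sketch below is illustrative only: it copies three rows from the table and divides each direction count by the row's listed Total. The names `ROWS`, `direction_shares`, and `dominant_direction` are hypothetical helpers, not part of the site. In some rows the four direction columns add up to less than the listed Total (Firm Productivity sums to 625 against a Total of 631), so shares computed this way need not sum to 100%.

```python
# Counts copied from three rows of the table above:
# (positive, negative, mixed, null, total)
ROWS = {
    "Job Displacement": (12, 83, 23, 1, 119),
    "Wages & Compensation": (78, 37, 25, 6, 146),
    "Error Rate": (77, 98, 11, 4, 190),
}

DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the row's listed Total."""
    *by_direction, total = counts
    return {name: n / total for name, n in zip(DIRECTIONS, by_direction)}


def dominant_direction(counts):
    """Return the direction with the most claims in a row."""
    shares = direction_shares(counts)
    return max(shares, key=shares.get)


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    summary = ", ".join(f"{name} {share:.0%}" for name, share in shares.items())
    print(f"{outcome}: {summary} -> mostly {dominant_direction(counts)}")
```

Read this way, Job Displacement claims are mostly negative (83 of 119), while Wages & Compensation claims are mostly positive (78 of 146).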
Autonomous penetration capability continues to improve alongside advances in overall model capability.
A monotonic, positive relationship is reported between model capability (presumably model size or general capability metrics) and penetration success across the evaluated models.
CloudCons and the authors' analyses provide actionable guidelines and vital insights for real-world deployment decisions of forecasting-driven consolidation.
Authors' synthesis and recommended calibration rules based on their empirical experiments.
Hosting the precomputed KV provider-side (removing egress) enables reuse without the egress cost, analogous to production prompt-caching.
Architectural argument and analogy to existing provider-side prompt-caching practices described by authors.
Structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes.
Authors' synthesis of findings across the two controlled experiments (short-term self-reported measures and behavioral/metric improvements).
The study triangulates concept, practice and market evidence in a single crosswalk, clarifying where Aviation 4.0 potential has materialised and where principal barriers persist, providing an evidence-based roadmap for MRO executives and policymakers.
Synthesis of the paper's multiple methods (literature review, survey, interviews, case studies) leading to an asserted contribution; method = cross-method triangulation within the study.
The geometry replicates under an encoder swap to BGE: 'LLM-class OAI lead' replicates at 3.37x.
Encoder swap stress-test described by authors (embedding encoder changed to BGE), with reported replication factor 3.37x for LLM-class OAI lead.
Computer changes the scope of work that users attempt: queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks bundling interdependent subtasks, and unlock work activities that are essentially absent from Search usage among the same users.
Analysis of query content and categories in Perplexity production data comparing the types of work attempted with Computer versus Search within the same user base (content classification of occupational domain, cognitive level, expertise breadth, and task composition).
Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement.
Qualitative and quantitative analysis of product logs showing Computer performing decomposition and end-to-end execution steps that correspond to manual orchestration by Search users in matched-session comparisons.
Self-evolution (rewriting adapter contents from prior trajectories) further improves SIGA, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration.
Experimental comparison in the paper showing performance improvements after applying self-evolution to SIGA; claims of highest held-out mean and parity/outperformance vs hand-designed configuration.
Participatory AI systems substantially improve on each contributor's original priorities.
Experiments described in the paper comparing the participatory/compositional system's outputs to individual contributors' models, showing improvement relative to contributors' stated priorities (no numerical details in excerpt).
Survey responses and interviews indicate a broader range of emerging competencies, suggesting the spectrum of required advanced digital skills is likely to expand in the near future.
Paper synthesizes survey and interview findings to infer an expanding set of competencies; this is a forward-looking interpretation rather than a strictly observed quantitative trend; no forecast model or time-series data reported.
A competent human paired with a frontier model can outperform current peer review.
Author argues, based on the experimental results and comparative performance, that human+frontier-model collaboration can exceed existing peer-review processes in finding and correcting errors.
These findings and institutional lessons extend beyond programming to credentialing systems (medical and legal boards, professional certification) that certify skill in a workforce increasingly shaped by AI.
Generalization / policy claim offered by authors (normative extrapolation from programming contest evidence to other credentialing systems).
Two levers follow from the contrast: (1) how AI is integrated into training, since within the screened pool AI-style practice coincides with stronger non-AI-aided performance; and (2) the design of AI-prohibited evaluation gates as a type-separating institution.
Interpretation and policy implication drawn from empirical results (conceptual recommendation; not a directly tested intervention in the paper).
Inside the AI-prohibited ICPC environment, a shift toward AI-style practice predicts higher non-AI-aided scores for AI-era entrants.
Within-ICPC empirical analysis comparing entrants across eras (pre/post AI) and relating practice signature to ICPC non-AI-aided scores; specific sample size and estimates not provided in abstract.
Archi enables fully private management of sensitive data by using locally-hosted, open-weight models.
Paper statement tying local hosting of open-weight models to the ability to manage sensitive data privately; no technical privacy audit or measurements reported in the quoted text.
Locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.
Paper's comparative claim about model performance based on the same evaluation (human and automated grading of production question set); asserts that locally-hosted open-weight models are competitive and support private data management.
The system proves effective at operational tasks, resolving real-world queries posed by CMS operators.
Results reported from the evaluation using operator feedback and the production question set graded by human and automated panels (no numerical success rates provided in the text quoted).
Subgroup analysis reveals AACT can be particularly beneficial for some decision-makers such as those very familiar with AI technologies.
Subgroup analysis reported in the house price prediction case study indicating heterogeneous effects by familiarity with AI (no subgroup sample sizes provided in abstract).
Existing insurance products are adapting to address agentic-AI exposures.
Market and product analysis discussed in the paper evaluating how cyber, professional liability, product liability and other products are being modified; descriptive review rather than systematic empirical measurement.
The composition pattern suggests AI-consistent drafting includes a modest, suggestive increase in name-inferred female plaintiffs.
Analysis of name-inferred gender among AI-flagged complaints compared to baseline; authors describe the increase as modest and suggestive.
These findings can guide AI risk prioritization and clarify expert expectations about who should bear responsibility for mitigation.
Author interpretation of study results; paper asserts applicability of findings to policy/prioritization.
AI assistance shows promise for increasing discretionary but beneficial work (tasks users intend but often skip) while preserving human control over final outcomes.
Synthesis/generalization based on randomized field experiment results (increased feedback provision and length; no negative effects on usefulness or time per character) and supporting qualitative interview findings. Empirical data from a 300-level ML course with 11 TAs and 88 students.
Tool-augmented AI can transform behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning by learning from experimental data and generating improved domain-relevant interventions.
Authors' synthesis and interpretation based on the two-stage field experiments and performance of AI-generated interventions reported in the paper.
Information asymmetry positively influences industrial robot use, which in turn impacts MVCR.
Empirical analysis reported linking measures of information asymmetry to higher industrial robot application, and subsequent effects on MVCR (mediation/causal pathway analysis).
Industrial robots affect MVCR through mechanisms including cost reduction, fostering innovation, and enhancing productivity.
Mechanism/mediating analysis (as reported): variables or channels related to costs, innovation indicators, and productivity used to interpret how robot adoption affects MVCR.
Persona responsiveness grows as models lean more on training-data priors and richer context integration.
Interpretive conclusion linking observed greater persona sensitivity to models' reliance on priors/context; presented as consistent with audit patterns (and with retrieval-attribution differences).
A strategic labor division emerged: the LLM serves as a generative engine to mitigate teacher burnout.
Claim in the abstract describing the role allocation observed in the system; implies LLMs reduced teacher workload/burnout based on the system's deployment and analysis. No numeric measure of burnout provided in the abstract.
Embodied AI shapes collaboration in complex ways, and social cues critically guide teamwork dynamics.
Synthesis and interpretation of experimental findings (performance variability, completion rates, time, errors, conversational analyses) presented in the paper; this is a theoretical/concluding claim derived from reported results rather than a single empirical estimate.
Beyond replacing repetitive manual labor, AI has penetrated into complex cognitive labor fields once deemed hard to automate, reshaping industry work paradigms, blurring traditional occupational boundaries, and triggering an unprecedented structural transformation in the labor market.
Framing/background claim in the paper describing observed trends and technological developments; the excerpt does not cite specific empirical tests or data for this broad statement.
The utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.
Stated generalization claim in the paper (conceptual / engineering claim about plug-in compatibility; may be supported by implementation details or experiments but the abstract states it as a general advantage).
The framework closes scheduling inefficiencies of up to 28%.
Paper claims the constructs close documented gaps including scheduling inefficiencies of up to 28%; the abstract does not specify the empirical study, dataset, or sample size supporting this percentage.
Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.
Interpretive/mechanistic claim presented by the authors, likely supported by qualitative analysis or ablations in the paper (mechanism explanation).
Human-generated translation data has acquired a premium status in the era of model collapse, increasing its value to model developers.
Argumentative synthesis comparing open vs proprietary models, discussions of 'model collapse' and industry preferences for human-generated data; the paper draws on contemporary discourse and examples rather than presenting new quantitative estimates. No numerical sample reported.
The cumulative-languages effect grows with time since adoption, consistent with a Bayesian-learning model in which AI provides free signals about unfamiliar technologies and lowers the switching barrier.
Dynamic analysis of treatment effects over time since adoption in the same panel; authors compare empirical dynamic pattern to predictions from a Bayesian-learning theoretical model.
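One way to read the Bayesian-learning mechanism named in this claim is through the textbook normal-normal update, in which uncertainty about an unfamiliar technology shrinks as independent signals accumulate. The sketch below is an illustrative assumption, not the paper's model: the prior variance, the signal noise variance, and the function name `posterior_variance` are all hypothetical.

```python
def posterior_variance(prior_var, noise_var, n_signals):
    """Variance of a normal belief after n independent normal signals.

    Standard normal-normal conjugate update: precisions (1 / variance) add.
    """
    return 1.0 / (1.0 / prior_var + n_signals / noise_var)


# Hypothetical numbers: a diffuse prior (variance 4.0) and unit-variance signals.
for n in (0, 1, 5, 20):
    print(n, round(posterior_variance(4.0, 1.0, n), 3))
```

Under these assumptions the remaining uncertainty falls with every extra signal, which is the qualitative pattern that an effect growing with time since adoption would be consistent with.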
In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable.
Live Postgres panel experiment with three Azure-hosted models; reported outcomes: no realized loss at low budget and differences in underwriting persistence by model identity.
AI feedback may provide the greatest benefit where access to timely critique is otherwise limited (implied by stronger effects in non-English regions, less-embedded manuscripts, lower-h-index teams, and earlier career stages).
Interpretation of heterogeneous treatment effects from the randomized experiment; subgroup patterns indicating larger effects where conventional access to critique is plausibly limited.
The results inform industrial policies focused on workforce adaptation and managing the digital transition in manufacturing.
Policy implication drawn by the authors from the empirical results (positive association between digital transformation and labor demand, plus heterogeneous effects).
Rising employee digital literacy (from digital transformation) promotes both the amount of labor demanded and the intensity of factor input.
Mechanism/mediation analysis reported in the paper linking digital transformation → employee digital literacy → labor demand and factor-input intensity (Chinese A-share manufacturing firms, 2011–2024). (Sample size not stated in provided text.)
FastKernels substantially exceeds upstream references on under-served architectures.
Comparative performance reported versus upstream reference implementations on certain 'under-served' architectures (asserted in abstract; specific architectures and numeric gains not given there).
FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving.
Benchmarking experiments comparing FastKernels' inference performance to vLLM and SGLang on mainstream LLM serving workloads (stated in abstract; specifics of benchmarks/tasks not given there).
The simplest practical fix for evaluation pipelines is to use a fresh context per item; when batching is unavoidable, balancing the history helps reduce bias.
Empirical recommendation based on experiments showing batch-history-induced bias and mitigation via fresh contexts and balanced histories (reported as practical guidance).
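The two mitigations in this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `toy_judge` is a hypothetical stand-in for a model call that drifts toward the majority label in its history, and "balancing the history" is interpreted here as showing the same number of prior items per label.

```python
import random
from collections import defaultdict


def evaluate_fresh(items, judge):
    """Score each item in its own context: the judge sees no prior items."""
    return [judge(item, history=[]) for item in items]


def balanced_history(pool, per_label, seed=0):
    """Pick the same number of prior (item, label) pairs for every label.

    Raises ValueError if any label has fewer than per_label examples.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in pool:
        by_label[label].append((item, label))
    history = []
    for label in sorted(by_label):
        history.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(history)  # avoid grouping all of one label together
    return history


def toy_judge(item, history):
    """Hypothetical judge: copies the majority label of a lopsided history."""
    labels = [label for _, label in history]
    if labels:
        top = max(set(labels), key=labels.count)
        if labels.count(top) / len(labels) > 0.75:
            return top  # simulated bias from a skewed batch history
    return "pass" if item >= 0.5 else "fail"


pool = [(0.9, "pass")] * 8 + [(0.1, "fail")] * 2
items = [0.2, 0.7]

fresh = evaluate_fresh(items, toy_judge)
skewed = [toy_judge(x, pool[:8]) for x in items]
balanced = [toy_judge(x, balanced_history(pool, per_label=2)) for x in items]
print(fresh, skewed, balanced)
```

In this toy setup the skewed history flips the first item to "pass", while both the fresh-context run and the balanced-history run return the unbiased labels.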
Gemini converts more turns into deep conquest chains, even though it is not the cleanest runtime.
Trace analysis from the provider championship indicating a higher rate of turns leading to deep conquest sequences for Gemini, along with observations about runtime cleanliness/reliability.
Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches.
Analysis of saved planning traces from the provider championship showing higher frequency of terminal-objective references by Gemini and an increasing trend near victory.
A regional integration strategy is critical to achieving coordinated development of digital talent agglomeration and industrial digitalization and thereby promoting regional economic growth.
Policy implication offered by the authors, motivated by regional heterogeneity in empirical results (e.g., positive interaction in Yangtze River Delta versus deviations elsewhere). This is presented as a recommendation rather than a directly tested causal claim.
People are increasingly turning to AI assistance for simple tasks (e.g., arithmetic, spell-check, answering simple questions).
Background/contextual statement in the paper's abstract; not tied to the paper's reported empirical studies within the abstract itself.
Depending on context, AI can either complement human skill development by amplifying independent reasoning or act as a substitute that undermines such reasoning; therefore regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.
Interpretation and policy implication drawn from the controlled experiment's observed variation by AI usage intensity and informativeness (experimental details and sample size not provided in abstract).
Engagement rises to 1.35 baseline.
Reported engagement metric in paper based on telemetry; phrasing in paper is ambiguous ('rises to 1.35 baseline').
Prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique [compiling procedures into model weights / subterranean agents] works.
Citation/listing of five prior systems (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) asserted to demonstrate the approach; empirical/experimental results in those prior works are invoked as support.
Recent work has shown this [orchestration] architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a].
Citation to Dennis et al., 2026a; claim refers to experimental results in that prior work comparing orchestration vs. providing procedures in frontier model system prompts on procedural tasks.