Evidence (14922 claims)
Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other, use the Evidence Explorer instead.
The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).
Browse by theme
Nine broad, paper-level topics. Click one to filter the claims below.
| Theme | Claims |
|---|---|
| Adoption | 9047 |
| Productivity | 8066 |
| Governance | 7278 |
| Human-AI Collaboration | 6912 |
| Org Design | 4439 |
| Innovation | 4359 |
| Labor Markets | 3652 |
| Skills & Training | 3018 |
| Inequality | 2160 |
Claims by outcome category
Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 795 | 210 | 105 | 955 | 2131 |
| Governance & Regulation | 886 | 414 | 197 | 126 | 1654 |
| Organizational Efficiency | 826 | 204 | 129 | 87 | 1257 |
| Technology Adoption Rate | 681 | 259 | 128 | 110 | 1189 |
| Research Productivity | 464 | 138 | 65 | 349 | 1028 |
| Output Quality | 503 | 196 | 61 | 53 | 813 |
| Decision Quality | 351 | 180 | 84 | 51 | 673 |
| AI Safety & Ethics | 238 | 288 | 71 | 34 | 637 |
| Firm Productivity | 455 | 58 | 92 | 20 | 631 |
| Market Structure | 186 | 172 | 123 | 25 | 511 |
| Task Allocation | 222 | 70 | 76 | 34 | 407 |
| Innovation Output | 238 | 28 | 48 | 18 | 334 |
| Skill Acquisition | 177 | 62 | 62 | 17 | 318 |
| Employment Level | 107 | 57 | 108 | 13 | 287 |
| Fiscal & Macroeconomic | 135 | 72 | 44 | 26 | 284 |
| Firm Revenue | 172 | 50 | 28 | 5 | 256 |
| Consumer Welfare | 121 | 68 | 45 | 12 | 246 |
| Task Completion Time | 183 | 33 | 10 | 13 | 240 |
| Inequality Measures | 45 | 126 | 50 | 6 | 227 |
| Worker Satisfaction | 95 | 74 | 23 | 12 | 204 |
| Error Rate | 77 | 98 | 11 | 4 | 190 |
| Regulatory Compliance | 84 | 73 | 17 | 7 | 181 |
| Automation Exposure | 61 | 61 | 27 | 14 | 166 |
| Training Effectiveness | 98 | 21 | 14 | 19 | 154 |
| Wages & Compensation | 78 | 37 | 25 | 6 | 146 |
| Developer Productivity | 105 | 18 | 14 | 6 | 144 |
| Team Performance | 87 | 17 | 28 | 10 | 143 |
| Job Displacement | 12 | 83 | 23 | 1 | 119 |
| Hiring & Recruitment | 53 | 8 | 8 | 3 | 72 |
| Social Protection | 39 | 17 | 8 | 2 | 66 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 50 | 6 | 1 | 62 |
| Labor Share of Income | 17 | 20 | 17 | — | 54 |
| Worker Turnover | 15 | 15 | — | 3 | 33 |
| Industry | — | — | — | 1 | 1 |
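The direction counts above are easier to compare across categories as shares of each row's total. The snippet below is a minimal sketch of that arithmetic for three rows copied from the table; it is not part of the site's tooling. In some rows the four direction counts add up to less than the Total (for "Other", 795 + 210 + 105 + 955 = 2065 against a total of 2131). The page does not say what the remainder is, so the sketch reports it under the hypothetical label `unlisted`; the function name `direction_shares` is likewise an illustration.

```python
# Rows copied from the table above: (positive, negative, mixed, null, total).
rows = {
    "Wages & Compensation": (78, 37, 25, 6, 146),
    "Job Displacement": (12, 83, 23, 1, 119),
    "Other": (795, 210, 105, 955, 2131),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the row total.

    'unlisted' is an assumed label for whatever part of the total is
    not covered by the four direction columns.
    """
    counts = {
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
    }
    shares = {name: count / total for name, count in counts.items()}
    shares["unlisted"] = (total - sum(counts.values())) / total
    return shares


summary = {name: direction_shares(*row) for name, row in rows.items()}

for name, shares in summary.items():
    rounded = {direction: round(share, 3) for direction, share in shares.items()}
    print(name, rounded)
```

Read this way, about 53% of the Wages & Compensation claims are positive, while roughly 70% of the Job Displacement claims are negative.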
The reasoning preset (1:5 input:output) elevates frontier closed models that the chat preset penalizes on price.
Observed leaderboard reordering when using a reasoning workload preset (1:5) compared to the chat preset; specific elevation of frontier closed models noted (no numeric counts provided in excerpt).
To our knowledge, this is the first demonstration of an AI agentic system autonomously identifying and experimentally validating a nontrivial, previously unreported physical mechanism.
Authors' novelty claim, supported by the reported autonomous proposal and experimental validation of the optical bilinear interaction in their study.
Qiushi Engine converts an abstract coherence-order theory into experimental observables, providing the first observation of this class of coherence-order structure.
Reported experimental procedure translating coherence-order theory into measurable observables and claiming the first observation of that class of structure; experimental data and analysis presented in the paper supporting the observation.
Gradient attribution is established as a computationally validated signal for model-informed reward allocation in participatory weather sensing.
Synthesis/conclusion in paper based on the computational experiments and evaluations (results across >400 configurations demonstrating fidelity and limitations).
Attribution captures near-optimal sensor placement utility with monotonically faithful payments.
Comparative experiments in the paper showing that gradient attribution corresponds closely to near-optimal sensor placement utility and yields monotonically faithful payment signals (experimental comparisons to optimal/benchmark placements).
Principled symbolic abstraction bridges generative AI and the numerical precision required for engineering design.
Broader conclusion drawn by the authors based on the reported modular architecture, symbolic lifting operator, and experimental improvements in geometric error and structural validity.
These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations.
Evaluation experiments reported in the paper claiming statistically/qualitatively significant improvements in agent performance on in-domain and out-of-domain productivity benchmarks after training on simulation-generated signals.
Embedding governance into agent reasoning produces more consistent, explainable, and auditable compliance than external enforcement.
Comparative claim asserted in the paper, apparently supported by the reported production deployment results (95% compliance, zero false escalations); explicit experimental comparison details are not provided in the abstract.
Technology has increased efficiency in organisations based in large cities in India.
Review result statement claiming observed efficiency gains in urban organisations according to the literature summarized; based on reviewed studies (no single sample size reported in excerpt).
Controversial questions frequently result in an AIO.
Analysis of the 11,500-query benchmark with annotation/identification of 'controversial' queries and observed higher incidence of AIO generation for those queries.
AI-enabled process capability contributes to sustained enterprise value growth.
Authors report empirical associations between PI (AI-enabled process capability) and measures tied to enterprise value (Feltham–Ohlson based abnormal earnings / profitability) across the panel sample.
Prompt modifications, Chain-of-Thought (CoT) reasoning, and visual token reduction can mitigate visual-priming effects on VLM behavior (with varying effectiveness across models).
Intervention experiments applying prompt engineering, CoT-style prompts, and reducing the number of visual tokens to observe whether these interventions reduce the influence of image content and color cues on IPD choices across several VLMs. (Abstract states these mitigation strategies were explored and their effectiveness varied by model; precise quantitative mitigation effects not provided in abstract.)
Generative AI is increasingly embedded in China's short-video production.
Authors' background claim supported by qualitative data collection with 16 in-depth interviews of short-video creators active on Xiaohongshu and Douyin; observational grounding in participant reports.
Reliability did not come from the base model alone; it emerged from the operating layer around the model (prompt compilation, typed controls, policy validation, execution guards, memory design, and trace-level observability).
Comparative analysis of system design and observed failure modes during pre-launch testing and deployment; qualitative/operational reasoning linking operating-layer components to improved reliability.
The proposed, validated model can equip fintech managers and regulators with a governance-based approach to tackling algorithmic bias and better position them to engender trust and financial inclusion.
Concluding assertion based on the integrated framework developed from the SLR (45 papers) and the structured five-expert validation; positioned as the intended practical utility of the model rather than an empirically measured outcome.
Participants showed strong willingness to substitute human IT support at costs well below human benchmarks.
Participant responses and willingness-to-pay / substitution questions collected in the controlled study (reported qualitatively in the paper); comparison to unspecified human-cost benchmarks.
Claude 3.5 Sonnet aligns with a narrative funder profile, showing greater responsiveness to qualitative aspects of the pitch, somewhat higher funding levels, and strong cross-run reliability.
Comparative observations across the experiment: Claude 3.5 Sonnet was more responsive to qualitative information in pitch decks, tended to recommend higher funding levels, and demonstrated strong reliability across runs.
Agentic Architect is the first end-to-end open-source framework for agentic AI architecture exploration and optimization.
Authors' claim in the abstract and paper asserting novelty and open-source release. No independent verification provided in the abstract.
Our results establish C2C as a testbed for studying and building LM-based agents that can navigate the sophisticated coordination required for real-world deployments.
Authors' interpretation/implication based on the experiments and dataset produced (conclusion statement).
The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.
Conclusion asserts the deployed system can support interactive search, cohort generation, and downstream LLM applications and that it does not require specialized informatics expertise for these uses.
Health-system-scale semantic search is both technically and operationally feasible.
Authors conclude feasibility based on successful deployment, measured latency, cost, retrieval quality, and clinical utility experiments.
The paper's findings provide practical guidance for selecting between joint and modular training modalities based on environmental conditions to optimize reinforcement learning–based scheduling performance.
Authors' stated implication/conclusion based on their sensitivity analysis and comparative evaluations across environmental regimes.
Foundation models are strong potential solutions for scalable and generalizable forecasting in the energy domain, particularly in data-constrained and privacy-sensitive settings.
Synthesis and interpretation of benchmark results showing generalization across datasets and better performance in scenarios with limited data; argument made in paper conclusions.
These results establish a practical pathway for extending industrial automation with learning-based methods.
Authors' concluding claim based on the reported deployment results (interpretation/implication rather than a new empirical measurement).
The paper proposes a safety-oriented inductive bias for rational AI decision-makers whose desiderata align with implementable policy constraints in high-stakes, low-signal situations.
Theoretical proposal and normative argument in the paper linking the proposed inductive bias (negligibility threshold and associated norms) to policy-implementable constraints; argued rather than empirically demonstrated.
These patterns are consistent with transfer emerging through accumulated interaction between owners (or owners' computer environments) and their agents in everyday use.
Interpretation offered by the authors based on observed alignment patterns and robustness checks; the paper argues consistency with an interaction-driven transfer mechanism rather than providing a direct experimental causal test.
This transfer persists among agents without explicit configuration.
Subgroup analyses (described in paper) isolating agents lacking explicit configuration settings and comparing behavioral alignment to owners; reported persistence of alignment in that subgroup.
Trade unions have increasingly pursued algorithmic transparency and stronger technology governance rights through collective bargaining, and governments are accelerating legislative initiatives to establish and protect workplace technology rights.
Descriptive review of labor-movement responses and recent government legislative initiatives reported in the literature (case studies and policy reviews).
Using these artifacts shifts human effort toward higher-level design and validation activities.
Reported as a preliminary finding from the exploratory evaluation; the abstract states that human effort shifted from low-level implementation to higher-level design/validation when artifacts were embedded (no sample size or time-allocation metrics provided).
Embedding machine-readable requirements and architectural artifacts stabilizes agent behavior.
Reported as a preliminary finding from the exploratory evaluation comparing approaches; the abstract states that embedding such artifacts stabilizes agent behavior (no numeric metrics or sample size reported).
The central obstacle to agent self-improvement is not what to remember but how to use what has been remembered (which retrieval policy to apply, how to interpret prior outcomes, and when the current strategy itself must change).
Conceptual claim supported by authors' argumentation and by the experimental results (ablation showing gains from reflection/use mechanisms rather than added architectural complexity).
Visibility mechanisms, such as public algorithm registers or role-sensitive explainability, can be effective tools in regaining citizen trust.
Review examines studies on transparency/visibility mechanisms; abstract states these mechanisms are examined for effectiveness but does not report definitive quantitative results or study counts.
By capturing complete interaction traces with human vs. agent code authorship attribution, SWE-chat provides an empirical foundation for moving beyond curated benchmarks towards an evidence-based understanding of how AI agents perform in real developer workflows.
Claims about dataset capabilities and intended use: the dataset contains interaction traces and authorship labels enabling empirical research; asserted by authors as an implication of the dataset contents.
Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures.
Stated observation/argument in the paper's introduction; no empirical sample size or systematic industry survey reported in the abstract.
An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.
Statement in the paper about an open-source tool released alongside the research; likely demonstration or software repository provided.
AI adoption enhances the reliability of financial reporting and the effectiveness of audits by reducing information asymmetry and strengthening internal monitoring processes.
Argument grounded in theory and supported empirically via SEM showing AI adoption associated with greater reporting transparency and internal control quality, which are linked to higher audit quality.
AI-enabled reporting systems strengthen firm-level governance mechanisms (e.g., reporting transparency and internal controls), which enhances audit quality (governance substitution perspective complemented by institutional and technology diffusion theories).
Theoretical framing (governance substitution, institutional and technology diffusion theories) combined with empirical SEM results linking AI adoption to proxies for governance (reporting transparency, internal control quality) and to audit quality.
Differences in institutional quality, digital infrastructure, and absorptive capacity explain the disparity in technology impacts between GCC and non-GCC countries.
Exploratory/mediation or interaction analysis linking institutional quality, measures of digital infrastructure, and absorptive capacity to heterogeneity in estimated technology effects across countries in the panel.
The capital market evaluates AI investment as a future 'growth option' selectively in industries with strong data infrastructure, digital workforce readiness, and absorptive capacity.
Inference from heterogeneous positive Tobin's Q effect found in the ICT industry and null average effect across all firms; authors argue market valuation responds to industry-specific complementary assets and ecosystem conditions.
Pair programming between students is well studied and known to be beneficial to self-efficacy and academic achievement.
Background literature claim presented in the paper's introduction (cites existing research on pair programming benefits).
Developing and further developed countries only integrate with China, signaling China's expanding influence over the international AI research landscape.
Observed integration patterns in the publication-based collaboration and citation networks showing that (some) developing and further developed countries connect primarily with China rather than the US; comparison to randomized networks.
The calibration mapping suggests Google and OpenAI face conditions most conducive to foreclosure.
Outcomes of the paper's stylized calibration/comparative mapping across four providers (April 2026 data); authors' interpretation.
Artificial intelligence algorithms are increasingly used by firms to set prices.
Statement in paper's introduction/abstract referencing prior adoption trends; no specific empirical study or sample reported in the excerpt.
The proposed approach aligns machine learning with actuarial portfolio optimization by explicitly integrating profit-driven objectives and operational constraints, offering two practical and scalable solutions for risk-based decision-making in real-world insurance settings.
Conceptual claim supported by the combination of methodological design and empirical results presented in the paper (method descriptions + experimental validation).
The balanced ensemble provides the most favourable trade-off between predictive performance, robustness, interpretability, and computational efficiency, making it suitable for deployment in regulated insurance environments.
Authors' synthesis of experimental results (performance, robustness tests, interpretability considerations, and computational efficiency measurements) and discussion regarding regulatory deployment suitability.
These variables (education, gender inclusiveness, digital literacy, perceived fairness) are mutually dependent and the use of AI combined with inclusive policies is necessary to sustainably realize financial inclusion.
Paper asserts mutual dependence based on SEM results and provides a policy recommendation that AI plus inclusive policies are necessary, citing prior literature (Salami et al., 2025; Berg et al., 2019; Fuster et al., 2021).
Synthetic experiments complement the theoretical results and showcase the benefits of collective action across different market regimes.
Simulation-based experiments described in the paper (synthetic experiments across market regimes). Paper does not report a real-world sample size; results are from computational experiments.
Spatial heterogeneity: Eastern regions are driven by knowledge recombination opportunities.
Reported spatial heterogeneity findings indicating Eastern China’s diffusion is driven more by recombination/opportunity measures than by reliance on core hubs.
Spatial heterogeneity: Western regions rely heavily on core technological hubs.
Spatial analysis / heterogeneity results reported by region indicating Western China depends on core technological hubs as diffusion sources or anchors.
Heterogeneity analysis: market-driven enterprises heavily rely on high-value core technologies.
Reported heterogeneity results indicating enterprises (market-driven actors) concentrate on and depend upon core, high-value technologies within identified diffusion paths.