Evidence (14922 claims)
Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.
The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).
Browse by theme
Nine broad, paper-level topics. Click one to filter the claims below.
Adoption (9047 claims)
Productivity (8066 claims)
Governance (7278 claims)
Human-AI Collaboration (6912 claims)
Org Design (4439 claims)
Innovation (4359 claims)
Labor Markets (3652 claims)
Skills & Training (3018 claims)
Inequality (2160 claims)
Claims by outcome category
Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 795 | 210 | 105 | 955 | 2131 |
| Governance & Regulation | 886 | 414 | 197 | 126 | 1654 |
| Organizational Efficiency | 826 | 204 | 129 | 87 | 1257 |
| Technology Adoption Rate | 681 | 259 | 128 | 110 | 1189 |
| Research Productivity | 464 | 138 | 65 | 349 | 1028 |
| Output Quality | 503 | 196 | 61 | 53 | 813 |
| Decision Quality | 351 | 180 | 84 | 51 | 673 |
| AI Safety & Ethics | 238 | 288 | 71 | 34 | 637 |
| Firm Productivity | 455 | 58 | 92 | 20 | 631 |
| Market Structure | 186 | 172 | 123 | 25 | 511 |
| Task Allocation | 222 | 70 | 76 | 34 | 407 |
| Innovation Output | 238 | 28 | 48 | 18 | 334 |
| Skill Acquisition | 177 | 62 | 62 | 17 | 318 |
| Employment Level | 107 | 57 | 108 | 13 | 287 |
| Fiscal & Macroeconomic | 135 | 72 | 44 | 26 | 284 |
| Firm Revenue | 172 | 50 | 28 | 5 | 256 |
| Consumer Welfare | 121 | 68 | 45 | 12 | 246 |
| Task Completion Time | 183 | 33 | 10 | 13 | 240 |
| Inequality Measures | 45 | 126 | 50 | 6 | 227 |
| Worker Satisfaction | 95 | 74 | 23 | 12 | 204 |
| Error Rate | 77 | 98 | 11 | 4 | 190 |
| Regulatory Compliance | 84 | 73 | 17 | 7 | 181 |
| Automation Exposure | 61 | 61 | 27 | 14 | 166 |
| Training Effectiveness | 98 | 21 | 14 | 19 | 154 |
| Wages & Compensation | 78 | 37 | 25 | 6 | 146 |
| Developer Productivity | 105 | 18 | 14 | 6 | 144 |
| Team Performance | 87 | 17 | 28 | 10 | 143 |
| Job Displacement | 12 | 83 | 23 | 1 | 119 |
| Hiring & Recruitment | 53 | 8 | 8 | 3 | 72 |
| Social Protection | 39 | 17 | 8 | 2 | 66 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 50 | 6 | 1 | 62 |
| Labor Share of Income | 17 | 20 | 17 | — | 54 |
| Worker Turnover | 15 | 15 | — | 3 | 33 |
| Industry | — | — | — | 1 | 1 |
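As a rough illustration of how the direction counts above can be read, the sketch below turns a few rows into shares of the listed Total. The rows are copied from the table. The function name `direction_shares` and the `unlabelled` key are hypothetical, introduced only for this example. In some rows the four direction counts add up to less than the Total (Firm Productivity: 625 of 631), and the page does not say why, so the sketch reports that remainder separately rather than guessing at its meaning.

```python
# Rows copied from the table above: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (455, 58, 92, 20, 631),
    "Job Displacement": (12, 83, 23, 1, 119),
    "Wages & Compensation": (78, 37, 25, 6, 146),
}


def direction_shares(counts):
    """Return each direction's share of the listed Total.

    'unlabelled' is a hypothetical name for whatever part of the Total
    is not covered by the four direction columns.
    """
    positive, negative, mixed, null, total = counts
    labelled = positive + negative + mixed + null
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        "unlabelled": (total - labelled) / total,
    }


if __name__ == "__main__":
    for outcome, counts in ROWS.items():
        shares = direction_shares(counts)
        summary = ", ".join(f"{name} {value:.1%}" for name, value in shares.items())
        print(f"{outcome}: {summary}")
```

Dividing by the listed Total rather than by the sum of the four columns keeps the shares comparable with the counts shown on the page. Either denominator is a defensible choice.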
Firms that successfully combine AI with learning and knowledge coordination can reduce inefficiencies, accelerate innovation cycles and improve overall performance.
Authors' conclusion and managerial implication derived from observed associations in the survey (AIDLC → KO → OI → IP).
AI can reduce knowledge gaps and help employees adapt to change; well-designed AI systems complement human creativity, improve judgment and reduce repetitive tasks rather than simply replacing workers.
Authors' discussion and normative claim drawing on study findings and literature; not presented as a directly tested causal result in the survey.
This LLM-based retrieval ensures that small creative variants from the advertiser yield consistent and explainable delivery results to the user.
The paper asserts that semantic-aware retrieval produces consistent and explainable delivery across small creative perturbations. It claims empirical support from online validation and experiments, but the excerpt provides no quantitative results.
The findings offer practical implications for corporate R&D strategies and innovation policy design in the era of AI.
Discussion/implications section asserting that the study's findings can inform corporate R&D and policy design.
The study elucidates the structural pathways of knowledge flow from science to technology in AI.
Combined analysis of patent–publication citation links and semantic topic mapping intended to reveal structural knowledge-flow pathways.
The analysis traces key technological trends in AI across the studied period.
Results from topic modeling and longitudinal analysis of patent and cited-publication topics across 2002–2021.
Text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework (claimed as a first-time result).
Synthesis claim based on the collection of experiments across the six tasks and ablations reported in the paper; presented as a novel, unifying demonstration.
The self-evolving verification layer improves verifier reliability using execution-grounded feedback.
Design and experimental claim in the paper that the verification layer is self-evolving and that it enhances verifier reliability via execution-grounded feedback loops.
Human-governed collaboration is the most credible deployment paradigm.
Policy/recommendation from the paper based on cross-stage analysis and synthesis; not presented as the result of a controlled experiment in the excerpt.
Only RL-based predictions yield product-repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains.
Comparison of recommended repositioning decisions derived from RL versus those derived from observed (actual) trajectories and from heuristic models; reported that RL recommendations match actual-derived recommendations and produce similar estimated profit gains. No numerical profit figures or sample sizes are provided in the excerpt.
RL-based trajectories provide more accurate estimates of impulse purchase rates and shelf traffic densities than TSP and PNN.
Model-based comparisons against real-world trajectory data showing that outputs from RL more closely match observed impulse purchase rates and shelf traffic densities; specific quantitative comparisons and sample sizes not provided in the excerpt.
Extensive online analysis and A/B testing demonstrate GrowthGR's positive impact on the overall ecosystem value.
Paper reports extensive online analysis and A/B testing as supporting evidence (no further quantitative details or sample sizes provided in the excerpt).
Behavioral studies report that compact trajectories correlate with higher resolution rates.
Statement summarizing prior behavioral studies in SE literature (no specific study or sample size cited in excerpt).
Behavioral studies report that short error cascades correlate with higher resolution rates.
Statement summarizing prior behavioral studies in SE literature (no specific study or sample size cited in excerpt).
Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates (e.g., that a test step follows a code modification).
Statement summarizing prior behavioral studies in SE literature (no specific study or sample size cited in excerpt).
Hierarchical decomposition without deliberation achieves the best absolute performance for most models.
Observed performance rankings across the evaluated configurations and models (six models across five model families) in the CybORG CAGE-2 evaluation (3,475 episodes), comparing monolithic ReAct vs. delegation to specialized sub-agents with and without deliberation tools.
Effective AI implementation, coupled with employee training and transparent communication, can reduce resistance and anxiety among employees.
Interpretation and conclusion drawn from the observed negative relationship between perceived opportunities and challenges and the pattern of survey responses; presented as a recommended approach in the study.
Our work also highlights the benefits of legislation aimed at protecting individuals' data rights as a counterweight to the tech industry's discourse of exceptionalism, which obscures its dependence on BPOs to externalise labour costs and accountability.
Argument and empirical demonstration in paper that data-rights legislation (GDPR) enabled access to documents and exposed BPO practices; used to argue for policy benefits. (Empirical extent and generalizability not quantified in the excerpt.)
PRIF shifts forensic accounting from reactive detection to proactive prevention, advancing stakeholder trust and industry standards.
Paper's concluding claim about the conceptual shift and expected industry/stakeholder outcomes following PRIF adoption (argumentative/interpretive).
PRIF provides practical benefits including scalable toolkits for firms and policy guidance for regulators with a broader impact on financial governance.
Paper's discussion/recommendations claiming practical toolkits and policy guidance; asserted broader impact on financial governance.
Wage inequality increased due to differential skill adaptation across workers.
Authors' conclusion drawn from observed effects of AI adoption and skill transformation on wage dynamics in the SEM applied to the survey (n=320); statement presented qualitatively in the results/discussion (no inequality coefficient provided in the summary).
AI created opportunities by increasing demand for high-skilled labor.
Authors' interpretation of SEM results and descriptive analysis from the survey of n=320 employees indicating skill-upgrading effects; specific numerical evidence for 'demand for high-skilled labor' not reported in the summary.
Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems.
Claim in abstract that theoretical results can inform firm decisions; implies prescriptive insights derived from the framework (no empirical validation or sample size given in abstract).
Participants reported greater trust in the process under the same conditions where facilitators exerted directional influence on outcomes.
Post-task survey trust measures reported higher trust for facilitator conditions that also showed directional shifts in allocation outcomes (as measured above).
AI-assisted annotation has become standard in large-scale labeling workflows.
Background claim made in the paper's introduction as contextual motivation for the study (no specific evidence or data reported in the abstract).
"Augmented Intelligence" models, which combine human contextual judgment with algorithmic precision, reduce attrition by 22% compared with complete automation.
Reported comparative result in the paper's analysis (paper claims comparative attrition rates between augmented and fully automated approaches; exact data source not explicitly tied to one of the stated samples in the abstract).
The shift toward solo entry is particularly pronounced in categories that historically favored team-based ventures.
Category-level breakdowns within the Product Hunt dataset showing larger increases in solo-founder launches in categories with a historical bias toward team-based ventures.
Across the (lambda, kappa) grid both arms pass family-wise scenario-clustered correction (p<0.001 / p=0.008).
Statistical analysis across a grid of governance parameter settings (lambda, kappa) with family-wise scenario-clustered multiple-testing correction; p-values reported for both arms.
Societies have long governed opaque expertise through credentials, monitoring, liability, appeal, and revocation rather than mechanism-level explanation.
Historical/institutional claim made by the authors as conceptual evidence for alternative governance approaches (argument and analogy to existing institutions).
This paper connects formal fairness research with legal and ethical requirements to search for less discriminatory alternatives, offering a principled foundation for evaluating and comparing algorithmic decision systems.
Conceptual discussion linking the theoretical characterization of the Pareto frontier and fairness trade-offs to legal/ethical norms and decision-making practice; proposed framework for evaluation/comparison based on the derived results.
For lenders and investors, wider VTech adoption can enhance valuation accuracy, portfolio transparency and collateral risk assessment, strengthening confidence in property markets and capital allocation.
Interpretation and implications drawn from interview data and theoretical synthesis; no quantitative measurement reported in the study.
Switchcraft enables cost-aware agentic AI deployment without sacrificing correctness.
Synthesis claim based on Switchcraft achieving comparable accuracy (82.9%) while substantially reducing cost (84% reduction) in the paper's experiments.
Based on the findings, firms should invest in proprietary AI models and governments should promote open data initiatives.
Policy recommendations presented in the conclusion, motivated by empirical findings (inverted-U, homogenization trap, heterogeneity).
High-performing, human-comparable legal AI no longer requires the largest externally hosted models.
Conclusion/interpretation in paper based on the Olava Extract results outperforming/competing with frontier models while being self-hosted and smaller.
Fewer hallucinations and unsupported extractions reduce operational risk and downstream review burden in legal workflows.
Argument presented in paper linking lower hallucination/unsupported extraction rates to reduced operational risk and review burden; framed as an important distinction for legal workflows.
Olava Extract achieved the highest precision scores, producing fewer hallucinated and unsupported extractions.
Reported precision metrics and qualitative/quantitative statements about hallucination/unsupported extraction rates in the comparison against frontier models.
Smart manufacturing provides a practical pathway for enhancing economic performance while reducing environmental impact.
Framing/theoretical claim in the paper's introduction motivating the study; supported by cited literature rather than the paper's primary empirical DiD test.
Improvements in firms' resource allocation efficiency enhance their ability to adopt smart manufacturing technologies (mechanism).
Mechanism analysis within the study showing that gains in resource allocation efficiency at the firm level are associated with higher adoption of smart manufacturing after LCCP implementation.
City-level human capital upgrading lowers firms' costs of adopting smart manufacturing technologies, facilitating adoption (mechanism).
Mechanism analysis reported in the paper linking city-level human capital improvements to reduced firm-level adoption costs and increased adoption; likely based on city-level measures of human capital interacting with treatment in the DiD framework.
Generation-protocol variants show that crowding can be reduced through targeted design, making diversity collapse an actionable, development-time evaluation target for population-aware creative AI.
Experimental evidence in the paper demonstrating that modifying generation protocols (design choices) reduces crowding; abstract states results across protocol variants but does not provide quantitative effect sizes or sample counts.
Estimates stabilize with feasible model-only sample sizes.
Empirical/stability analysis reported in the paper (abstract claims convergence/stabilization of estimates with feasible numbers of model-only samples), but the abstract does not quantify what 'feasible' means or give sample counts.
Resource-based environmental taxation (the water resource tax reform) can play a role in promoting food security under rigid water constraints.
Interpretation and policy discussion based on the empirical results showing increased grain yield following the reform.
The reform improves water-use efficiency (a channel through which it raises agricultural productivity).
Mechanism analysis in the paper indicating strengthened water-use efficiency following the reform.
Trajectory-level evaluation is essential in regulated domains.
Conclusion drawn by the authors based on the ASR findings (hidden shortcuts, metric blind spots, and remediation gains); presented as a policy/recommendation implication.
A DLM (Schema-1) eliminates the preprocessing pipelines that currently stand between raw tabular data and AI systems that consume it.
Claim based on the model's native consumption of raw cell values and on experimental demonstrations. The design and reported evaluations suggest a reduced need for preprocessing; specific operational workflow impacts are not quantified in the abstract.
Schema-1 identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain—a task no prior tabular model can perform.
Reported experiments demonstrating industry-sector identification from raw cell values on unseen datasets and cross-domain reliability (details of datasets, number of domains, and metrics not provided in the abstract).
Reinforcement learning in post-training, now the dominant paradigm at the frontier, is structured around task completion and maps more directly onto the task-based architecture of occupational classifications than prior approaches.
Argument based on current ML research practices (framing claim about dominant technical paradigm) and theoretical mapping to task-based occupational taxonomies.
Future progress in AI-based software engineering depends on equipping agents with explicit architectural foresight so generated software is maintainable, not just functional.
Conclusion/recommendation based on the empirical findings (Reasoning-Complexity Trade-off and Volume-Quality Inverse Law) and failures of prompting and correctness to mitigate decay.
Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine.
Contextual claim motivating the work; presented as an empirical generalization about production agent pipelines, but not quantified in the abstract.
Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool-use work that dominates real agent pipelines.
Aggregate benchmark results across AgentFloor tiers showing high performance of smaller and mid-sized open-weight models on short-horizon structured tasks; supported by the 16,542 scored runs and model comparisons reported in the paper.