Evidence (13827 claims)
Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
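To compare rows of the matrix, counts can be turned into shares of each row's total. The sketch below copies four rows from the table; the function name `direction_shares` and the "unlisted" bucket are illustrative assumptions. In some rows the four direction columns sum to less than the stated Total (for example, Other: 1930 listed versus 1979), so the remainder is reported separately rather than silently dropped.

```python
# A few rows copied from the matrix above: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Inequality Measures": (44, 122, 49, 6, 221),
    "Other": (749, 195, 97, 889, 1979),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the row total, plus any unlisted remainder."""
    listed = positive + negative + mixed + null
    return {
        "positive": positive / total,
        "negative": negative / total,
        "mixed": mixed / total,
        "null": null / total,
        # Assumption: claims counted in Total but in none of the four columns.
        "unlisted": (total - listed) / total,
    }


for outcome, counts in ROWS.items():
    shares = direction_shares(*counts)
    print(outcome, {name: round(value, 3) for name, value in shares.items()})
```

Dividing by the stated Total rather than by the sum of the four columns keeps the shares comparable with the table as published. Either denominator is defensible.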
Artificial intelligence enhances analytics, automates routine tasks, personalizes interactions, and supports decision-making.
Aggregate finding reported in the abstract based on thematic synthesis of the reviewed literature (160 articles).
There are convergent patterns of AI adoption in human resources, marketing and customer services, logistics, and finance.
Synthesis claim from the systematic review of the 160 included peer-reviewed articles as reported in the abstract.
Clear specifications, explicit governance, and ongoing human-AI collaboration are critical for successful scaling of regression automation.
Conclusions and recommendations derived from the case study's lessons and mixed-method evaluation.
The Copilot achieves 30-50% code reuse when generating candidate test scripts.
Quantitative result reported in the paper's evaluation (30-50% code reuse, as stated in the abstract/summary).
Mixed-method evaluation shows the AI accelerates script authoring and increases throughput.
Empirical claim based on the paper's mixed-method evaluation (qualitative and quantitative data reported in the case study); specific sample sizes not provided in the summary.
Automated regression testing is essential for maintaining rapid, high-quality delivery in Agile and Scrum organizations.
Introductory/position statement in the paper; general premise motivating the case study (no specific empirical test reported).
AIBuildAI ranks first on MLE-Bench with a medal rate of 63.1%, outperforming all existing baseline methods and matching the capability of highly experienced AI engineers.
Empirical evaluation on MLE-Bench reported in the paper (benchmark ranking, metric = medal rate).
AIBuildAI adopts a hierarchical agent architecture in which a manager agent coordinates three specialized sub-agents: a designer for modeling strategy, a coder for implementation and debugging, and a tuner for training and performance optimization; each sub-agent is itself an LLM-based agent capable of multi-step reasoning and tool use, enabling end-to-end automation of the AI model development process that goes beyond the scope of existing AutoML approaches.
System architecture description in the paper (methods/architecture section).
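The manager-plus-three-sub-agents layout described above can be pictured with a toy sketch. Everything here is hypothetical: the stub functions stand in for LLM-based agents, the shared `state` dictionary is an assumed interface, and the fixed designer, coder, tuner order is an assumption. The paper's manager may dispatch, retry, or loop in ways this sketch does not capture.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM-based sub-agent: reads shared state, returns an update.
SubAgent = Callable[[Dict[str, str]], Dict[str, str]]


def designer(state: Dict[str, str]) -> Dict[str, str]:
    """Stub for the modeling-strategy agent."""
    return {"plan": f"model plan for: {state['task']}"}


def coder(state: Dict[str, str]) -> Dict[str, str]:
    """Stub for the implementation-and-debugging agent."""
    return {"code": f"implementation of ({state['plan']})"}


def tuner(state: Dict[str, str]) -> Dict[str, str]:
    """Stub for the training-and-optimization agent."""
    return {"result": f"tuned run of ({state['code']})"}


@dataclass
class Manager:
    """Coordinates sub-agents by passing a shared state through them in order."""
    sub_agents: List[SubAgent]
    log: List[str] = field(default_factory=list)

    def run(self, task: str) -> Dict[str, str]:
        state = {"task": task}
        for agent in self.sub_agents:
            state.update(agent(state))      # merge the sub-agent's output into shared state
            self.log.append(agent.__name__)  # record which sub-agent ran
        return state


manager = Manager([designer, coder, tuner])
final_state = manager.run("classify images")
print(final_state["result"])
```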
We introduce AIBuildAI, an AI agent that automatically builds AI models from a task description and training data.
Methodological contribution: system design and implementation described in the paper (introduction/methods).
This tension reveals a pattern we call 'bounded delegation': developers wanted AI to absorb the assembly work surrounding their craft, never the craft itself.
Interpretive result from the paper's qualitative thematic analysis of survey responses (n=860), labeled by the authors as the 'bounded delegation' pattern.
Developers wanted systems enforcing explicit authority scoping, provenance, uncertainty signaling, and least-privilege access throughout.
Reported constraints and desiderata from the thematic analysis of survey responses (n=860).
Developers wanted systems that embed quality signals earlier in their workflow to keep pace with accelerating code generation.
Thematic findings from the paper's human-in-the-loop, multi-model council-based analysis of survey responses (n=860).
Using a human-in-the-loop, multi-model council-based thematic analysis, we identify 22 AI systems that developers want built across five task categories.
Qualitative analysis method described in the paper applied to the survey responses (n=860); result reported as identification of 22 desired AI systems organized into five categories.
For listed firms, AI patents command a robust market-value premium in both countries.
Firm-level analysis linking AI patenting to market valuation for listed firms in both countries (regression or valuation analysis implied by statement).
China surpasses the United States in recent annual AI patent counts.
Time-series patent count comparison using classifier-applied corpora (paper reports that recent annual counts are higher for China than the U.S.).
There is broad convergence in AI patenting intensity and subfield composition between the United States and China.
Comparative analysis of AI patenting intensity and subfield composition across the two patent corpora (US 1976-2023, China 2010-2023) reported in paper.
Applying the classifier to granted U.S. patents (1976-2023) and Chinese patents (2010-2023), we document rapid growth in AI patenting in both countries.
Application of classifier to full corpora of granted U.S. patents (1976-2023) and Chinese patents (2010-2023); time-series counts of AI patents reported.
The classifier generalizes well to Chinese patents based on citation and lexical validation.
Validation analyses described as citation-based and lexical validation applied to Chinese patents (paper states generalization to Chinese patents via these validation methods).
Our classifier substantially improves the existing USPTO approach, achieving 97.0% precision, 91.3% recall, and a 94.0% F1 score.
Reported classifier evaluation metrics (precision, recall, F1) presumably on held-out test data; comparison stated against the existing USPTO approach.
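As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The sketch below plugs in the rounded figures from the claim; the function name is illustrative. From rounded inputs the result is about 0.9406, which agrees with the reported 94.0% to within rounding of the published precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the claim above.
f1 = f1_score(0.970, 0.913)
print(f"F1 from rounded inputs: {f1:.4f}")
```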
We develop a high-precision classifier to measure artificial intelligence (AI) patents by fine-tuning PatentSBERTa on manually labeled data from the USPTO's AI Patent Dataset.
Methodological description in paper: fine-tuning PatentSBERTa on manually labeled USPTO AI Patent Dataset (manually labeled training data and model fine-tuning stated).
The results demonstrate the importance of considering interacting systems of AI agents when doing both capabilities and safety research.
Authors' interpretation/generalization based on experimental findings comparing multi-agent organizations and single agents across tasks and settings.
BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility.
Design claim in abstract describing the benchmark's automated scoring system and rubric size (100+ criteria) defined by expert bankers.
Substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments.
Conclusion drawn from the authors' empirical study and the reported final-system performance; presented as a general methodological claim (supporting data referenced in paper but not detailed in excerpt).
The final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase.
Quantitative performance result reported for the mature phase of the system in the paper's abstract; Sharpe ratio provided as a single-number metric (no sample size, number of trading periods, or statistical significance reported in the excerpt).
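For readers unfamiliar with the metric, an annualized Sharpe ratio is conventionally the mean excess return divided by its standard deviation, scaled by the square root of the number of periods per year. The sketch below assumes daily returns (252 periods per year) and a zero risk-free rate; the sample returns are made up and have no connection to the paper's data, which the excerpt does not provide.

```python
import math
import statistics


def annualized_sharpe(period_returns, risk_free_per_period=0.0, periods_per_year=252):
    """Mean excess return over sample standard deviation, scaled to a yearly horizon."""
    excess = [r - risk_free_per_period for r in period_returns]
    mean = statistics.fmean(excess)
    stdev = statistics.stdev(excess)  # sample standard deviation (n - 1 denominator)
    if stdev == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return mean / stdev * math.sqrt(periods_per_year)


# Made-up daily returns, for illustration only.
sample_returns = [0.004, -0.002, 0.003, 0.001, -0.001, 0.002]
print(round(annualized_sharpe(sample_returns), 2))
```

A single ratio says nothing about how many periods it was estimated from, which is why the missing sample size noted above matters when interpreting the 2.06 figure.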
The MAS abandoned overfitted hallucinations in favor of the Strict Test-Driven Agentic Workflow (STDAW), which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified ≥95% code coverage constraint matrix.
Design and outcome claim in the paper: introduction of STDAW/RO-Lock and reported enforcement of a ≥95% code coverage constraint as part of the aligned architecture (qualitative + a coverage threshold stated).
The system evolved from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture over the course of the study.
Reported longitudinal observations from the 20-month empirical study described in the paper (qualitative system evolution claim; no numeric counts provided in excerpt).
We introduce Out-of-Money Reinforcement Learning (OOM-RL): deploying agents into the non-stationary, high-friction reality of live financial markets to utilize capital depletion as an un-hackable negative gradient.
Methodological claim / novel paradigm introduced by the paper; described as implemented in the study (no numerical sample size given in excerpt).
Established regional telcos and banks are leveraging proprietary data to develop digital loan products.
Observations and interviews from the nine-month ethnography describing practices of regional telcos and banks in Nairobi developing digital loan products using proprietary data.
For reproducibility all our data and code are provided at https://github.com/scaleapi/scipredict
Explicit reproducibility statement and URL provided in the paper.
SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process?
Statement of research goals and scope in the paper introducing the SciPredict benchmark and accompanying evaluations.
Human experts demonstrate strong calibration: their accuracy increases from ≈5% to ≈80% as they deem outcomes more predictable without conducting the experiment.
Reported stratified accuracy of human experts on SciPredict tasks by self-reported predictability judgments; accuracy rises from ≈5% (when judged not predictable) to ≈80% (when judged predictable).
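The calibration pattern described above amounts to grouping predictions by the predictor's own predictability rating and computing accuracy within each group. The sketch below shows that computation on made-up records; the 1-3 rating scale and the function name are assumptions, not the benchmark's actual format.

```python
from collections import defaultdict


def accuracy_by_predictability(records):
    """records: iterable of (predictability_rating, was_correct) pairs."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for rating, correct in records:
        counts[rating] += 1
        hits[rating] += int(correct)
    # Accuracy within each rating level, ordered from least to most predictable.
    return {rating: hits[rating] / counts[rating] for rating in sorted(counts)}


# Made-up data on an assumed 1 (not predictable) to 3 (predictable) scale.
toy_records = [(1, False), (1, False), (2, True), (2, False), (3, True), (3, True)]
by_rating = accuracy_by_predictability(toy_records)
print(by_rating)
```

A well-calibrated predictor shows accuracy rising with the rating, as the claim reports for human experts.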
We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry.
Construction of the SciPredict benchmark described in the paper; explicitly reports 405 tasks and 33 sub-fields.
The future of Nagpur's industrial belt depends not on resisting automation, but on an aggressive reskilling strategy to bridge the gap between current workforce capabilities and future technological requirements.
Normative policy conclusion in the paper recommending reskilling as the primary response; based on the paper's analysis of task changes and projected role shifts; no program evaluation or empirical evidence of reskilling effectiveness reported in the excerpt.
There is a projected surge in demand for 'AI-collaborative' roles such as machine maintenance, data supervision, and process optimization.
Projection in the paper based on analysis of task complementarities between humans and AI, listing specific roles expected to grow; no quantitative demand estimates or sample sizes provided in the excerpt.
The paper documents 14 deliberate conservative assumptions — including frozen base GDP, no AI-on-AI compounding, a permanent friction floor, and conservative capture rates — all of which directionally understate the benefit.
Paper lists 14 conservative modeling assumptions and claims they bias results downward (i.e., understate potential benefits).
Even excluding demand expansion and robotics layers entirely, the direct productivity contribution alone reaches approximately $940 billion per year by 2036.
Model output reported in the paper when removing demand expansion and robotics layers.
In all four scenarios, cumulative net GDP exceeds cumulative AI infrastructure investment before 2036, with the base case achieving payback in 2033.
Model financial calculation comparing cumulative net GDP uplift to cumulative AI infrastructure investment across scenarios; explicit payback year reported for base case.
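The payback comparison described above reduces to finding the first year in which cumulative net GDP uplift exceeds cumulative investment. The sketch below uses placeholder year indices and amounts that are not taken from the paper; the function name is illustrative.

```python
from itertools import accumulate


def payback_year(years, annual_uplift, annual_investment):
    """Return the first year where cumulative uplift exceeds cumulative investment, else None."""
    cumulative_uplift = accumulate(annual_uplift)
    cumulative_investment = accumulate(annual_investment)
    for year, gained, spent in zip(years, cumulative_uplift, cumulative_investment):
        if gained > spent:
            return year
    return None


# Placeholder inputs: year indices and amounts in arbitrary units.
year_index = [1, 2, 3, 4]
uplift = [10, 30, 60, 120]      # cumulative: 10, 40, 100, 220
investment = [50, 50, 50, 50]   # cumulative: 50, 100, 150, 200
print(payback_year(year_index, uplift, investment))
```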
The base-case scenario yields approximately $1,057 billion in net annual GDP uplift by 2036, equivalent to 3.6 percent of 2024 GDP; the bear case produces $796 billion, the bull case $1,368 billion, and an agentic scenario produces $2,521 billion.
Model scenario outputs presented in the paper (four scenarios differentiated by capture rate and friction assumptions).
Sector-specific productivity gain percentages are anchored to published evidence, including a randomized controlled trial of GitHub Copilot (Kalliamvakou et al., 2023), JPMorgan CEO disclosures, and Cognizant's New Work New World 2026 research.
Paper states productivity percentages are anchored to published evidence and specifically cites Kalliamvakou et al. (2023) RCT, JPMorgan CEO disclosures, and Cognizant (2026).
A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
Design/implementation claim in paper describing deployment approach using YAML configuration rather than engineering work.
We introduce governability — how reliably a system knows when it should not act autonomously — as a primary evaluation axis for institutional AI alongside accuracy.
Conceptual contribution/metric proposed by authors in paper; no empirical validation reported in the excerpt.
Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set.
Empirical benchmark reported in paper on the 11-case evaluation set; counts of silent errors given for Cognitive Core and baselines.
Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve.
Empirical benchmark reported in paper on the 11-case evaluation set; accuracies explicitly stated for three systems.
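With only 11 cases, each percentage corresponds to a small whole number of correct decisions. Assuming accuracy means correct cases divided by total cases, rounded to the nearest whole percent, the sketch below recovers the implied counts: 10 of 11 for Cognitive Core, 6 of 11 for ReAct, and 5 of 11 for Plan-and-Solve. The function name is illustrative.

```python
def implied_correct(accuracy_pct: int, n_cases: int) -> list:
    """Counts of correct cases whose rounded percentage equals the reported figure."""
    return [k for k in range(n_cases + 1) if round(100 * k / n_cases) == accuracy_pct]


# Reported accuracies on the 11-case set.
reported = {"Cognitive Core": 91, "ReAct": 55, "Plan-and-Solve": 45}
implied = {name: implied_correct(pct, 11) for name, pct in reported.items()}
print(implied)
```

On a set this small, a one-case difference moves accuracy by about nine percentage points.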
We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.
Design/proposal described in paper (architectural specification); no empirical evaluation reported for the architecture itself in the excerpt.
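The tamper-evident SHA-256 hash chain mentioned above can be illustrated with a minimal sketch. It shows the general technique only: each entry's hash covers the previous entry's hash plus its own payload, so editing any earlier entry invalidates the chain. The entry layout, the genesis value, and the function names are assumptions, not the paper's ledger format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the start of the chain


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(ledger: list, payload: dict) -> None:
    """Add an entry linked to the hash of the current last entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"payload": payload, "prev": prev_hash,
                   "hash": entry_hash(prev_hash, payload)})


def verify(ledger: list) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


ledger = []
append(ledger, {"step": "retrieve", "case": "A-1"})
append(ledger, {"step": "govern", "decision": "escalate"})
ok_before = verify(ledger)            # chain is intact
ledger[0]["payload"]["case"] = "A-2"  # simulate tampering with an earlier entry
ok_after = verify(ledger)             # verification now fails
print(ok_before, ok_after)
```

Serializing the payload with sorted keys keeps the hash stable regardless of dictionary insertion order.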
Organisations should invest in customisation capabilities for AI recruitment tools, implement comprehensive change management strategies, and maintain robust post-hire evaluation procedures.
Authors' recommendations derived from thematic findings and participant perspectives across two firms (qualitative synthesis of n = 22 interviews).
AI functioned optimally as an augmentative technology rather than as a replacement for human decision-makers in recruitment.
Findings: participants across the two case firms described AI being most effective when augmenting human judgment rather than replacing it (interviews n = 22).
AI significantly enhanced efficiency through process standardisation and automation.
Findings based on participant accounts in thematic analysis (interviews n = 22) describing process optimisation and automation benefits.
The Principle of Maximum Heterogeneity reveals a convergence of complex phenomena across fields onto simple underlying design principles with important predictive value for future distributed production systems.
Synthesis claim in the paper arguing cross-field convergence and predictive value based on the theoretical model and conceptual examples; no empirical validation or forecasting trials reported.
The principles derived (including the Principle of Maximum Heterogeneity) can be used as a blueprint for constructing ideal distributed production systems; demonstrated by suggesting specific redesigns for compute systems executing large-scale AI.
Paper includes suggested redesigns for compute systems as demonstrations of the blueprint; these are proposed designs/illustrative applications rather than empirically validated interventions or trials.
The Principle of Maximum Heterogeneity applies recursively across all layers of nested production systems.
Theoretical claim within the paper arguing recursive applicability across nested system layers (e.g., neurons, firms, ecosystems); supported by conceptual reasoning and model exposition rather than empirical multi-layer tests.