Evidence (6507 claims)

Claim counts by category:

- Adoption: 7395 claims
- Productivity: 6507 claims
- Governance: 5877 claims
- Human-AI Collaboration: 5157 claims
- Innovation: 3492 claims
- Org Design: 3470 claims
- Labor Markets: 3224 claims
- Skills & Training: 2608 claims
- Inequality: 1835 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 609 | 159 | 77 | 736 | 1615 |
| Governance & Regulation | 664 | 329 | 160 | 99 | 1273 |
| Organizational Efficiency | 624 | 143 | 105 | 70 | 949 |
| Technology Adoption Rate | 502 | 176 | 98 | 78 | 861 |
| Research Productivity | 348 | 109 | 48 | 322 | 836 |
| Output Quality | 391 | 120 | 44 | 40 | 595 |
| Firm Productivity | 385 | 46 | 85 | 17 | 539 |
| Decision Quality | 275 | 143 | 62 | 34 | 521 |
| AI Safety & Ethics | 183 | 241 | 59 | 30 | 517 |
| Market Structure | 152 | 154 | 109 | 20 | 440 |
| Task Allocation | 158 | 50 | 56 | 26 | 295 |
| Innovation Output | 178 | 23 | 38 | 17 | 257 |
| Skill Acquisition | 137 | 52 | 50 | 13 | 252 |
| Fiscal & Macroeconomic | 120 | 64 | 38 | 23 | 252 |
| Employment Level | 93 | 46 | 96 | 12 | 249 |
| Firm Revenue | 130 | 43 | 26 | 3 | 202 |
| Consumer Welfare | 99 | 51 | 40 | 11 | 201 |
| Inequality Measures | 36 | 105 | 40 | 6 | 187 |
| Task Completion Time | 134 | 18 | 6 | 5 | 163 |
| Worker Satisfaction | 79 | 54 | 16 | 11 | 160 |
| Error Rate | 64 | 78 | 8 | 1 | 151 |
| Regulatory Compliance | 69 | 64 | 14 | 3 | 150 |
| Training Effectiveness | 81 | 15 | 13 | 18 | 129 |
| Wages & Compensation | 70 | 25 | 22 | 6 | 123 |
| Team Performance | 74 | 16 | 21 | 9 | 121 |
| Automation Exposure | 41 | 48 | 19 | 9 | 120 |
| Job Displacement | 11 | 71 | 16 | 1 | 99 |
| Developer Productivity | 71 | 14 | 9 | 3 | 98 |
| Hiring & Recruitment | 49 | 7 | 8 | 3 | 67 |
| Social Protection | 26 | 14 | 8 | 2 | 50 |
| Creative Output | 26 | 14 | 6 | 2 | 49 |
| Skill Obsolescence | 5 | 37 | 5 | 1 | 48 |
| Labor Share of Income | 12 | 13 | 12 | — | 37 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
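As an illustration of how the matrix can be read, a minimal sketch in Python computes the share of positive findings per outcome category from a few rows copied out of the table above (only these three rows are used; everything else is as printed in the matrix):

```python
# Minimal sketch: share of positive findings per outcome category,
# using three rows copied from the evidence matrix above.
# Column order: (positive, negative, mixed, null).
rows = {
    "Task Completion Time": (134, 18, 6, 5),
    "Inequality Measures": (36, 105, 40, 6),
    "Job Displacement": (11, 71, 16, 1),
}
for outcome, counts in rows.items():
    positive, total = counts[0], sum(counts)
    print(f"{outcome}: {positive / total:.0%} positive of {total} claims")
```

This makes the directional skew visible at a glance: task-completion-time claims lean strongly positive, while inequality and displacement claims lean negative.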
Productivity
The modeled joules-per-correct-answer metric varies by a factor of 6.2 across endpoints.
Modeled energy estimates combined with task accuracy to compute joules per correct answer across 78 endpoints.
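The metric behind this claim divides modeled energy per query by task accuracy. A minimal sketch, with invented endpoint names and numbers (not the study's data):

```python
# Hypothetical sketch of the joules-per-correct-answer metric: modeled
# energy per query divided by task accuracy. Endpoint names and values
# are invented for illustration.
def joules_per_correct(joules_per_query: float, accuracy: float) -> float:
    """Energy cost of one *correct* answer; diverges as accuracy -> 0."""
    return joules_per_query / accuracy

endpoints = {
    "endpoint_a": (120.0, 0.80),  # (modeled joules/query, accuracy)
    "endpoint_b": (450.0, 0.75),
}
costs = {name: joules_per_correct(j, acc) for name, (j, acc) in endpoints.items()}
spread = max(costs.values()) / min(costs.values())
print(costs, f"spread = {spread:.1f}x")
```

Dividing by accuracy is what lets an energy-efficient but inaccurate endpoint come out worse than a costlier, more accurate one.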
Across 78 endpoints, the same model on different endpoints differs in tail latency by an order of magnitude.
Empirical tail-latency measurements across 78 endpoints serving 12 model families.
The same model on different endpoints differs in fingerprint similarity to the first-party reference by up to 12 points.
Empirical measurement of fingerprint (output-distribution) similarity to a first-party reference across the same set of endpoints (78 endpoints, 12 model families).
Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code.
Empirical measurement across 78 endpoints and 12 model families comparing mean accuracy on math and code tasks.
The rise of digital agents will transform the foundations of production, labour markets, institutional arrangements and the international distribution of economic power.
Synthesis and theoretical projection across sections of the paper; presented as a broad conclusion without reported empirical quantification in the provided text.
There is a fundamental asymmetry between economic and social reproduction: digital agents can compensate for productive functions of the population but are unable to substitute the population's functions of social reproduction.
Theoretical argument and conceptual distinction in the paper; no empirical study measuring substitution in social reproduction provided.
These patterns suggest that AI adoption is associated with expected efficiency gains that shape both firms' pricing behaviour and their macroeconomic expectations.
Interpretation based on observed increases in productivity/profitability and different pricing/inflation expectations among adopters vs non-adopters in survey and DID analyses.
AI adoption leads both to job displacement and job creation, including the emergence of new occupational categories.
Abstract states the review examines empirical evidence on both job displacement and creation and the emergence of new occupations; no numeric counts or sample sizes provided in abstract.
The study identifies short-term transitional risks and long-term productivity gains associated with AI integration in the workforce.
Abstract states the paper evaluates both short-term risks and long-term productivity gains from AI integration based on the reviewed literature; no empirical quantification given in abstract.
AI-driven automation and augmentation are reshaping employment landscapes, with emphasis on sector-level disruption, skill transformation, and socioeconomic consequences.
Abstract states this as a conclusion of the review drawing on interdisciplinary empirical literature; no specific studies or sample sizes cited in abstract.
The accelerating deployment of artificial intelligence across industries has fundamentally altered the structure of global labour markets.
Statement in abstract summarizing a systematic review of interdisciplinary literature (economics, computer science, organizational behaviour, public policy); no specific sample size reported in abstract.
The magnitude of AI’s effect on potential GDP varied across industries and depended on the level of digital maturity, human resources, and institutional conditions.
Decompositional analysis across aggregated industry data and scenario-based modeling drawing on sectoral sources and reviews.
Failures are structured by task family and execution surface: HR, management, and multi-system business workflows are persistent bottlenecks, while local workspace repair is comparatively easier but still unsaturated.
Error-mode analysis across the 105 tasks and evaluated models reported in experiments; authors identify task-family-level patterns (HR, management, multi-system workflows) and relative ease of local workspace repair.
Whether LLM-based assistants improve or degrade code quality remains unresolved: existing studies report contradictory outcomes contingent on context and evaluation criteria.
Review finds mixed/contradictory findings across included studies regarding code quality effects.
The system tends to be factually correct when it answers but often omits information (i.e., 'the system is right when it answers — it just leaves things out').
Interpretation combining reported factual accuracy (85.5%) with low completeness (0.40) from benchmark results.
The study establishes statistically significant relationships between organizational AI adoption and compensation dynamics.
Econometric estimates (difference-in-differences and propensity score matched comparisons) using the combined datasets listed in the paper and controlling for industry, firm size, geography, occupation characteristics, and macroeconomic variables.
The study establishes statistically significant relationships between organizational AI adoption and changes in occupational structures.
Same econometric approach (difference-in-differences and propensity score matching) applied to combined datasets (Anthropic Economic Index, Census Business Trends and Outlook Survey, Federal Reserve regional surveys, labor market analytics), with controls for industry, firm size, location, occupation-level characteristics, and macroeconomic environment.
The study establishes statistically significant relationships between organizational AI adoption and changes in employment patterns in the United States during 2022–2025.
Econometric analysis using multiple large-scale data sources (Anthropic Economic Index, U.S. Census Bureau Business Trends and Outlook Survey, Federal Reserve regional surveys, labor market analytics) and methods described as difference-in-differences estimation and propensity score matching controlling for industry (NAICS 2-digit), firm size, geography, occupation characteristics, and macro conditions.
The paper extends paradox theory to conceptualise the Creativity Paradox in the context of GenAI.
Theoretical extension and conceptual development within the paper (no empirical tests reported).
Within that n=11 subset, 9 of 11 agents shift by at least 2 ranks between composite and benchmark-only rankings.
Comparison of rank positions between composite and benchmark-only rankings on the 11-agent subset; reported count of agents that moved at least 2 ranks.
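The rank-shift comparison described above can be sketched as follows; agent names and rank positions are invented for illustration, not taken from the study:

```python
# Hypothetical sketch of the rank-shift check: given each agent's position
# under the composite and benchmark-only rankings, count agents whose rank
# moves by at least 2. Agent names and ranks are invented.
composite = {"a1": 1, "a2": 2, "a3": 3, "a4": 4, "a5": 5}
benchmark_only = {"a1": 3, "a2": 1, "a3": 5, "a4": 4, "a5": 2}

shifted = [a for a in composite if abs(composite[a] - benchmark_only[a]) >= 2]
print(f"{len(shifted)} of {len(composite)} agents shift by >= 2 ranks")
```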
The four factors capture largely complementary information (n=50; ρ_max = 0.61 for Adoption-Ecosystem, all others |ρ| ≤ 0.37).
Correlation analysis among the four factor scores computed on the 50-agent sample; reported maximum inter-factor Pearson/Spearman correlation coefficients.
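The complementarity check amounts to pairwise rank correlations among factor scores, reporting the maximum absolute off-diagonal value. A minimal sketch with invented factor names and scores (the rank helper assumes no tied values):

```python
# Hypothetical sketch of the inter-factor correlation analysis: pairwise
# Spearman correlations among factor scores; the maximum absolute value
# indicates how redundant the factors are. Data here are invented.
def spearman(x, y):
    def ranks(v):  # assumes no ties
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

factors = {
    "adoption":  [0.1, 0.4, 0.35, 0.8, 0.9],
    "ecosystem": [0.2, 0.3, 0.5, 0.7, 0.95],
    "safety":    [0.9, 0.2, 0.6, 0.1, 0.4],
}
names = list(factors)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
rho = {(a, b): spearman(factors[a], factors[b]) for a, b in pairs}
print(max(abs(v) for v in rho.values()))
```

A low maximum off-diagonal correlation is what supports the "largely complementary information" reading; here the invented adoption/ecosystem pair is deliberately the most correlated, mirroring the ρ_max structure in the claim.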
Provisioned Throughput delivers the lowest latency at low concurrency but saturates its reserved capacity above approximately 20 concurrent users.
Empirical measurements from the instrumented system across concurrency up to 50 users and tier comparisons; the paper reports the observed saturation point near ~20 concurrent users.
Delegating tasks to genAI can be individually beneficial in the short term even as widespread adoption degrades future model performance (creating a social dilemma).
Result of the paper's behavioral model showing an individual-level incentive to use genAI versus a collective cost from adoption (theoretical/model-based; no empirical sample reported in abstract).
Token usage is highly variable and inherently stochastic: runs on the same task can differ by up to 30x in total tokens.
Observed run-to-run variability in total token counts for identical tasks across the collected agentic trajectories from eight frontier LLMs on SWE-bench Verified.
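The variability measurement reduces to a max/min ratio of total tokens per task across repeated runs; a minimal sketch with invented task IDs and token counts:

```python
# Hypothetical sketch of the run-to-run variability measurement: for each
# task, take total token counts across repeated runs and report the
# max/min ratio. Task IDs and counts are invented.
runs = {
    "task-1": [12_000, 15_500, 340_000],  # one run ballooned
    "task-2": [8_000, 9_200, 10_100],
}
ratios = {task: max(tokens) / min(tokens) for task, tokens in runs.items()}
worst = max(ratios.values())
print(ratios, f"worst spread = {worst:.1f}x")
```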
ASC (adaptive stopping criterion) halts harmful refinement but incurs a 3.8 pp confidence-elicitation cost.
Reported experiment with ASC showing that it prevents harmful iterative refinement yet causes a measured cost described as 3.8 percentage points due to confidence elicitation.
Only o3-mini (+3.4 pp, EIR = 0%), Claude Opus 4.6 (+0.6 pp, EIR ≈ 0.2%), and o4-mini (±0 pp) remain non-degrading under self-correction; GPT-5 degrades by -1.8 pp.
Reported measured changes in accuracy (percentage-point changes) and measured EIR values for the named models after applying iterative self-correction across the experiment suite.
Across 7 models and 3 datasets (GSM8K, MATH, StrategyQA), we find a sharp near-zero EIR threshold (<= 0.5%) separating beneficial from harmful self-correction.
Empirical experiments reported across 7 LLMs and 3 benchmark datasets (GSM8K, MATH, StrategyQA) comparing outcomes of iterative self-correction as a function of measured EIR.
These efficiency gains are offset by a growing 'Efficiency-Legitimacy Paradox' (i.e., improvements in efficiency come with worsening legitimacy concerns).
Conceptual synthesis from the systematic review (2018–2026) identifying a recurring trade-off across reviewed studies; specific empirical quantification not provided in abstract.
There is a structural shift from 'street-level' bureaucracies to 'system-level' architectures, which can be defined as the institutional delegation of 'Artificial Discretion' to algorithmic infrastructures.
Synthesis from the PRISMA-guided systematic review of literature (2018–2026) reporting observed changes in administrative architectures; specific studies not enumerated in abstract.
As a General-Purpose Technology (GPT), Artificial Intelligence (AI) is fundamentally reconfiguring state capacity, as well as the mechanics of global economic management.
Systematic review of current research studies (2018–2026) conducted following PRISMA guidelines; synthesis of literature claiming broad institutional and macroeconomic effects. Number of studies not specified in abstract.
For LLM agents, memory management critically impacts efficiency, quality, and security.
Statement in paper framing and motivation; supported conceptually by literature linking memory design to system properties (no specific experimental details provided in abstract).
Coding patterns are bimodal: in 41% of sessions, agents author virtually all committed code ("vibe coding"), while in 23%, humans write all code themselves.
Empirical analysis of authorship attribution across the 6,000 sessions in the SWE-chat dataset; percentages derived from session-level classification.
A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but DPM exposes one nondeterministic call while summarization exposes N compounding calls.
Determinism experiment with 10 replays per case at temperature zero; qualitative/quantitative observation about number of nondeterministic LLM calls exposed by each architecture.
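A determinism study of this shape can be sketched as replaying each case N times and counting distinct outputs; the `replay` function below is a stand-in stub (a seeded simulation, not the paper's harness), and case IDs are invented:

```python
# Minimal sketch of the determinism check: replay each case N times and
# count distinct outputs; more than one distinct output signals residual
# nondeterminism. `replay` is a stand-in stub, not the paper's harness.
import random

def replay(case_id: str, seed: int) -> str:
    # Simulate one temperature-zero replay whose single API call is
    # nondeterministic ~10% of the time (seeded here for repeatability).
    return "out-A" if random.Random(f"{case_id}:{seed}").random() < 0.9 else "out-B"

def distinct_outputs(case_id: str, n_replays: int = 10) -> int:
    return len({replay(case_id, seed) for seed in range(n_replays)})

print({case: distinct_outputs(case) for case in ["case-1", "case-2"]})
```

Under this framing, an architecture exposing one nondeterministic call per run (DPM) yields fewer distinct replay outputs than one whose N summarization calls each compound the nondeterminism.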
Multi-agent workflows and benchmark evaluation reveal current capabilities, limitations, and research frontiers in agentic AI for physical design.
The paper states it analyzes recent experience with multi-agent workflows and benchmark evaluation; the abstract does not provide specific benchmark names, metrics, or sample sizes.
AI is associated with a shift toward younger, relatively less educated workers.
Reported association in the paper's baseline empirical results linking AI presence/pervasiveness to changes in workforce composition (age and education).
Given the results, educators should revisit pair programming as an educational tool in addition to embracing modern AI.
Authors' recommendation in the paper's conclusion based on experimental findings (performance, workload, emotion, retention outcomes).
Formal network verification has made substantial progress in proving correctness properties but is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior.
Authors' summary of the state of the art in network verification (assertion in paper; no empirical data in abstract).
Overall, the proposed HRL framework improves learning efficiency and scalability, outperforming heuristic baselines while remaining below the perfect-information oracle bound.
Results reported in the paper from simulation experiments comparing the HRL framework to heuristic baselines and the oracle; pairwise differences analyzed (Wilcoxon tests referenced). The paper asserts better performance than heuristics but still worse than the oracle.
How software developers interact with AI-powered tools, including Large Language Models (LLMs), plays a vital role in how those tools affect them.
Based on qualitative analysis of twenty-two interviews with software developers about using LLMs for software development; asserted as a central finding in the paper's analysis.
Benefits of technology and data analytics are context-dependent, with emerging markets facing unique regulatory and infrastructural barriers.
Narrative synthesis of included studies noting heterogeneity by context and reports of regulatory/infrastructural constraints in emerging markets.
Cybersecurity has a moderating effect on audit data analytics.
Synthesis statement in the review summarizing included studies that report cybersecurity influences the effectiveness/usability of audit data analytics.
CLARITI matches GPT-5's resolution rate on underspecified issues while generating 41% fewer questions.
Empirical evaluation comparing CLARITI and GPT-5 on a task set of underspecified software engineering issues; the result reported in the abstract indicates parity in resolution rate and a quantified reduction in questions (41%) but the abstract does not report sample size, test set composition, or statistical significance.
Such systems can produce fluent outputs that resemble reflection, but they lack temporal continuity, causal feedback, and anchoring in real-world interaction.
Descriptive claim made in the text contrasting surface-level fluency with missing properties; no empirical data or experiments provided.
This work establishes a foundation for understanding how generative AI systems not only augment cognitive performance but also reshape self-perception and perceived expertise.
Paper's stated contribution presenting theory and conceptual groundwork; no empirical validation provided in the abstract.
The LLM fallacy has implications for education, hiring, and AI literacy.
Implications and argumentation presented in the paper; these are prospective and conceptual rather than supported by empirical data in the abstract.
Further research is needed to explore the longitudinal impact of these AI deployments on local labor markets and the creation of indigenous datasets that reflect Cameroon’s unique linguistic diversity.
Authors' identified research gaps and recommendations; statement of future research needs rather than empirical result.
Removing safety layers made the system less useful: structured validation feedback guided the model to correct outcomes in fewer turns, while the unconstrained system hallucinated success.
Qualitative and quantitative comparisons from the deployed evaluation across the three conditions (observations about turn counts, validation-feedback loops, and model hallucinations in unconstrained condition over the 25 scenario trials).
Across all settings, AI Organizations composed of aligned models produce solutions with higher utility but greater misalignment compared to a single aligned model.
Reported experimental results aggregated across two practical settings (AI consultancy and AI software team) and 12 tasks; direct comparison between AI Organizations of aligned models and a single aligned model.
Multi-agent "AI organizations" are simultaneously more effective at achieving business goals, but less aligned, than individual AI agents.
Experimental comparison reported in the paper: experiments comparing multi-agent AI organizations to single aligned agents across tasks and settings (described below).
Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance.
Paper reports instances where top-performing (frontier) models outperform aggregate human expert accuracy on SciPredict, but concludes overall accuracies are insufficient for reliable experimental guidance.