Evidence (8570 claims)
Adoption
8570 claims
Productivity
7631 claims
Governance
6869 claims
Human-AI Collaboration
6491 claims
Org Design
4175 claims
Innovation
4114 claims
Labor Markets
3566 claims
Skills & Training
2966 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
The model shows when these systems become vulnerable to strategic use from within government.
Analytical result derived from the paper's formal theoretical model (no empirical validation reported).
The compliance layer can also create a stable approval boundary that political successors learn to navigate while preserving the appearance of lawful administration.
Stated conclusion/insight from the paper's formal argument and conceptual framing (theoretical, no empirical sample).
The near-uncorrelated rankings and rank shifts on the n=11 subset are driven by a strong negative Adoption-Capability correlation among closed-source high-capability agents within this subset.
Subgroup analysis/observation within the 11-agent SWE-bench overlap indicating a negative correlation between Adoption and Capability for closed-source high-capability agents (no numerical coefficient reported in the excerpt).
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment.
Conceptual statement in the paper; no empirical sample cited for this specific claim (framing/argumentation).
Self-assessment is a key bottleneck for market-style coordination of AI agents.
Conclusion drawn from empirical results (miscalibration findings, auction divergence, modest improvement from prior-information intervention) reported in the paper.
Auctions built from these self-reports diverge from a full-information allocation.
Simulation or empirical auction experiments using self-reported signals from the six LLMs on the 93 tasks, compared to a full-information allocation benchmark (method described in paper).
These LLMs are miscalibrated on both success probability and token usage.
Empirical evaluation of six LLMs on 93 SWE-bench Lite tasks assessing calibration of predicted success probabilities and token usage (as reported in the paper).
Standard PayGo degrades substantially under classroom-scale concurrency.
Empirical latency measurements and comparative analysis across throughput tiers and concurrency levels in the instrumented deployment.
Each student query triggers several concurrent API calls whose latencies compound through a parallel-phase maximum effect that single-agent systems do not face.
Architectural description and instrumentation of the four-agent ITAS system (paper reports measurements and latency analysis across tiers and concurrency levels).
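The parallel-phase maximum effect in the claim above can be illustrated with a minimal sketch. It assumes (hypothetically, not from the paper) that a query runs as sequential phases, that all calls within a phase run concurrently, and that per-call latencies are exponentially distributed. The function names and the latency distribution are illustrative only.

```python
import random

def query_latency(phases):
    """Latency of one query.

    `phases` is a list of phases; each phase is a list of per-call latencies.
    Phases run one after another, calls inside a phase run concurrently,
    so each phase costs as much as its slowest call.
    """
    return sum(max(calls) for calls in phases)

def mean_phase_latency(n_calls, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean latency of a phase with `n_calls`
    concurrent calls, assuming exponential per-call latency with mean 1.0
    (an assumption made purely for illustration)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(n_calls))
    return total / trials

if __name__ == "__main__":
    # A two-phase query: three concurrent calls, then one follow-up call.
    print(query_latency([[0.8, 1.2, 2.0], [0.5]]))
    # Mean phase latency grows with the number of concurrent calls,
    # even though each individual call has the same distribution.
    for n in (1, 2, 4):
        print(n, round(mean_phase_latency(n), 2))
```

Under these assumptions a phase with four concurrent calls is slower on average than a single call, because the phase waits for the slowest of the four. A single-agent system issues one call per step and does not pay this penalty.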
In the absence of intervention, individually rational adoption of genAI will assuredly and profoundly reduce collective welfare.
Conclusion drawn from the paper's theoretical model (normative/predictive claim based on model dynamics; no empirical validation or sample reported in abstract).
Habit formation around genAI use can couple otherwise separate domains, so that adoption in low-stakes tasks spills over into high-value tasks and amplifies welfare losses.
Theoretical/model-based claim showing coupling across domains via habit formation (model extension; no empirical sample reported in abstract).
The introduction of genAI—while initially beneficial at the individual level—will reduce social welfare for the most important types of tasks.
Model-derived result: theoretical analysis indicates social-welfare reductions in high-value tasks despite individual gains (no empirical sample reported in abstract).
Generative models are vulnerable to model collapse: when trained on data generated by earlier versions of themselves, their outputs can lose diversity and accuracy.
Theoretical claim / conceptual claim presented in the paper (no empirical sample size given in abstract); refers to degradation of model outputs when trained on self-generated data.
Frontier models fail to accurately predict their own token usage (with weak-to-moderate correlations, up to 0.39) and systematically underestimate real token costs.
Evaluation of models' self-predicted token cost versus realized token usage across agentic runs on SWE-bench Verified; reported correlations up to 0.39 and systematic underestimation bias.
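The two quantities named in this entry, the correlation between predicted and realized token usage and the systematic underestimation bias, could be computed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the paper's exact correlation measure is not given in the excerpt (Pearson is assumed here), and the sample numbers are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def mean_signed_error(predicted, realized):
    """Average of (predicted - realized); a negative value means the
    predictions underestimate realized usage on average."""
    return sum(p - r for p, r in zip(predicted, realized)) / len(predicted)

if __name__ == "__main__":
    # Made-up token counts for five runs, for illustration only.
    predicted = [20_000, 35_000, 15_000, 50_000, 30_000]
    realized = [90_000, 60_000, 120_000, 150_000, 70_000]
    print(pearson(predicted, realized))
    print(mean_signed_error(predicted, realized))
```

A low correlation combined with a negative mean signed error would match the pattern described: predictions that track realized cost only loosely and fall short of it on average.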
Models vary substantially in token efficiency: on the same tasks, Kimi-K2 and Claude-Sonnet-4.5, on average, consume over 1.5 million more tokens than GPT-5.
Cross-model comparisons of average total token consumption per task run across the eight evaluated LLMs on SWE-bench Verified; paper reports average differential between named models and GPT-5.
Input tokens rather than output tokens drive the overall cost of agentic tasks.
Breakdown of token usage into input vs output token components from the analyzed agentic task trajectories on SWE-bench Verified (across the eight LLMs evaluated).
Agentic tasks are uniquely expensive, consuming 1000x more tokens than code reasoning and code chat.
Empirical measurement of token counts from agentic coding task runs compared to runs labeled as code reasoning and code chat across the evaluated trajectories (paper reports comparisons on SWE-bench Verified across eight frontier LLMs).
Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective.
Author assertion in the paper's introduction/abstract describing the state of practice; no empirical method, dataset, or sample size reported in the excerpt.
A vulnerability class is characterised for expected-utility maximisers that makes them susceptible to adversarial gambles.
Formal characterization/definition and analytical derivation in the paper describing which expected-utility maximisers are vulnerable to adversarial (Pascal-type) offers; theoretical examples provided rather than empirical tests.
Industry digital maturity weakens the effect of the peer leader on a focal firm’s AI adoption.
Interaction/heterogeneity analysis in fixed-effects regression models on panel data of publicly listed Chinese firms (2012–2023), using an industry digital maturity moderator.
Current evaluation proxies are insufficient for predicting downstream human impact.
Empirical results in the paper showing decoupling between standard quantitative proxies (e.g., sparsity, faithfulness) and human outcomes (clarity, decision utility, confidence) across datasets and analyst reviews.
A highlighting policy that is optimal for sophisticated agents can perform arbitrarily poorly when deployed to naive agents.
Constructive worst-case examples and theoretical bounds in the paper demonstrating arbitrarily large performance degradation when applying sophisticated-optimal policies to naive agents.
Optimizing highlighting for sophisticated agents can be computationally intractable, even in simple discrete and binary settings.
Theoretical complexity results and proofs in the paper showing hardness of the optimization problem under the sophisticated-agent model; no sample/calibration required (formal/algorithmic analysis).
The regulatory architecture is in place; the verification instrument is not.
Paper's high-level diagnosis asserting that regulations establish obligations but lack a technical instrument for quantitative verification of acceptable risk.
The systems most in need of oversight are opaque statistical inference engines that resist white-box scrutiny.
Paper's characterization/analysis of contemporary high-risk AI systems as opaque statistical models that are difficult to inspect via white-box methods.
This gap is not theoretical: as the EU AI Act moves into full enforcement, developers face mandatory conformity assessments without established methodologies for producing quantitative safety evidence.
Argument in paper linking imminent enforcement of EU AI Act to practical conformity-assessment requirements for developers and asserting lack of established methodologies for quantitative safety evidence.
None [of these regulatory frameworks] specifies what 'acceptable risk' means in quantitative terms, and none provides a technical method for verifying that a deployed system actually meets such a threshold.
Paper's critical analysis of existing regulatory instruments, arguing absence of quantitative definitions and verification methods.
There is a stark geopolitical divide between 'AI Core' nations and the Global South; the Global South faces acute risks of 'Digital Dependency' and eroded digital sovereignty.
Cross-study synthesis in the systematic review (2018–2026) identifying geopolitical patterns and risks; abstract does not quantify the number of studies or present empirical effect sizes.
The 'black box' nature of automated systems undermines the democratic social contract and principles of procedural justice, epitomised by the Australian 'Robo-debt' scandal.
Case study material and literature synthesized in the systematic review referencing the Australian Robo-debt case as an exemplar; abstract does not provide primary data or sample sizes.
Agentic AI introduces novel challenges related to market stability, regulatory compliance, interpretability, and systemic risk.
Survey discussion synthesizing literature on systemic and governance risks of autonomous systems in markets; draws on conceptual and empirical prior work but does not present new quantitative results.
Scalable AI tutoring for procedural skill learning requires structured knowledge representations, yet constructing these representations remains a labor-intensive bottleneck.
Background/claim made in the paper's introduction framing the problem; no specific quantitative evidence reported in the abstract.
Under-represented groups tend to be systematically under-observed because of historical exclusion and selective feedback, which exacerbates uncertainty for those groups.
Conceptual claim supported by illustrative examples (e.g., lending context) and simulations demonstrating selective feedback effects; literature citation likely included in paper.
Policies that ignore the unobserved (counterfactual) space can harm decision makers (via unrealized gains or losses) and subjects (via compounding exclusion and reduced access).
Theoretical argumentation and illustrative examples (e.g., loan denial counterfactuals) and modelled simulations showing downstream harms when ignoring unobserved outcomes.
Experiments on simulated data with varying bias show that unequal uncertainty and selective feedback produce disparities across groups.
Simulation experiments described in the paper manipulate bias and feedback patterns and report resulting group disparities (synthetic datasets; experiment details in methods/results sections).
Industrial firms face a dual challenge: (1) the development and deployment of digital technologies and (2) the proliferation and integration of the corresponding skills portfolios.
Conceptual framing and literature synthesis presented in the paper (identification by authors); not tied to a specific quantitative sample in the provided text.
Renewable energy adoption further reinforces the beneficial effect of digital trade on emissions under stronger regulatory stringency (mediation via renewable energy and regulation).
Structural equation modelling (SEM) on the monthly panel (38 OECD economies, 2000–2024) assessing mediation paths through renewable energy adoption and regulatory stringency; reported as reinforcing the digital trade effect.
There is a carbon-pricing threshold at USD 40 per tonne, above which emissions decline significantly (Δ = −15%, p < 0.01).
Carbon-pricing threshold analysis applied to the monthly panel of 38 OECD economies (2000–2024); threshold identified and associated pre/post comparison reports a 15% decline with p < 0.01.
The environmental effect of digital trade becomes stronger (more negative on emissions) when combined with AI-enhanced logistics (interaction effect).
Econometric models including interaction terms for AI-enhanced logistics and digital trade on the monthly panel (38 OECD economies, 2000–2024); interaction effects identified via regression and machine-learning threshold techniques.
GVC participation is significantly associated with lower CO2 emissions (β = −0.064, p < 0.01).
Econometric analysis on a monthly panel of 38 OECD economies from 2000–2024 using fixed-effects models; coefficient and p-value reported in paper.
Traditional forecasting and optimization approaches often operate in isolation, limiting their real-world effectiveness in volatile-demand, uncertain-supply industries.
Positioning/background statement in the paper motivating the integrated framework (literature-based claim).
The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable: each decision the agent makes is recorded directly in cells that belong to and reflect on the user.
Conceptual / domain-specific argument made by the authors (no empirical sample attached to the claim).
AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution: by the time users receive the output, all underlying decisions have already been made without their involvement.
Author assertion / conceptual description in the paper (no empirical quantification provided for this general statement).
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.
Author assertion / literature-level observation presented in the paper (no empirical sample reported for this claim).
A threat model taxonomy mapping misuse vectors to hardware, software, institutional, and liability layers illustrates why no single governance mechanism suffices.
Threat model taxonomy developed in the paper (conceptual taxonomy; illustrative mapping rather than empirical testing).
Restricting access to open-weight models deepens asymmetries while driving proliferation into unsupervised settings.
Argumentation and threat-model reasoning in the paper describing likely consequences of restrictions (theoretical analysis; no empirical sample cited).
Access restrictions, without governed alternatives, may displace risks rather than reduce them.
Theoretical argument and threat-model analysis in the paper showing possible risk displacement (conceptual reasoning; no empirical sample reported).
Selective forgetting remains underexplored compared to retention in LLM agent memory research.
Authors' literature survey / position statement in paper (assertion made in abstract).
Beyond technical barriers there are organizational ones: a persistent AI literacy gap, cultural heterogeneity, and governance structures that have not yet caught up with agentic capabilities.
Interview data (over 30 interviews) reporting organizational challenges including limited AI literacy, diverse cultural attitudes across organizations, and lagging governance relative to agentic AI capabilities.
Adoption is constrained less by model capability than by fragmented and machine-unfriendly data, stringent security and regulatory requirements, and limited API-accessible legacy toolchains.
Stakeholder interviews (over 30) reporting barriers to deployment; qualitative synthesis identifies data fragmentation, security/regulatory requirements, and legacy toolchain access as primary constraints.
Users push back against agent outputs (through corrections, failure reports, and interruptions) in 44% of all turns.
Turn-level coding of user behavior in the SWE-chat dataset: proportion of conversational turns containing correction/complaint/interrupt signals, computed across >63,000 user prompts and sessions.