Evidence (13827 claims)
Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
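The matrix can be read more easily as shares than as raw counts. The sketch below is one way to do that for three rows copied from the table; the function names are illustrative and not part of the source. In some rows the four direction columns add up to less than the listed Total (Firm Productivity: 598 listed by direction versus 604 in Total). The source does not explain the difference, so the sketch reports the gap instead of assuming a cause, and it computes shares over the claims that do have a listed direction.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, listed_total).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
    "Error Rate": (69, 91, 10, 2, 172),
}


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the claims that have a listed direction."""
    counts = {"positive": positive, "negative": negative, "mixed": mixed, "null": null}
    directed = sum(counts.values())
    return {name: value / directed for name, value in counts.items()}


def unlisted(positive, negative, mixed, null, listed_total):
    """Claims counted in Total but absent from the four direction columns."""
    return listed_total - (positive + negative + mixed + null)


for outcome, row in ROWS.items():
    shares = direction_shares(*row[:4])
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, "
        f"{unlisted(*row)} claims outside the four columns"
    )
```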
Artificial intelligence (AI) has a positive but weaker impact on sustainable development relative to digital transformation, reflecting its complementary and maturity-dependent role within the digital ecosystem.
System GMM regressions on a panel of MENA economies (2010–2023) that include measures of both AI and digital transformation; the reported coefficient for AI is positive but smaller.
Digital transformation is the primary driver of sustainable development in MENA economies, exerting a stronger and more consistent effect than AI.
Dynamic panel data analysis of MENA economies (2010–2023) using System GMM; reported comparative effect sizes of digital transformation vs. AI in regression results.
In the ICT industry, Tobin's Q significantly increased following AI adoption (heterogeneous positive effect).
Subgroup/heterogeneity analysis within the main sample (KOSDAQ firms 2018–2025), estimating the post-adoption effect of AI on Tobin's Q in firms classified as ICT.
The authors propose corresponding analytical extensions to the framework to address the three structural breaks in agentic systems.
Paper presents proposed analytical extensions (methodological proposals) tied to each identified structural break.
Cross-architecture comparison reveals a governance coverage gradient: deterministic rule engines achieve full DES-property fillability.
Analytic cross-architecture comparison reported in the paper (comparative analysis across four architectures); deterministic rule engines identified as achieving 'full' fillability of DES-properties.
The paper synthesizes an operational governance evidence framework composed of: structural accountability collapse diagnostics, decision trace schemas, evidence sufficiency measurement, and label-free monitoring, integrated into a chain.
Methodological contribution: authors construct and present a synthesized framework from those four components (conceptual/analytical synthesis).
The Barcelona Declaration offers a promising forum for boundary governance.
Policy recommendation pointing to an existing initiative (Barcelona Declaration) as a suitable forum; stated without empirical evaluation in the excerpt.
Governance should calibrate the annulus, not abolish it: thin enough to serve research efficiently, wide enough to sustain innovation.
Normative policy recommendation from the authors; based on their conceptual framework rather than on empirical policy evaluation in the excerpt.
Artificial intelligence reshapes the annulus by lowering barriers to basic structuring.
Conceptual claim in the paper; asserted as an effect of AI on metadata production without empirical estimates in the excerpt.
The proposed framework is intended to serve as a practical reference for engineering teams and decision-makers navigating enterprise LLM adoption.
Author statement of intent in the paper (qualitative claim about intended audience and utility).
The buy-versus-build decision should be viewed as a phased continuum: initial API adoption can give way to hybrid architectures as organizational maturity and requirements evolve.
Conceptual argument in the paper, illustrated by the Bills Converter experience (single-case narrative recommending phased/hybrid progression).
In the end-to-end development of the Bills Converter, the authors chose a closed-source, API-based approach over self-hosted or custom-built alternatives.
Case study: the Bills Converter system (single end-to-end project described in the paper).
This paper presents a multi-dimensional decision framework that synthesizes technical, financial, and strategic considerations into a coherent evaluation methodology for enterprise LLM adoption.
The paper is explicitly framed as presenting a decision framework; supported by conceptual synthesis and exposition within the manuscript (no reported quantitative validation).
At the country level, digitalisation and workplace training provision steepen the exposure–adoption gradient.
Country-level heterogeneity analysis using the 2024 EWCS (35 countries) linking national measures of digitalisation and prevalence of workplace training to stronger occupational exposure–adoption relationships.
Individual skills, non-routine cognitive job content within occupations, and employee say in organisational decisions steepen the exposure–adoption gradient.
Interaction and stratified analyses from the 2024 EWCS showing stronger exposure–adoption associations among workers with higher individual skills, more non-routine cognitive job content (within occupations), and greater employee influence over organisational decisions; sample >36,600 workers.
Occupational exposure to AI strongly predicts uptake.
Associational/regression analysis using the 2024 EWCS linking occupation-level measures of AI exposure to individual-level self-reported adoption; sample >36,600 workers across 35 countries.
Self-reported generative AI adoption averages 12% but ranges from under 3% to 25% across countries.
Descriptive analysis of the 2024 European Working Conditions Survey (EWCS), sample of more than 36,600 workers in 35 countries; country-level tabulations of self-reported generative AI adoption.
Our baseline model finds evidence that AI is productivity enhancing.
Results from the paper's stated baseline empirical model using BEA industry-account-based measures; model specification described by authors.
States can adjust their foreign policies to this fact by focusing on resilience, technological sovereignty, strategic decoupling, and coordination through alliances.
Policy-prescriptive recommendations based on the paper's theoretical framework and analysis; no empirical testing or sample size reported in the abstract.
ClawNet enables multiple users to collaborate securely through their respective agents.
Capability claim about the instantiated system (authors assert that ClawNet enables secure multi-user collaboration; excerpt contains no empirical security evaluation or user study).
We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator.
Implementation claim: authors state they built ClawNet as an instantiation of their paradigm (paper describes framework/architecture; no experimental evaluation included in excerpt).
Action-level accountability logs every operation against its owner's identity and authorization, ensuring full auditability.
Design claim describing an accountability primitive (paper asserts logging and auditability as a property; no audit or verification evidence shown in excerpt).
Scoped authorization enforces per-identity access control and escalates boundary violations to the owner.
Design/specification claim describing the scoped authorization governance primitive in the proposed paradigm (no empirical or security evaluation provided in excerpt).
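The scoped authorization and action-level accountability primitives described above can be illustrated with a minimal sketch. Everything here is hypothetical: the class names, the scope strings, and the log format are assumptions made for illustration, and the excerpt does not describe ClawNet's actual interfaces. The idea shown is only that a central checkpoint tests each action against the acting identity's scopes, records every attempt against the owner, and escalates violations to that owner.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class IdentityAgent:
    """Hypothetical context-specific agent with an explicit set of allowed scopes."""
    owner: str
    name: str
    scopes: frozenset


@dataclass
class Orchestrator:
    """Hypothetical central checkpoint that authorizes and logs each action."""
    audit_log: list = field(default_factory=list)
    escalations: list = field(default_factory=list)

    def perform(self, agent, action, scope):
        allowed = scope in agent.scopes
        # Every attempt, allowed or not, is recorded against the owner's identity.
        self.audit_log.append(
            {
                "owner": agent.owner,
                "agent": agent.name,
                "action": action,
                "scope": scope,
                "allowed": allowed,
            }
        )
        if not allowed:
            # Boundary violations are surfaced to the owner instead of being executed.
            self.escalations.append((agent.owner, action, scope))
        return allowed
```

In this sketch a denied action is never executed, but it still leaves an audit record, which is what makes the log usable for after-the-fact review.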
The paradigm rests on three governance primitives: (1) a layered identity architecture that separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication.
Architectural/design claim describing the proposed layered identity primitive (presentation of design; no empirical validation in excerpt).
We propose a human-symbiotic agent paradigm in which each user owns a permanently bound agent system that collaborates on the owner's behalf, forming a network whose nodes are humans rather than agents.
Design proposal / conceptual architecture presented in the paper (no large-scale deployment or empirical evaluation described in excerpt).
The next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships.
Normative/strategic claim advanced by the authors as the central thesis (conceptual argument, no empirical test reported).
Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate.
Theoretical/argumentative claim presented as background motivation (conceptual reasoning, citation not provided in excerpt).
Time Series Augmented Generation (TSAG) enables LLM agents to delegate quantitative tasks to verifiable external tools.
Description of TSAG framework in paper stating delegation mechanism to external verifiable tools for quantitative computations.
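A minimal sketch of the delegation idea: the agent emits a structured tool call, and the number in the answer comes from a deterministic function that can be re-run and checked. The tool names, the call format, and the `run_tool_call` helper are assumptions made for illustration; the excerpt does not specify TSAG's actual tool set or call schema.

```python
import statistics


def mean_return(prices):
    """Average simple period-over-period return of a price series."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return statistics.fmean(returns)


def max_drawdown(prices):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = prices[0]
    worst = 0.0
    for price in prices:
        peak = max(peak, price)
        worst = max(worst, (peak - price) / peak)
    return worst


# Hypothetical registry of tools the agent is allowed to call.
TOOLS = {"mean_return": mean_return, "max_drawdown": max_drawdown}


def run_tool_call(call, series):
    """Execute a structured tool call; the value is computed by the tool, not the model."""
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    value = TOOLS[name](series[call["series"]])
    return {"tool": name, "series": call["series"], "value": value}
```

Because each result records which tool and which series produced it, an evaluator can recompute the value independently and compare it with what the agent reported.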
We publicly release the evaluation framework and empirical insights to foster standardized research on reliable financial AI.
Paper states that the framework, benchmark, and empirical results are released publicly by the authors.
The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm.
Empirical results from the authors' experiments on the 100-question benchmark across multiple agents; paper states agents achieve 'near-perfect' tool-use accuracy and 'minimal' hallucination.
We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools.
Paper reports applying the TSAG framework in an empirical study in which agents call external tools to perform quantitative computations; described as 'large-scale' and implemented by the authors.
We introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis.
Paper describes a new methodology and benchmark (Time Series Augmented Generation, TSAG) developed by the authors for evaluating LLM reasoning on financial time-series tasks.
Effective evaluation-driven loop scaling is a central axis for advancing LLM-driven scientific discovery, and SimpleTES provides a simple yet practical framework for realizing these gains.
High-level claim supported by the aggregate experimental results and discussion in the paper.
When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover.
Experiments in which models were post-trained on successful SimpleTES trajectories and evaluated on both seen and unseen problems (paper claim of improved efficiency and generalization).
SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning.
Methodological claim and supporting experiments where SimpleTES generates solution trajectories that are then used as supervision for learning.
We discovered new Erdős minimum overlap constructions that surpass the best-known results.
Reported novel combinatorial constructions (Erdős minimum overlap) in the experiments that improve on prior best-known results.
We designed quantum circuit routing policies that reduce gate overhead by 24.5%.
Experimental results reported for quantum circuit routing tasks showing a 24.5% reduction in gate overhead when using SimpleTES-designed policies.
We sped up the widely used LASSO algorithm by over 2x.
Benchmarking experiment reported in the paper comparing LASSO runtime/performance with and without SimpleTES (paper states >2x speedup).
SimpleTES consistently outperforms both frontier-model baselines and sophisticated optimization pipelines.
Comparative experimental evaluation vs. frontier-model baselines and optimization pipelines across the reported problems (paper claim).
Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models.
Empirical experiments reported across 21 problems in six domains using gpt-oss models.
We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection.
Methodological contribution described in the paper (framework design and algorithmic description).
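The three ingredients named above (parallel exploration, feedback-driven refinement, local selection) can be shown on a toy problem. This is not the authors' algorithm: the evaluator is a simple quadratic, the "refinement" is a random perturbation standing in for a model proposing a revision, and all names and parameters are assumptions made for illustration.

```python
import random


def evaluate(x):
    """Toy evaluator standing in for a problem-specific scorer (higher is better)."""
    return -(x - 3.0) ** 2


def refine(x, rng, step=0.5):
    """Stand-in for a model proposing a revision; here a random perturbation."""
    return x + rng.uniform(-step, step)


def search(num_chains=4, rounds=50, proposals=3, seed=0):
    rng = random.Random(seed)
    # Parallel exploration: several independent chains start from different points.
    chains = [rng.uniform(-10.0, 10.0) for _ in range(num_chains)]
    history = [[x] for x in chains]
    for _ in range(rounds):
        for i, current in enumerate(chains):
            # Feedback-driven refinement: propose revisions and score them.
            candidates = [current] + [refine(current, rng) for _ in range(proposals)]
            # Local selection: each chain keeps its own best candidate.
            chains[i] = max(candidates, key=evaluate)
            history[i].append(chains[i])
    winner = max(chains, key=evaluate)
    return winner, history
```

The per-chain `history` lists are the toy counterpart of trajectory-level histories: each records a sequence of candidates whose scores never decrease, because the current candidate always competes with its own revisions.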
Given historical inequities in housing placement, it is crucial to audit LLM use in this context.
Authors' policy/recommendation motivated by historical inequities in housing placement and their empirical audit findings; presented as an argument in the report rather than a quantified experimental result.
Leveraging LLMs to augment tabular classification with casenote summaries can safely incorporate additional text information with low implementation burden.
Authors' reported experiments and practical assessment on augmenting tabular classifiers with LLM-derived casenote summaries from a nonprofit outreach dataset; described as having low implementation burden and being safe to use. (No sample size given in abstract.)
A fine-tuned model augmented with casenote summaries can improve accuracy while reducing algorithmic fairness disparities on the housing placement multi-class classification task.
Empirical audit of LLM-based tabular classification on a real housing placement prediction task augmented with street outreach casenotes from a nonprofit partner; authors report multi-class classification experiments comparing fine-tuned models with and without casenote summaries and auditing error disparities across groups. (Sample size not stated in the abstract.)
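Auditing error disparities across groups, as described above, can be sketched as a per-group error rate plus the gap between the best- and worst-served group. The function names and the choice of a max-minus-min gap are assumptions made for illustration; the excerpt does not state which fairness metric the authors used.

```python
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Misclassification rate for each group label."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        errors[group] += int(truth != prediction)
    return {group: errors[group] / totals[group] for group in totals}


def max_disparity(rates):
    """Gap between the highest and lowest group error rates."""
    return max(rates.values()) - min(rates.values())
```

Comparing `max_disparity` for a model with and without casenote summaries, alongside overall accuracy, would show whether an accuracy gain came with a wider or narrower gap between groups.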
There is a positive relationship between disagreement among agents and trading volume in the simulated markets.
Observed correlation in the simulated open-call auction between measured disagreement (e.g., dispersion in beliefs) and trading volume; described as replicating classic experimental findings.
These individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices.
Aggregation of simulated agent behavior in the open-call auction producing market-level time series; comparison of market dynamics to classic experimental benchmark (Smith et al., 1988) and reported finding that excess demand predicts future prices.
AI agents form recency-weighted extrapolative beliefs (i.e., overweight recent price history when forecasting future prices).
Analysis of agents' forecasts and trading behavior in the simulated open-call auction populated by autonomous LLM agents; identification of extrapolative forecasting patterns reported as a main finding.
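One simple way to express a recency-weighted extrapolative belief is a forecast equal to the last price plus a weighted average of past price changes, with weights that decay geometrically into the past. The functional form and the `decay` parameter are assumptions made for illustration; the excerpt does not give the specification the authors estimated.

```python
def extrapolative_forecast(prices, decay=0.5):
    """Last price plus a recency-weighted average of past price changes."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    if not changes:
        return prices[-1]
    # The most recent change gets weight 1, the one before it `decay`, then decay**2, ...
    recent_first = changes[::-1]
    weights = [decay ** k for k in range(len(recent_first))]
    expected_change = sum(w * c for w, c in zip(weights, recent_first)) / sum(weights)
    return prices[-1] + expected_change
```

With `decay=1.0` every past change counts equally; lowering `decay` makes the forecast react more strongly to the latest move, which is the overweighting of recent history that the claim describes.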
AI agents exhibit a pronounced disposition effect.
Simulated open-call auction populated by autonomous LLM agents in experimental asset-market simulations; behavioral trading data showing agents' selling/holding patterns (paper describes this as a main documented finding).
We propose seven interface primitives operationalizing verification-centered HCI.
Design contribution: specification of seven interface primitives within the paper (conceptual/design proposal); no user-study or empirical validation reported.
We map synthetic literacy -- oral input generating literate output -- as the defining feature of this transition.
Conceptual mapping and theoretical framing within the paper; supported by examples from technology trends but no empirical evaluation reported.