Evidence (14055 claims)
Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
The paper examines operational logic, defining features and emerging use cases of agentic payments across retail, e-commerce and decentralised finance.
Stated scope in the abstract; analysis and case-study-driven review across specified sectors (retail, e-commerce, DeFi). No sample sizes reported.
Agentic payments refer to transactions initiated and completed by AI agents without direct human intervention.
Explicit definitional statement in the abstract (conceptual definition provided by the authors).
Current evidence does not support the simple claim that autonomous code generation automatically improves engineering outcomes.
Synthesis of mixed results from controlled studies, meta-analyses, and benchmarks reported in the paper (no single sample size given in abstract).
However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance.
Reported comparison between the coordinated workflow and a strong combined-summary baseline for exoplanet vetting, which indicates no meaningful improvement.
All [the listed orchestration frameworks] follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn.
Author assertion based on architectural analysis of the listed frameworks (observation of orchestration pattern in the named projects).
The study used established measurement scales to assess AI-driven learning culture, knowledge orchestration, organisational intelligence and innovation performance.
Methods: authors report use of established scales for AIDLC, KO, OI and IP in the questionnaire.
Structured questionnaires were distributed between March and October 2025 to employees involved in innovation, learning and project management roles in Karachi, Lahore and Islamabad.
Methods section description of data collection period, target respondent roles, and cities covered.
Most respondents held undergraduate or postgraduate degrees in computer science, engineering or business-related disciplines.
Sample demographic summary from the survey (N=348).
After screening the data, 348 valid responses were analyzed.
Structured questionnaires distributed March–October 2025 to employees in medium and large IT firms in Karachi, Lahore and Islamabad; screening produced 348 valid responses (sample description in methods).
The paper draws on empirical studies from 2024–2026.
Methodological statement in the paper specifying the time window of empirical studies used in the analysis.
This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks.
Comparative evaluation reported in the paper showing that single-threshold (binary) scoring metrics do not exhibit the inverse-scaling pattern observed with tail-inclusive distributional metrics (specific metrics and calculations not given in excerpt).
Domain knowledge does not reliably rescue calibration.
Experiments reported in the paper where domain-knowledge interventions (procedures or prompts incorporating domain knowledge) were applied and did not consistently improve forecast calibration (details not provided in excerpt).
Using large language models, we measure the AIO level of Chinese listed companies from 2010 to 2023.
Authors report constructing firm-level measures of artificial intelligence orientation (AIO) by applying large language models to corporate texts/disclosures for Chinese listed companies over the 2010–2023 period.
We compared the traits causing the incidents with the traits that 197 developers building AI systems for those tasks would have preferred.
Study design: comparison between trait set responsible for incidents (from incident reports) and stated developer preferences collected from a sample of 197 developers working on those tasks.
We compared the extracted traits with the traits that 202 workers highly familiar with those tasks would have preferred.
Study design: a comparison between LLM-extracted traits from incident reports and stated preferences from a sample of 202 workers familiar with the tasks.
We used an LLM-as-an-expert approach to extract the main traits of the AI systems involved in those incidents using an established framework of twelve traits.
Methods statement: applied a Large Language Model to code/extract AI system traits from the incident reports using an established 12-trait framework.
We analyzed 1,524 reports of incidents in which AI systems were used to perform 171 occupational tasks across 12 industry sectors.
Descriptive statement in paper: dataset comprised 1,524 incident reports, covering 171 occupational tasks and 12 industry sectors (dataset construction / corpus used for analysis).
This study provides the first cross-class synthesis covering raw materials, work-in-process, and finished goods within a unified evaluative framework, positioning machine learning and deep reinforcement learning methods alongside classical policy families and quantifying the boundary conditions for each approach.
Author-stated theoretical contribution and scope of the review (coverage of raw materials, WIP, finished goods and methods).
A random-effects model estimated by restricted maximum likelihood was applied to pool percentage cost-reduction effect sizes across 18 studies admissible to quantitative synthesis.
Methods reported in the paper: random-effects meta-analysis using REML across 18 studies eligible for quantitative pooling.
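To illustrate what pooling under a random-effects model involves, here is a minimal sketch. It is not the authors' pipeline: it uses the closed-form DerSimonian-Laird moment estimator for the between-study variance instead of the REML estimator the paper reports, and the function name and the example inputs are hypothetical.

```python
def pool_random_effects(effects, variances):
    """Pool study effect sizes with a random-effects model.

    Assumption: between-study variance (tau^2) is estimated with the
    DerSimonian-Laird moment estimator, a simpler stand-in for REML.
    Returns (pooled_effect, standard_error, tau2).
    """
    k = len(effects)
    # Inverse-variance (fixed-effect) weights.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Truncate at zero: a variance cannot be negative.
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, se, tau2


# Hypothetical percentage cost reductions and their sampling variances.
pooled, se, tau2 = pool_random_effects([12.0, 8.5, 15.0], [4.0, 2.5, 6.0])
```

When the studies agree closely, tau2 is truncated to zero and the result coincides with a fixed-effect pooled estimate.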
A systematic review and meta-analytic synthesis of 31 peer-reviewed studies published between 2004 and 2025 was conducted following the PRISMA 2020 protocol.
Study methods reported in the paper: systematic review following PRISMA 2020; sample of 31 peer-reviewed studies dated 2004–2025.
The study uses AI patent counts as the indicator of artificial intelligence.
Methodological note: AI patent counts are used as the dependent variable; data: G8 countries plus Türkiye, 2010-2020.
The study uses PyQu to quantify changes across five quality attributes for Python code.
Methodological description: application of PyQu (an ML-based quality assessment tool for Python) to measure five quality attributes before and after refactoring edits.
From the observed diffs, we derive a taxonomy of 24 recurring change operations.
Manual/automated analysis of diffs from the studied agentic refactoring PRs to identify and categorize recurring change operations into a 24-item taxonomy.
We will release the reanalysis pipeline to support replication.
Authors' statement of intent in the paper to release code/pipeline for replication.
In offensive cybersecurity, the marginal benefit of Skills collapses: the spread between the no-Skills and full-Skills conditions is only 8.9 percentage points (p = 0.71, χ²; p = 0.25, Cochran–Armitage trend test; five of six pairwise Cohen's h values fall below the 0.2 small-effect threshold).
Statistical re-analysis of the 180-run CTF study comparing no-Skills vs full-Skills conditions: reported spread = 8.9 percentage points; reported p-values from χ² and Cochran–Armitage trend tests; reported Cohen's h comparisons.
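The Cohen's h values mentioned above compare two proportions on an arcsine-transformed scale. A minimal sketch of that calculation follows; the success rates in the example are hypothetical placeholders, since the excerpt reports only the 8.9-point spread and not the per-condition rates.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions in [0, 1]."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def below_small_effect(p1, p2, threshold=0.2):
    """True when |h| falls under the conventional 0.2 small-effect threshold."""
    return abs(cohens_h(p1, p2)) < threshold


# Hypothetical success rates for two documentation conditions.
h = cohens_h(0.60, 0.52)
print(round(h, 3), below_small_effect(0.60, 0.52))
```

Because h is computed on transformed proportions, the same percentage-point gap yields a different h depending on where the two rates sit between 0 and 1.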
Those four documentation conditions correspond almost exactly to a No-Skills, Experiential-Skills, Curated-Skills, and Comprehensive-Skills ablation.
Authors map the four documentation-line-count conditions from the re-analyzed study to skill-ablation categories (No/Experiential/Curated/Comprehensive) as part of their interpretive re-analysis.
We re-analyze a recently published 180-run controlled study of an MCP-grounded autonomous Capture-the-Flag (CTF) agent under four documentation conditions (55, 1,478, 1,976, and 4,147 lines).
Authors' re-analysis of an existing controlled study consisting of 180 runs and four documentation conditions with the stated line counts; this is a descriptive claim about the re-analysis dataset and experimental conditions.
Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate.
Empirical evaluation: 660 trials run using Claude Code on the minimal-pair repos with hidden tests; reported comparison of pass rates between clean and messy repo variants showing no change.
Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity.
Measurement approach stated in the abstract (unified rubric with listed dimensions).
The study uses three LLM systems: ChatGPT, Claude, and Grok.
Method description in the paper's abstract naming the three LLMs evaluated.
The evaluation covers four task types: summarization, planning, explanation, and coding.
Method description in the paper's abstract listing the four task types used for evaluation.
The study compares three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt.
Experimental design described in the paper (three prompt conditions stated in the abstract).
Large language models (LLMs) are widely used for open-ended tasks.
Stated as background/context in the paper's introduction; no quantitative data reported in the abstract.
We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao.
Statement of experimental methodology describing the types of evaluations performed (public datasets, simulated auctions, and online deployment).
This study used a controlled mixed-design experiment with 60 participants who completed analytical survival ranking tasks in multi-turn human–AI collaborations, with pre/post measurements and two types of prompting training (general or sycophancy-focused).
Methodological description in the paper's abstract/summary.
Reported empirical values are transformed through transparent indicators such as relative growth, CAGR, growth multipliers, stock-flow ratios, concentration ratios, and HHI.
Methodological description and application in the paper listing these specific indicators used to summarize public data on AI investment, adoption, robots, compute, and labour-market reallocation.
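The indicators listed above have standard textbook definitions, sketched below. The function names are illustrative, market shares are assumed to be fractions summing to 1, and the stock-flow ratio is assumed to be a simple stock divided by a per-period flow, since the excerpt does not give the paper's exact formulas.

```python
def relative_growth(start, end):
    """Change relative to the starting value, e.g. 0.5 means +50%."""
    return (end - start) / start


def growth_multiplier(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def cagr(start, end, years):
    """Compound annual growth rate over the given number of years."""
    return (end / start) ** (1.0 / years) - 1.0


def stock_flow_ratio(stock, flow):
    """Assumed definition: accumulated stock divided by per-period flow."""
    return stock / flow


def concentration_ratio(shares, n):
    """Combined share of the n largest entities (CR_n)."""
    return sum(sorted(shares, reverse=True)[:n])


def hhi(shares):
    """Herfindahl-Hirschman index: sum of squared shares (0 to 1 scale)."""
    return sum(s ** 2 for s in shares)


# Hypothetical values: a quantity that quadruples in two years doubles each year.
print(cagr(100.0, 400.0, 2))  # 1.0, i.e. 100% per year
print(hhi([0.5, 0.3, 0.2]))   # about 0.38
```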
The study uses a conceptual-empirical quantitative diagnostic design rather than a causal econometric model.
Explicit methodological statement in the paper describing the design choice and rejecting causal econometric modeling in favor of diagnostics using public institutional data and transparent indicators.
The agentic economy is not yet a completed global order, but its transition pressure is measurable enough to require a distinct economic vocabulary, reproducible diagnostics, and future sector-level measurement.
Synthesis of diagnostic indicators (AI investment/adoption trends, robot stock, compute-energy coupling, labour reallocation measures) showing measurable transition pressures; conclusion drawn from the conceptual-empirical diagnostic.
Following PRISMA 2020 guidelines, searches across Google Scholar, Web of Science, Scopus, ScienceDirect, and CNKI yielded 1,562 initial records, of which 21 studies published between 2019 and 2026 met inclusion criteria.
Methodological description of the systematic literature review reported in the paper: initial records = 1,562; included studies = 21; publication years 2019–2026.
Small and medium-sized enterprises (SMEs) constitute over 98.5% of businesses in many economies including China.
Descriptive statistic reported in the paper's background/intro; source of the statistic not specified within the summary provided.
This study analyzes developments through April 2026.
Explicit timeframe statement in the paper's summary/introduction.
The authors provide source code for their framework on GitHub to encourage further research.
Statement in the paper that the source code is available on GitHub; verifiable by visiting the repository (link not provided in the excerpt).
Heuristics such as TSP and PNN are commonly used as inexpensive approximations for customer trajectories.
Descriptive claim about common practice cited in the paper; used as motivation for proposing the RL approach (no quantitative survey evidence provided in the excerpt).
We conducted a randomized controlled experiment in which participants—analogs of early-career knowledge workers—were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance.
Statement of experimental design in the paper (randomized controlled experiment assigning participants to either traditional resources or LLM assistance; participants described as analogs of early-career knowledge workers).
Results remain robust across checks.
Robustness checks reported by the authors (unspecified in abstract) that do not overturn the main findings.
China's 14th Five Year Plan (FYP) is used as a quasi-natural experiment / strategic policy shock to study effects of AI washing.
Research design leverages the FYP announcement as an exogenous policy shock in a difference-in-differences framework (design claim; no sample size in abstract).
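The logic of a difference-in-differences design can be sketched in its simplest two-group, two-period form. This is an illustration only: the paper's actual specification (controls, fixed effects, treatment definition) is not given in the excerpt, and the function name and numbers are hypothetical.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group means.

    The treated group's pre-to-post change minus the control group's change;
    this attributes the extra change to the policy shock only if both groups
    would otherwise have followed parallel trends.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcomes: treated firms rise by 3, control firms by 1.
print(did_estimate([1, 1], [4, 4], [2, 2], [3, 3]))  # 2
```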
AI washing is identified as the residual between AI narrative intensity and patent output.
Constructed a firm-level AI washing proxy by regressing AI narrative intensity on patent output and using the residual; described as the study's measurement approach (no sample size reported in the abstract).
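A residual-based proxy of this kind can be sketched with a one-regressor ordinary least squares fit. The sketch assumes a single predictor and no controls, which may be simpler than the paper's specification; the function name and firm values are hypothetical.

```python
def ols_residuals(x, y):
    """Residuals from regressing y on x (with intercept) by least squares.

    Here x stands for patent output and y for AI narrative intensity, so a
    positive residual marks a firm that talks about AI more than its patent
    record would predict.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical firms: patent counts and narrative-intensity scores.
patents = [0.0, 1.0, 2.0]
narrative = [0.0, 3.0, 0.0]
print(ols_residuals(patents, narrative))  # [-1.0, 2.0, -1.0]
```

By construction the residuals sum to zero, so the proxy ranks firms relative to one another rather than against an absolute benchmark.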
Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.
Prescriptive conclusion derived from the observed cross-configuration heterogeneity in the paper's empirical results.
Framework identity accounts for more of the between-configuration variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%.
Variance decomposition / explained-variance analysis reported for 'mean turns' across configurations (reported percentages: 64% vs 10%).
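One way to obtain explained-variance shares like these is a two-way sum-of-squares decomposition over a balanced grid of configuration means. The excerpt does not say which decomposition the authors used, so the method, the function name, and the numbers below are assumptions for illustration.

```python
def variance_shares(grid):
    """Share of between-configuration variance per factor.

    grid[f][l] holds one summary value (e.g. mean turns) for framework f
    paired with LLM l; every framework is assumed to be run with every LLM.
    Returns (framework_share, llm_share); the remainder is interaction.
    """
    n_f = len(grid)
    n_l = len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_f * n_l)
    total = sum((v - grand) ** 2 for row in grid for v in row)
    row_means = [sum(row) / n_l for row in grid]
    col_means = [sum(grid[f][l] for f in range(n_f)) / n_f for l in range(n_l)]
    # Main-effect sums of squares for each factor.
    ss_framework = n_l * sum((m - grand) ** 2 for m in row_means)
    ss_llm = n_f * sum((m - grand) ** 2 for m in col_means)
    return ss_framework / total, ss_llm / total


# Hypothetical mean-turn values: rows are frameworks, columns are LLMs.
print(variance_shares([[0.0, 2.0], [4.0, 6.0]]))  # (0.8, 0.2)
```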
The analysis separates framework effects from LLM effects by holding each layer fixed in turn and measures one behavior–outcome effect per configuration to examine agreement across configurations.
Methods description in the paper: experimental design holding LLM or framework fixed to disentangle effects.