Evidence (13870 claims)

| Topic | Claims |
|---|---|
| Adoption | 8467 |
| Productivity | 7558 |
| Governance | 6805 |
| Human-AI Collaboration | 6363 |
| Org Design | 4132 |
| Innovation | 4065 |
| Labor Markets | 3526 |
| Skills & Training | 2945 |
| Inequality | 2066 |

Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 196 | 98 | 892 | 1984 |
| Governance & Regulation | 817 | 394 | 188 | 121 | 1544 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 627 | 233 | 123 | 96 | 1088 |
| Research Productivity | 411 | 123 | 56 | 332 | 933 |
| Output Quality | 467 | 178 | 59 | 47 | 751 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 167 | 122 | 24 | 496 |
| Task Allocation | 207 | 64 | 71 | 32 | 379 |
| Skill Acquisition | 165 | 59 | 60 | 17 | 301 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 52 | 107 | 13 | 279 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 150 | 48 | 26 | 3 | 227 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 63 | 20 | 12 | 184 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 93 | 21 | 13 | 19 | 148 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 17 | 7 | 3 | 59 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
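
As a reading aid, the matrix can be converted into direction shares per outcome. The sketch below is an illustration and not part of the source: it copies three rows from the table and divides each direction count by the sum of the four direction columns. It uses that sum rather than the Total column because the two differ in some rows (the Firm Productivity columns add to 598 while its Total reads 604). The names `ROWS`, `DIRECTIONS`, and `direction_shares` are hypothetical.

```python
# Three rows copied from the matrix above: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (435, 55, 88, 20),
    "Inequality Measures": (44, 122, 49, 6),
    "Job Displacement": (12, 80, 20, 1),
}
DIRECTIONS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the four listed direction counts."""
    total = sum(counts)
    return {name: count / total for name, count in zip(DIRECTIONS, counts)}


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    print(f"{outcome}: {shares['positive']:.1%} positive, "
          f"{shares['negative']:.1%} negative")
```

Read this way, the Firm Productivity claims are mostly positive, while the Job Displacement claims are mostly negative.
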
We evaluate 18 proprietary and open-source agents on 538 tasks.
Author-reported evaluation methodology and scale (number of agents and tasks) as stated in abstract.
Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion.
Author description of interaction protocol structure (design specification in paper abstract).
DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges.
Author statement in abstract describing the protocol (design/method contribution).
DeskCraft covers professional creative software across design, video, audio, and 3D creation.
Author statement in abstract listing covered software domains.
DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps.
Benchmark design described in abstract (explicit statement that long-horizon tasks require over 50 execution steps).
We introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration.
Author statement describing the new benchmark (benchmark design and scope described in paper).
Taiji demonstrates robust scalability in web-scale environments.
Assertion supported by deployment at large scale and claimed daily user numbers (paper's scalability claim).
Taiji yields significant commercial revenue.
Claim in the paper that the deployed system produces notable commercial revenue (no quantitative revenue figures provided in abstract).
Taiji currently serves over 400 million users daily.
Operational usage statistic reported in the paper (daily active users served).
Taiji has been deployed on Kuaishou's advertising platform since May 2026.
Deployment statement reported in the paper (operational deployment date claimed).
Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji.
Empirical evaluation section claiming extensive offline experiments and online A/B testing validate the system (no numeric details in abstract).
Theoretically, POPO achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences.
Theoretical analysis/claims in the paper asserting optimality of the proposed POPO method (presumably proofs or propositions).
To resolve the RL alignment issue, Taiji proposes Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights.
Algorithmic contribution described in the paper introducing POPO as an RL method for adaptive reward-weighting.
To overcome the SFT bottleneck, Taiji utilizes reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific chain-of-thought (CoT) data.
Methodological description in the paper detailing reverse-engineered reasoning and open-ended rejection sampling as data-generation techniques.
We present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems.
Paper's core contribution: proposal of a new framework described in the manuscript.
Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry.
Framing statement in the paper's introduction/abstract asserting industry trend (literature/industry observation).
The paper constructs firm-level indicators of artificial intelligence and new quality productive forces for new energy vehicle firms.
Authors state they constructed firm-level indicators as part of their empirical approach on the Yangtze River Delta panel dataset.
Artificial intelligence affects firms' new quality productive forces through improvement of innovation output.
Mechanism tests reported by the authors showing empirical evidence that AI improves innovation output (e.g., measured innovation outcomes) which is linked to higher new quality productive forces.
Artificial intelligence affects firms' new quality productive forces through optimization of R&D personnel structure.
Mechanism tests reported by the authors using the constructed indicators and panel data; empirical evidence cited that links AI to changes in R&D personnel structure which in turn link to new quality productive forces.
The promoting effect of artificial intelligence on new quality productive forces is more pronounced among small-sized enterprises.
Heterogeneity tests by firm size in the panel data; authors report stronger positive effects for small-sized firms.
The promoting effect of artificial intelligence on new quality productive forces is more pronounced in Jiangsu and Zhejiang provinces.
Heterogeneity tests on the Yangtze River Delta panel data comparing regional subsamples; authors report stronger positive effects in Jiangsu and Zhejiang.
The positive effect of artificial intelligence on firms' new quality productive forces remains robust after addressing endogeneity concerns and conducting robustness checks.
Authors report endogeneity-corrected estimations and multiple robustness checks on the same panel dataset and constructed firm-level indicators; specific endogeneity correction methods and robustness checks are not detailed in the excerpt.
Artificial intelligence significantly promotes the growth of new quality productive forces in new energy vehicle firms.
Panel data analysis of new energy vehicle firms in the Yangtze River Delta from 2001 to 2023; firm-level indicators of artificial intelligence and new quality productive forces constructed; regression estimation showing a significant positive effect.
Proactive, edge-side prompt optimization can substantially reduce inference costs without sacrificing coding quality.
Aggregate experimental results on token reductions and preserved/improved task accuracy reported in the paper.
Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends.
Head-to-head experimental comparison reported in the paper between the proposed middleware and LLMLingua-2 (matched compression rates) measuring OckScore.
Ablation studies indicate that the gains come primarily from the structural rewriting stage rather than simple function-name extraction.
Ablation experiments reported in the paper comparing full rewrite pipeline versus variants (e.g., function-name extraction only).
Prompt compression via the middleware preserves or improves task accuracy on the evaluated benchmark.
Reported task accuracy comparisons on OMH-Polyglot before and after applying middleware across evaluated backends.
The middleware reduces total tokens (prompt + completion) by up to 18.8 percent.
Empirical measurements reported in the paper comparing total token usage (prompt + completion) with and without middleware.
Across three commercial LLM backends, the middleware reduces prompt tokens by 34–47 percent.
Empirical results reported from experiments on OMH-Polyglot across three commercial LLM backends (aggregate token counts before vs. after middleware).
We introduce a pre-flight, edge-side prompt-rewriting middleware that runs locally (using Llama 3.2 (3B)) to perform cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original.
System implementation and design described in the paper (local Llama 3.2 (3B) model, translation, rewriting, and rewrite-with-fallback mechanism).
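
The rewrite-with-fallback safeguard described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `TASK:` output format, the whitespace-based `count_tokens` proxy, and the function names are all hypothetical, and the local model call is represented by an arbitrary `rewrite` callable.

```python
import re
from typing import Callable

# Hypothetical compact format: a valid rewrite must start with "TASK: ".
COMPACT_FORMAT = re.compile(r"^TASK: .+", re.DOTALL)


def count_tokens(text: str) -> int:
    """Crude whitespace proxy for a token count (an assumption, not a real tokenizer)."""
    return len(text.split())


def rewrite_with_fallback(original: str, rewrite: Callable[[str], str]) -> str:
    """Return the rewritten prompt only if it validates and is not larger."""
    try:
        candidate = rewrite(original)
    except Exception:
        return original  # the rewriter failed: keep the original prompt
    if not COMPACT_FORMAT.match(candidate):
        return original  # the rewrite does not match the expected format
    if count_tokens(candidate) > count_tokens(original):
        return original  # the rewrite grew the prompt
    return candidate


verbose = "Could you please write me a function that reverses a string in Python, thanks"
print(rewrite_with_fallback(verbose, lambda p: "TASK: reverse a string in Python"))
print(rewrite_with_fallback(verbose, lambda p: "TASK: " + p + " and explain every line"))
```

The first call returns the compact rewrite; the second falls back to the original prompt because the candidate is longer.
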
Addressing these issues entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.
Specific prescriptive recommendations listed by the authors as part of the proposed research paradigm; offered as proposed methods rather than empirically validated interventions in the excerpt.
The paper calls for a non-solipsistic research paradigm that treats interdependence as a core design principle rather than approaching cooperation as a task to solve.
Normative/research-agenda claim made by the authors; stated in the paper as a recommended change in research approach without empirical tests.
Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence.
Prescriptive/theoretical recommendation by the authors; framed as necessary to address the earlier-claimed train-test-deploy gap, without empirical demonstration in the excerpt.
AI's central challenge is shifting from capability to coexistence.
Author's conceptual assertion in the paper; no empirical data, sample, or experiment reported.
The audit detects significant engagement premiums for three exploitation-related dimensions: performative labor, emotional bait, and privacy violations.
Reported aggregated analysis across labeled dimensions showing positive associations of these dimensions with views; privacy violations mentioned in summary of findings (specific effect size for privacy violations not reported in provided text).
Within-channel analyses indicate median view boosts of +56.0% for performative content (FDR-corrected p < 0.001), with effects holding in same-year robustness checks (p = 0.030).
Within-channel analyses for performative-content label showing median percent boost, FDR-corrected significance, and robustness check restricting comparisons to same-year videos.
Within-channel analyses indicate median view boosts of +65.6% for emotional bait content (FDR-corrected p < 0.001).
Within-channel (fixed-effects or matched) comparisons of emotional-bait-labeled vs. other videos, with multiple-testing correction (FDR); reported median percent boost and p-value.
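
The excerpt does not say which FDR procedure produced the corrected p-values. Assuming the common Benjamini-Hochberg step-up method, the adjustment can be sketched as below. The function name and the example p-values are hypothetical and are not taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per tested content dimension.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

A dimension is then declared significant when its adjusted p-value falls below the chosen FDR level.
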
A mixed-effects regression controlling for channel-level variation shows that a one-unit increase in exploitation score yields a 4.4× increase in views (p < 0.001).
Mixed-effects regression analysis with channel-level random effects on the full video dataset; reported multiplicative effect and p-value.
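
A multiplicative effect of this kind usually arises from a model fitted to log-transformed views. Assuming the outcome was the natural log of views (the excerpt does not state the transformation), the reported 4.4× figure corresponds to a coefficient of about 1.48, as the short sketch below shows. The names `REPORTED_MULTIPLIER`, `beta`, and `view_multiplier` are hypothetical.

```python
import math

# Reported effect of a one-unit increase in exploitation score.
REPORTED_MULTIPLIER = 4.4

# Under a log-linear model, the coefficient is the log of the multiplier.
beta = math.log(REPORTED_MULTIPLIER)


def view_multiplier(delta_score: float) -> float:
    """Implied multiplicative change in views for a given change in score."""
    return math.exp(beta * delta_score)


print(f"implied coefficient: {beta:.3f}")
print(f"half-unit increase: {view_multiplier(0.5):.2f}x views")
```

Under the same assumption, effects compound multiplicatively, so a half-unit increase implies roughly a 2.1× change in views, not 2.2×.
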
Exploitation scores correlate with view counts (Spearman ρ = 0.229, p < 10^{-50}).
Spearman rank correlation computed between exploitation scores and view counts across the study dataset (5,051 videos).
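
Spearman's ρ is the Pearson correlation of the two variables' ranks, with tied values sharing their average rank. The sketch below is a dependency-free illustration of that definition. The helper names and the toy scores and view counts are hypothetical and do not come from the study dataset.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: exploitation scores and view counts for four videos.
scores = [0.1, 0.4, 0.2, 0.9]
views = [120, 5000, 300, 90000]
print(spearman_rho(scores, views))
```

Because only ranks enter the calculation, the heavy right tail typical of view counts does not dominate the statistic.
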
A multi-annotator validation study (N=107) shows strong agreement with human judgment: macro-average F1 = 0.911 and high sensitivity for overall exploitation risk (recall = 0.960, F1 = 0.793).
Multi-annotator validation study with 107 human annotations comparing model/weak-supervision labels to human judgments; reported classification metrics.
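
The reported metrics follow standard definitions, sketched below for illustration. Macro-average F1 is the unweighted mean of per-class F1 scores. If the recall (0.960) and F1 (0.793) for overall exploitation risk refer to the same binary class, which the excerpt does not confirm, they jointly imply a precision of roughly 0.68. All function names and the toy labels are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_f1(y_true, y_pred, label):
    """F1 for one label, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1_score(precision, recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)


def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2 * recall - f1)


# Toy labels, not the study's annotations.
human = ["risk", "risk", "safe", "safe"]
model = ["risk", "safe", "safe", "safe"]
print(f"macro F1 on toy labels: {macro_f1(human, model):.3f}")
print(f"precision implied by the reported figures: {implied_precision(0.793, 0.960):.3f}")
```

A high recall paired with a lower implied precision would be consistent with a labeler tuned to avoid missing risky content at the cost of some false alarms.
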
The positive influence of industrial robot application on MVCR is especially significant in low-technology industries.
Heterogeneity/subsample analysis reported in the paper showing larger estimated effects in low-tech industry groups.
The positive influence of industrial robot application on MVCR is especially significant in downstream segments of the value chain.
Heterogeneity/subsample analysis reported in the paper showing larger estimated effects in downstream value-chain segments.
The positive influence of industrial robot application on MVCR is particularly significant in privately owned businesses.
Heterogeneity/subsample analysis reported in the paper showing stronger estimated effects for privately owned firms compared with other ownership types.
Industrial robot application positively impacts manufacturing value chain resilience (MVCR).
Empirical assessment using the constructed industrial-robot application indices and MVCR index (regression/empirical analysis on Chinese A-share listed firms).
The effect of BDTA on improving CEE is more significant in enterprises with low market concentration.
Heterogeneity/subsample analysis on the listed manufacturing firm data (2010–2023) showing larger BDTA→CEE effects in firms operating in markets with lower concentration.
The effect of BDTA on improving CEE is more significant in high-tech enterprises.
Heterogeneity/subsample analysis reported on listed manufacturing firms (2010–2023) indicating stronger BDTA→CEE effects among high-tech enterprises.
The effect of BDTA on improving CEE is more significant in non-state-owned enterprises.
Heterogeneity analysis (subsample analysis) reported by authors using the 2010–2023 listed manufacturing firm sample, showing stronger BDTA→CEE effects in non-state-owned firms compared to state-owned firms.
BDTA improves CEE of manufacturing enterprises by enhancing internal control quality.
Theoretical channel analysis and empirical mediation/mechanism tests on listed manufacturing firms (2010–2023) showing that internal control quality mediates the BDTA→CEE link.
BDTA improves CEE of manufacturing enterprises by fostering green innovation.
Theoretical channel analysis plus empirical mediation/mechanism tests using the same sample (listed Chinese manufacturing firms, 2010–2023) that show green innovation mediates the BDTA→CEE relationship.
Big data technology application (BDTA) can improve carbon emission efficiency (CEE) of manufacturing enterprises.
Empirical panel regression analysis on listed companies in China's manufacturing industry from 2010 to 2023; authors report baseline regressions showing a positive relationship between BDTA and CEE.