The Commonplace

Evidence (13870 claims)

Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 749 196 98 892 1984
Governance & Regulation 817 394 188 121 1544
Organizational Efficiency 771 189 124 83 1177
Technology Adoption Rate 627 233 123 96 1088
Research Productivity 411 123 56 332 933
Output Quality 467 178 59 47 751
Decision Quality 320 174 75 42 618
Firm Productivity 435 55 88 20 604
AI Safety & Ethics 214 276 65 33 593
Market Structure 178 167 122 24 496
Task Allocation 207 64 71 32 379
Skill Acquisition 165 59 60 17 301
Innovation Output 203 27 43 18 292
Employment Level 105 52 107 13 279
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 116 63 42 11 232
Firm Revenue 150 48 26 3 227
Inequality Measures 44 122 49 6 221
Task Completion Time 169 29 8 12 219
Worker Satisfaction 89 63 20 12 184
Error Rate 69 92 10 2 173
Regulatory Compliance 76 68 14 5 163
Training Effectiveness 93 21 13 19 148
Wages & Compensation 77 36 25 6 144
Automation Exposure 51 54 22 12 142
Team Performance 86 17 27 9 140
Developer Productivity 94 17 14 6 132
Job Displacement 12 80 20 1 113
Hiring & Recruitment 51 7 8 3 69
Creative Output 31 17 7 3 59
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
We evaluate 18 proprietary and open-source agents on 538 tasks.
Author-reported evaluation methodology and scale (number of agents and tasks) as stated in abstract.
high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... evaluation sample size (agents and tasks)
Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion.
Author description of interaction protocol structure (design specification in paper abstract).
high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... coverage of interaction types (mid-turn and post-turn) in the protocol
DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges.
Author statement in abstract describing the protocol (design/method contribution).
high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... formalization of human-agent interaction protocol
DeskCraft covers professional creative software across design, video, audio, and 3D creation.
Author statement in abstract listing covered software domains.
high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... breadth of software domains covered by the benchmark
DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps.
Benchmark design described in abstract (explicit statement that long-horizon tasks require over 50 execution steps).
high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... task difficulty / horizon length (number of execution steps)
We introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration.
Author statement describing the new benchmark (benchmark design and scope described in paper).
high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... availability of a benchmark for long-horizon workflows and human-agent collabora...
Taiji demonstrates robust scalability in web-scale environments.
Assertion supported by deployment at large scale and claimed daily user numbers (paper's scalability claim).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... system scalability in web-scale production
Taiji yields significant commercial revenue.
Claim in the paper that the deployed system produces notable commercial revenue (no quantitative revenue figures provided in abstract).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... commercial revenue impact
Taiji currently serves over 400 million users daily.
Operational usage statistic reported in the paper (daily active users served).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... number of users served daily
Taiji has been deployed on Kuaishou's advertising platform since May 2026.
Deployment statement reported in the paper (operational deployment date claimed).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... deployment/adoption on a major platform
Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji.
Empirical evaluation section claiming extensive offline experiments and online A/B testing validate the system (no numeric details in abstract).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... effectiveness of Taiji on unspecified evaluation metrics (offline and online)
Theoretically, POPO achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences.
Theoretical analysis/claims in the paper asserting optimality of the proposed POPO method (presumably proofs or propositions).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... optimality of trade-off between semantic knowledge and collaborative ID preferen...
To resolve the RL alignment issue, Taiji proposes Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights.
Algorithmic contribution described in the paper introducing POPO as an RL method for adaptive reward-weighting.
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... adaptive adjustment of cross-domain reward weights during RL
To overcome the SFT bottleneck, Taiji utilizes reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific chain-of-thought (CoT) data.
Methodological description in the paper detailing reverse-engineered reasoning and open-ended rejection sampling as data-generation techniques.
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... quality of generated domain-specific CoT data
We present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems.
Paper's core contribution: proposal of a new framework described in the manuscript.
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... availability of an LLM-as-Enhancer framework
Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry.
Framing statement in the paper's introduction/abstract asserting industry trend (literature/industry observation).
high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... industry adoption of LLM-based recommender approaches
The paper constructs firm-level indicators of artificial intelligence and new quality productive forces for new energy vehicle firms.
Authors state they constructed firm-level indicators as part of their empirical approach on the Yangtze River Delta panel dataset.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... construction of measurement indicators (methodological contribution)
Artificial intelligence affects firms' new quality productive forces through improvement of innovation output.
Mechanism tests reported by the authors provide empirical evidence that AI improves innovation output (e.g., measured innovation outcomes), which in turn is linked to higher new quality productive forces.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (mediated by innovation output)
Artificial intelligence affects firms' new quality productive forces through optimization of R&D personnel structure.
Mechanism tests reported by the authors using the constructed indicators and panel data; the cited empirical evidence links AI to changes in R&D personnel structure, which in turn are linked to new quality productive forces.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (mediated by R&D personnel structure)
The promoting effect of artificial intelligence on new quality productive forces is more pronounced among small-sized enterprises.
Heterogeneity tests by firm size in the panel data; authors report stronger positive effects for small-sized firms.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (firm-size heterogeneity)
The promoting effect of artificial intelligence on new quality productive forces is more pronounced in Jiangsu and Zhejiang provinces.
Heterogeneity tests on the Yangtze River Delta panel data comparing regional subsamples; authors report stronger positive effects in Jiangsu and Zhejiang.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (regional heterogeneity)
The positive effect of artificial intelligence on firms' new quality productive forces remains robust after addressing endogeneity concerns and conducting robustness checks.
Authors report endogeneity-corrected estimations and multiple robustness checks on the same panel dataset and constructed firm-level indicators; specific endogeneity correction methods and robustness checks are not detailed in the excerpt.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces
Artificial intelligence significantly promotes the growth of new quality productive forces in new energy vehicle firms.
Panel data analysis of new energy vehicle firms in the Yangtze River Delta from 2001 to 2023; firm-level indicators of artificial intelligence and new quality productive forces constructed; regression estimation showing a significant positive effect.
high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces
Proactive, edge-side prompt optimization can substantially reduce inference costs without sacrificing coding quality.
Aggregate experimental results on token reductions and preserved/improved task accuracy reported in the paper.
high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... inference cost (via token usage) and coding quality
Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends.
Head-to-head experimental comparison reported in the paper between the proposed middleware and LLMLingua-2 (matched compression rates) measuring OckScore.
high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... OckScore (a task-specific performance metric)
Ablation studies indicate that the gains come primarily from the structural rewriting stage rather than simple function-name extraction.
Ablation experiments reported in the paper comparing full rewrite pipeline versus variants (e.g., function-name extraction only).
high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... source of performance/token-reduction gains (rewriting vs. function-name extract...
Prompt compression via the middleware preserves or improves task accuracy on the evaluated benchmark.
Reported task accuracy comparisons on OMH-Polyglot before and after applying middleware across evaluated backends.
The middleware reduces total tokens (prompt + completion) by up to 18.8 percent.
Empirical measurements reported in the paper comparing total token usage (prompt + completion) with and without middleware.
high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... total token count (prompt + completion)
Across three commercial LLM backends, the middleware reduces prompt tokens by 34–47 percent.
Empirical results reported from experiments on OMH-Polyglot across three commercial LLM backends (aggregate token counts before vs. after middleware).
We introduce a pre-flight, edge-side prompt-rewriting middleware that runs locally (using Llama 3.2 (3B)) to perform cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original.
System implementation and design described in the paper (local Llama 3.2 (3B) model, translation, rewriting, and rewrite-with-fallback mechanism).
high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... ability to produce an optimized prompt not larger than the original (prompt size...
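The rewrite-with-fallback safeguard described in this entry can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's implementation: `rewrite_with_fallback`, the `rewriter` callable, the whitespace-based `count_tokens`, and the `Task:` header checked by the regex are all hypothetical stand-ins. The paper runs a local Llama 3.2 (3B) model, whose interface and validation rules are not shown in the excerpt.

```python
import re
from typing import Callable

# Hypothetical validity rule: a rewrite must contain a non-empty "Task:" line.
VALID_REWRITE = re.compile(r"^Task:\s*\S", re.MULTILINE)


def count_tokens(text: str) -> int:
    """Crude whitespace token count (assumption); a real system would use
    the target backend's tokenizer."""
    return len(text.split())


def rewrite_with_fallback(prompt: str, rewriter: Callable[[str], str]) -> str:
    """Return the rewritten prompt only if it passes the regex check and is
    not larger than the original; otherwise return the original unchanged."""
    try:
        candidate = rewriter(prompt)
    except Exception:
        return prompt  # any rewriter failure falls back to the original
    if not VALID_REWRITE.search(candidate):
        return prompt  # malformed rewrite: fall back
    if count_tokens(candidate) > count_tokens(prompt):
        return prompt  # rewrite grew the prompt: fall back
    return candidate
```

Because every failure path returns the original prompt, the output can never exceed the input under the chosen token counter, which is the property the entry describes.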
Addressing these issues entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.
Specific prescriptive recommendations listed by the authors as part of the proposed research paradigm; offered as proposed methods rather than empirically validated interventions in the excerpt.
high positive Solipsistic Superintelligence is Unlikely to be Cooperative recommended design and evaluation practices for AI (dynamic testbeds, institutio...
The paper calls for a non-solipsistic research paradigm that treats interdependence as a core design principle rather than approaching cooperation as a task to solve.
Normative/research-agenda claim made by the authors; stated in the paper as a recommended change in research approach without empirical tests.
high positive Solipsistic Superintelligence is Unlikely to be Cooperative research paradigm orientation (non-solipsistic vs. solipsistic)
Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence.
Prescriptive/theoretical recommendation by the authors; framed as necessary to address the earlier-claimed train-test-deploy gap, without empirical demonstration in the excerpt.
high positive Solipsistic Superintelligence is Unlikely to be Cooperative ability of AI to close the train-test-deploy gap via cooperative participation
AI's central challenge is shifting from capability to coexistence.
Author's conceptual assertion in the paper; no empirical data, sample, or experiment reported.
high positive Solipsistic Superintelligence is Unlikely to be Cooperative the primary challenge for AI development (capability vs. coexistence)
The audit detects significant engagement premiums for three exploitation-related dimensions: performative labor, emotional bait, and privacy violations.
Reported aggregated analysis across labeled dimensions showing positive associations of these dimensions with views; privacy violations mentioned in summary of findings (specific effect size for privacy violations not reported in provided text).
high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts (association with labeled exploitation dimensions)
Within-channel analyses indicate median view boosts of +56.0% for performative content (FDR-corrected p < 0.001), with effects holding in same-year robustness checks (p = 0.030).
Within-channel analyses for performative-content label showing median percent boost, FDR-corrected significance, and robustness check restricting comparisons to same-year videos.
high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts (median percent boost)
Within-channel analyses indicate median view boosts of +65.6% for emotional bait content (FDR-corrected p < 0.001).
Within-channel (fixed-effects or matched) comparisons of emotional-bait-labeled vs. other videos, with multiple-testing correction (FDR); reported median percent boost and p-value.
high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts (median percent boost)
A mixed-effects regression controlling for channel-level variation shows that a one-unit increase in exploitation score yields a 4.4× increase in views (p < 0.001).
Mixed-effects regression analysis with channel-level random effects on the full video dataset; reported multiplicative effect and p-value.
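A multiplicative effect such as the 4.4× above is what a regression on log-transformed views yields once its coefficient is exponentiated. The excerpt does not state that the authors log-transformed views, so that is an assumption in this sketch, and the function names are hypothetical.

```python
import math


def multiplier_from_log_coef(beta: float, units: float = 1.0) -> float:
    """Multiplicative change in the outcome for a `units` increase in the
    predictor, assuming the model is fit on log(outcome)."""
    return math.exp(beta * units)


def log_coef_from_multiplier(multiplier: float) -> float:
    """Coefficient on the log scale implied by a per-unit multiplier."""
    return math.log(multiplier)


# Coefficient implied by the reported 4.4x effect (assumes a log-views outcome).
beta = log_coef_from_multiplier(4.4)

# Under the same assumption, effects compound multiplicatively across units.
two_unit_multiplier = multiplier_from_log_coef(beta, units=2)
```

Under this reading, a two-unit increase in exploitation score would correspond to 4.4 squared, not 8.8, because effects multiply on the original scale.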
Exploitation scores correlate with view counts (Spearman ρ = 0.229, p < 10^{-50}).
Spearman rank correlation computed between exploitation scores and view counts across the study dataset (5,051 videos).
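For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation of the two variables' ranks. The sketch below is a plain-Python illustration with made-up toy data; it is not the study's code, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data only (not from the study): exploitation scores and view counts.
toy_scores = [0, 1, 2, 3, 4]
toy_views = [120, 90, 400, 300, 5000]
toy_rho = spearman_rho(toy_scores, toy_views)
```

Because only ranks matter, a heavy-tailed variable like view counts does not dominate the statistic the way it would in a Pearson correlation on raw values.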
A multi-annotator validation study (N=107) shows strong agreement with human judgment: macro-average F1 = 0.911 and high sensitivity for overall exploitation risk (recall = 0.960, F1 = 0.793).
Multi-annotator validation study with 107 human annotations comparing model/weak-supervision labels to human judgments; reported classification metrics.
high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... classification performance for exploitation detection (F1, recall)
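The metrics in this entry are related by standard formulas: F1 is the harmonic mean of precision and recall, and macro-average F1 is the unweighted mean of per-class F1 scores. The sketch below shows those relationships. It assumes the reported recall (0.960) and F1 (0.793) for overall exploitation risk refer to the same class and the standard F1 definition, which the excerpt does not confirm; the function names are hypothetical.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall are inconsistent with a valid precision")
    return f1 * recall / denom


def macro_f1(per_class_f1: Sequence[float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Precision implied by the reported recall and F1, under the assumption above.
precision_overall = implied_precision(f1=0.793, recall=0.960)
```

Under that assumption, the implied precision is roughly 0.68, which is consistent with the entry's framing of the overall-risk label as tuned for high sensitivity.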
The positive influence of industrial robot application on MVCR is especially significant in low-technology industries.
Heterogeneity/subsample analysis reported in the paper showing larger estimated effects in low-tech industry groups.
high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) (interaction/heterogeneity by indust...
The positive influence of industrial robot application on MVCR is especially significant in downstream segments of the value chain.
Heterogeneity/subsample analysis reported in the paper showing larger estimated effects in downstream value-chain segments.
high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) (interaction/heterogeneity by value-...
The positive influence of industrial robot application on MVCR is particularly significant in privately owned businesses.
Heterogeneity/subsample analysis reported in the paper showing stronger estimated effects for privately owned firms compared with other ownership types.
high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) (interaction/heterogeneity by owners...
Industrial robot application positively impacts manufacturing value chain resilience (MVCR).
Empirical assessment using the constructed industrial-robot application indices and MVCR index (regression/empirical analysis on Chinese A-share listed firms).
high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR)
The effect of BDTA on improving CEE is more significant in enterprises with low market concentration.
Heterogeneity/subsample analysis on the listed manufacturing firm data (2010–2023) showing larger BDTA→CEE effects in firms operating in markets with lower concentration.
high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) (heterogeneous treatment effect by market conce...
The effect of BDTA on improving CEE is more significant in high-tech enterprises.
Heterogeneity/subsample analysis reported on listed manufacturing firms (2010–2023) indicating stronger BDTA→CEE effects among high-tech enterprises.
high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) (heterogeneous treatment effect by firm technol...
The effect of BDTA on improving CEE is more significant in non-state-owned enterprises.
Heterogeneity analysis (subsample analysis) reported by authors using the 2010–2023 listed manufacturing firm sample, showing stronger BDTA→CEE effects in non-state-owned firms compared to state-owned firms.
high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) (heterogeneous treatment effect by ownership)
BDTA improves CEE of manufacturing enterprises by enhancing internal control quality.
Theoretical channel analysis and empirical mediation/mechanism tests on listed manufacturing firms (2010–2023) showing that internal control quality mediates the BDTA→CEE link.
high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) via internal control quality (mediator)
BDTA improves CEE of manufacturing enterprises by fostering green innovation.
Theoretical channel analysis plus empirical mediation/mechanism tests using the same sample (listed Chinese manufacturing firms, 2010–2023) showing that green innovation mediates the BDTA→CEE relationship.
high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) via green innovation (mediator)
Big data technology application (BDTA) can improve carbon emission efficiency (CEE) of manufacturing enterprises.
Empirical panel regression analysis on listed companies in China's manufacturing industry from 2010 to 2023; authors report baseline regressions showing a positive relationship between BDTA and CEE.
high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE)