Evidence (13,870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
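
The direction counts above are easier to compare as shares of each outcome's total. The following is a minimal sketch, assuming the rows are tab-separated as displayed and that a dash cell means zero claims; `parse_row` and `positive_share` are illustrative names, not part of the source. Shares are taken against the listed Total because the four direction columns do not always add up to it (for "Other" they sum to 1,935 against a listed total of 1,984).

```python
from typing import Dict, Tuple

COLUMNS = ["Positive", "Negative", "Mixed", "Null", "Total"]


def parse_row(line: str) -> Tuple[str, Dict[str, int]]:
    """Split one tab-separated matrix row into (outcome, counts)."""
    outcome, *cells = line.rstrip("\n").split("\t")
    counts = {
        col: 0 if cell.strip() == "—" else int(cell)  # dash cell assumed to mean zero
        for col, cell in zip(COLUMNS, cells)
    }
    return outcome, counts


def positive_share(counts: Dict[str, int]) -> float:
    """Fraction of an outcome's listed total that is coded positive."""
    return counts["Positive"] / counts["Total"]


# Three rows copied from the matrix above.
rows = [
    "Firm Productivity\t435\t55\t88\t20\t604",
    "Job Displacement\t12\t80\t20\t1\t113",
    "Labor Share of Income\t17\t17\t17\t—\t51",
]

for line in rows:
    outcome, counts = parse_row(line)
    print(f"{outcome}: {positive_share(counts):.1%} positive")
```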

We evaluate 18 proprietary and open-source agents on 538 tasks.

Author-reported evaluation methodology and scale (number of agents and tasks) as stated in abstract.

high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... evaluation sample size (agents and tasks)

Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion.

Author description of interaction protocol structure (design specification in paper abstract).

high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... coverage of interaction types (mid-turn and post-turn) in the protocol
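
The three interaction types named in the claim above can be written down as a small data model. This is only a sketch of the taxonomy as the abstract describes it; the class names, field names, and validation rule are assumptions made for illustration, not DeskCraft's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    MID_TURN = "mid-turn"
    POST_TURN = "post-turn"


class Initiator(Enum):
    AGENT = "agent"
    USER = "user"


class Kind(Enum):
    CLARIFICATION = "clarification"  # agent asks under uncertainty
    INTERRUPTION = "interruption"    # user cuts in during execution
    FEEDBACK = "feedback"            # user responds after completion


# The (phase, initiator) pairing stated for each kind in the claim text.
ALLOWED = {
    Kind.CLARIFICATION: (Phase.MID_TURN, Initiator.AGENT),
    Kind.INTERRUPTION: (Phase.MID_TURN, Initiator.USER),
    Kind.FEEDBACK: (Phase.POST_TURN, Initiator.USER),
}


@dataclass(frozen=True)
class Interaction:
    kind: Kind
    phase: Phase
    initiator: Initiator
    message: str

    def __post_init__(self) -> None:
        # Reject combinations the claim does not describe.
        if ALLOWED[self.kind] != (self.phase, self.initiator):
            raise ValueError(f"{self.kind.value} is not valid as "
                             f"{self.initiator.value}-initiated {self.phase.value}")


example = Interaction(Kind.CLARIFICATION, Phase.MID_TURN, Initiator.AGENT,
                      "Which layer should the mask apply to?")
print(example.kind.value, example.phase.value)
```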

DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges.

Author statement in abstract describing the protocol (design/method contribution).

high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... formalization of human-agent interaction protocol

DeskCraft covers professional creative software across design, video, audio, and 3D creation.

Author statement in abstract listing covered software domains.

high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... breadth of software domains covered by the benchmark

DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long-horizon tasks requiring over 50 execution steps.

Benchmark design described in abstract (explicit statement that long-horizon tasks require over 50 execution steps).

high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... task difficulty / horizon length (number of execution steps)

We introduce DeskCraft, a desktop GUI benchmark targeting long-horizon creative and engineering workflows and proactive human-agent collaboration.

Author statement describing the new benchmark (benchmark design and scope described in paper).

high positive DeskCraft: Benchmarking Desktop Agents on Professional Workf... availability of a benchmark for long-horizon workflows and human-agent collabora...

Taiji demonstrates robust scalability in web-scale environments.

Assertion supported by deployment at large scale and claimed daily user numbers (paper's scalability claim).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... system scalability in web-scale production

Taiji yields significant commercial revenue.

Claim in the paper that the deployed system produces notable commercial revenue (no quantitative revenue figures provided in abstract).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... commercial revenue impact

Taiji currently serves over 400 million users daily.

Operational usage statistic reported in the paper (daily active users served).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... number of users served daily

Taiji has been deployed on Kuaishou's advertising platform since May 2026.

Deployment statement reported in the paper (operational deployment date claimed).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... deployment/adoption on a major platform

Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji.

Empirical evaluation section claiming extensive offline experiments and online A/B testing validate the system (no numeric details in abstract).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... effectiveness of Taiji on unspecified evaluation metrics (offline and online)

Theoretically, POPO achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences.

Theoretical analysis/claims in the paper asserting optimality of the proposed POPO method (presumably proofs or propositions).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... optimality of trade-off between semantic knowledge and collaborative ID preferen...

To resolve the RL alignment issue, Taiji proposes Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights.

Algorithmic contribution described in the paper introducing POPO as an RL method for adaptive reward-weighting.

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... adaptive adjustment of cross-domain reward weights during RL

To overcome the SFT bottleneck, Taiji utilizes reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific chain-of-thought (CoT) data.

Methodological description in the paper detailing reverse-engineered reasoning and open-ended rejection sampling as data-generation techniques.

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... quality of generated domain-specific CoT data

We present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems.

Paper's core contribution: proposal of a new framework described in the manuscript.

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... availability of an LLM-as-Enhancer framework

Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry.

Framing statement in the paper's introduction/abstract asserting industry trend (literature/industry observation).

high positive Taiji: Pareto Optimal Policy Optimization with Semantics-IDs... industry adoption of LLM-based recommender approaches

The paper constructs firm-level indicators of artificial intelligence and new quality productive forces for new energy vehicle firms.

Authors state they constructed firm-level indicators as part of their empirical approach on the Yangtze River Delta panel dataset.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... construction of measurement indicators (methodological contribution)

Artificial intelligence affects firms' new quality productive forces through improvement of innovation output.

Mechanism tests reported by the authors showing empirical evidence that AI improves innovation output (e.g., measured innovation outcomes), which in turn is linked to higher new quality productive forces.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (mediated by innovation output)

Artificial intelligence affects firms' new quality productive forces through optimization of R&D personnel structure.

Mechanism tests reported by the authors using the constructed indicators and panel data; the cited empirical evidence links AI to changes in R&D personnel structure, which in turn are linked to new quality productive forces.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (mediated by R&D personnel structure)

The promoting effect of artificial intelligence on new quality productive forces is more pronounced among small-sized enterprises.

Heterogeneity tests by firm size in the panel data; authors report stronger positive effects for small-sized firms.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (firm-size heterogeneity)

The promoting effect of artificial intelligence on new quality productive forces is more pronounced in Jiangsu and Zhejiang provinces.

Heterogeneity tests on the Yangtze River Delta panel data comparing regional subsamples; authors report stronger positive effects in Jiangsu and Zhejiang.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces (regional heterogeneity)

The positive effect of artificial intelligence on firms' new quality productive forces remains robust after addressing endogeneity concerns and conducting robustness checks.

Authors report endogeneity-corrected estimations and multiple robustness checks on the same panel dataset and constructed firm-level indicators; specific endogeneity correction methods and robustness checks are not detailed in the excerpt.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces

Artificial intelligence significantly promotes the growth of new quality productive forces in new energy vehicle firms.

Panel data analysis of new energy vehicle firms in the Yangtze River Delta from 2001 to 2023; firm-level indicators of artificial intelligence and new quality productive forces constructed; regression estimation showing a significant positive effect.

high positive Mechanisms and Effects of Artificial Intelligence on New Qua... new quality productive forces

Proactive, edge-side prompt optimization can substantially reduce inference costs without sacrificing coding quality.

Aggregate experimental results on token reductions and preserved/improved task accuracy reported in the paper.

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... inference cost (via token usage) and coding quality

Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends.

Head-to-head experimental comparison reported in the paper between the proposed middleware and LLMLingua-2 (matched compression rates) measuring OckScore.

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... OckScore (a task-specific performance metric)

Ablation studies indicate that the gains come primarily from the structural rewriting stage rather than simple function-name extraction.

Ablation experiments reported in the paper comparing full rewrite pipeline versus variants (e.g., function-name extraction only).

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... source of performance/token-reduction gains (rewriting vs. function-name extract...

Prompt compression via the middleware preserves or improves task accuracy on the evaluated benchmark.

Reported task accuracy comparisons on OMH-Polyglot before and after applying middleware across evaluated backends.

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... task accuracy

The middleware reduces total tokens (prompt + completion) by up to 18.8 percent.

Empirical measurements reported in the paper comparing total token usage (prompt + completion) with and without middleware.

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... total token count (prompt + completion)

Across three commercial LLM backends, the middleware reduces prompt tokens by 34–47 percent.

Empirical results reported from experiments on OMH-Polyglot across three commercial LLM backends (aggregate token counts before vs. after middleware).

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... prompt token count

We introduce a pre-flight, edge-side prompt-rewriting middleware that runs locally (using Llama 3.2 (3B)) to perform cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original.

System implementation and design described in the paper (local Llama 3.2 (3B) model, translation, rewriting, and rewrite-with-fallback mechanism).

high positive Cross-Lingual Token Arbitrage: Optimizing Code Agent Context... ability to produce an optimized prompt not larger than the original (prompt size...
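
The rewrite-with-fallback safeguard described above can be sketched as a guard around any rewriting function. This is a minimal illustration, not the paper's implementation: the whitespace token counter stands in for a real tokenizer, the `Task:` pattern is a hypothetical validity check, and `rewrite` is any callable (in the paper, a local Llama 3.2 (3B) model would play that role).

```python
import re
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude whitespace token count; a stand-in for the backend's tokenizer."""
    return len(text.split())


# Hypothetical validity check: the rewrite must contain a non-empty "Task:" line.
VALID_REWRITE = re.compile(r"^Task:\s*\S", re.MULTILINE)


def rewrite_with_fallback(prompt: str, rewrite: Callable[[str], str]) -> str:
    """Return the rewritten prompt only if it validates and is not larger."""
    try:
        candidate = rewrite(prompt)
    except Exception:
        return prompt  # rewriter failed: fall back to the original
    if not VALID_REWRITE.search(candidate):
        return prompt  # malformed output: fall back
    if count_tokens(candidate) > count_tokens(prompt):
        return prompt  # rewrite grew the prompt: fall back
    return candidate


original = "Please could you kindly fix the bug in the parse function thanks"
print(rewrite_with_fallback(original, lambda p: "Task: fix parse bug"))
print(rewrite_with_fallback(original, lambda p: "Task: " + p + " " + p))
```

Because every failure path returns the original prompt, the output can never exceed the input under the chosen token counter, which is the property the claim states.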

Addressing these issues entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

Specific prescriptive recommendations listed by the authors as part of the proposed research paradigm; offered as proposed methods rather than empirically validated interventions in the excerpt.

high positive Solipsistic Superintelligence is Unlikely to be Cooperative recommended design and evaluation practices for AI (dynamic testbeds, institutio...

The paper calls for a non-solipsistic research paradigm that treats interdependence as a core design principle rather than approaching cooperation as a task to solve.

Normative/research-agenda claim made by the authors; stated in the paper as a recommended change in research approach without empirical tests.

high positive Solipsistic Superintelligence is Unlikely to be Cooperative research paradigm orientation (non-solipsistic vs. solipsistic)

Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence.

Prescriptive/theoretical recommendation by the authors; framed as necessary to address the earlier-claimed train-test-deploy gap, without empirical demonstration in the excerpt.

high positive Solipsistic Superintelligence is Unlikely to be Cooperative ability of AI to close the train-test-deploy gap via cooperative participation

AI's central challenge is shifting from capability to coexistence.

Author's conceptual assertion in the paper; no empirical data, sample, or experiment reported.

high positive Solipsistic Superintelligence is Unlikely to be Cooperative the primary challenge for AI development (capability vs. coexistence)

The audit detects significant engagement premiums for three exploitation-related dimensions: performative labor, emotional bait, and privacy violations.

Aggregated analysis across labeled dimensions showing positive associations between these dimensions and views; privacy violations are mentioned in the summary of findings, but no specific effect size for them is reported in the provided text.

high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts (association with labeled exploitation dimensions)

Within-channel analyses indicate median view boosts of +56.0% for performative content (FDR-corrected p < 0.001), with effects holding in same-year robustness checks (p = 0.030).

Within-channel analyses for performative-content label showing median percent boost, FDR-corrected significance, and robustness check restricting comparisons to same-year videos.

high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts (median percent boost)

Within-channel analyses indicate median view boosts of +65.6% for emotional bait content (FDR-corrected p < 0.001).

Within-channel (fixed-effects or matched) comparisons of emotional-bait-labeled vs. other videos, with multiple-testing correction (FDR); reported median percent boost and p-value.

high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts (median percent boost)
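
One way to read a "within-channel median view boost" is sketched below: compare labeled and unlabeled videos inside each channel, then take the median of the per-channel percent differences. The excerpt does not specify the paper's estimator, so the function, its input format, and the use of per-channel medians are all assumptions.

```python
from collections import defaultdict
from statistics import median


def median_within_channel_boost(videos):
    """videos: iterable of (channel_id, is_labeled, views) tuples.

    Returns the median, across channels, of the percent difference between
    the median views of labeled and unlabeled videos in the same channel.
    """
    by_channel = defaultdict(lambda: {True: [], False: []})
    for channel, labeled, views in videos:
        by_channel[channel][bool(labeled)].append(views)

    boosts = []
    for groups in by_channel.values():
        if groups[True] and groups[False]:  # channel must have both kinds
            base = median(groups[False])
            if base > 0:
                boosts.append(100.0 * (median(groups[True]) - base) / base)
    return median(boosts) if boosts else None


# Toy data, not from the paper.
toy = [
    ("a", True, 150), ("a", True, 170), ("a", False, 100), ("a", False, 100),
    ("b", True, 300), ("b", False, 200),
    ("c", False, 500),  # no labeled videos, so channel c is skipped
]
print(median_within_channel_boost(toy))
```

Comparing within a channel keeps channel size and audience constant, so large channels do not dominate the estimate.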

A mixed-effects regression controlling for channel-level variation shows that a one-unit increase in exploitation score yields a 4.4× increase in views (p < 0.001).

Mixed-effects regression analysis with channel-level random effects on the full video dataset; reported multiplicative effect and p-value.

high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts
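
A multiplicative effect such as 4.4× usually comes from a model fitted on log-transformed views. Assuming that is the case here (the excerpt does not state the transformation), the sketch below converts the multiplier to the implied log-scale coefficient and back; the function names are illustrative.

```python
import math


def multiplier_to_log_coef(multiplier: float) -> float:
    """Coefficient on a natural-log outcome implied by a per-unit multiplier."""
    return math.log(multiplier)


def predicted_multiplier(beta: float, delta: float) -> float:
    """Multiplicative change in the outcome for a change of `delta` units."""
    return math.exp(beta * delta)


beta = multiplier_to_log_coef(4.4)
print(round(beta, 3))
print(round(predicted_multiplier(beta, 0.5), 2))  # effect of a half-unit increase
```

Under this assumption the coefficient is about 1.48, and effects compound multiplicatively: a two-unit increase would correspond to 4.4 × 4.4 = 19.36 times the views.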

Exploitation scores correlate with view counts (Spearman ρ = 0.229, p < 10^{-50}).

Spearman rank correlation computed between exploitation scores and view counts across the study dataset (5,051 videos).

high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... view counts
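
Spearman's ρ is the Pearson correlation of the two variables' ranks, with tied values given the average of their positions. The sketch below implements that definition with the standard library only; the toy inputs are made up and are not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


scores = [0.1, 0.4, 0.2, 0.9]    # toy exploitation scores
views = [10, 1000, 50, 100000]   # toy view counts, monotone in the scores
print(spearman_rho(scores, views))
```

Because only ranks matter, the heavy right tail typical of view counts does not distort the coefficient the way it would distort a Pearson correlation on raw views.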

A multi-annotator validation study (N=107) shows strong agreement with human judgment: macro-average F1 = 0.911 and high sensitivity for overall exploitation risk (recall = 0.960, F1 = 0.793).

Multi-annotator validation study with 107 human annotations comparing model/weak-supervision labels to human judgments; reported classification metrics.

high positive Auditing Engagement Incentives in the Kidfluencer Ecosystem:... classification performance for exploitation detection (F1, recall)
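
Recall and F1 together determine precision, since F1 = 2PR / (P + R). Assuming the reported recall (0.960) and F1 (0.793) refer to the same binary label, the sketch below solves for the implied precision; this value is derived here and is not reported in the excerpt.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision: F1 must be below 2 * recall")
    return f1 * recall / denom


p = implied_precision(f1=0.793, recall=0.960)
print(round(p, 3))
```

Under that assumption, precision comes out at roughly 0.68: the classifier catches nearly all risky videos at the cost of more false positives, which matches the description of it as high-sensitivity.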

The positive influence of industrial robot application on MVCR is especially significant in low-technology industries.

Heterogeneity/subsample analysis reported in the paper showing larger estimated effects in low-tech industry groups.

high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) (interaction/heterogeneity by indust...

The positive influence of industrial robot application on MVCR is especially significant in downstream segments of the value chain.

Heterogeneity/subsample analysis reported in the paper showing larger estimated effects in downstream value-chain segments.

high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) (interaction/heterogeneity by value-...

The positive influence of industrial robot application on MVCR is particularly significant in privately owned businesses.

Heterogeneity/subsample analysis reported in the paper showing stronger estimated effects for privately owned firms compared with other ownership types.

high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR) (interaction/heterogeneity by owners...

Industrial robot application positively impacts manufacturing value chain resilience (MVCR).

Empirical assessment using the constructed industrial-robot application indices and MVCR index (regression/empirical analysis on Chinese A-share listed firms).

high positive Industrial Robot Application and the Manufacturing Value Cha... manufacturing value chain resilience (MVCR)

The effect of BDTA on improving CEE is more significant in enterprises with low market concentration.

Heterogeneity/subsample analysis on the listed manufacturing firm data (2010–2023) showing larger BDTA→CEE effects in firms operating in markets with lower concentration.

high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) (heterogeneous treatment effect by market conce...

The effect of BDTA on improving CEE is more significant in high-tech enterprises.

Heterogeneity/subsample analysis reported on listed manufacturing firms (2010–2023) indicating stronger BDTA→CEE effects among high-tech enterprises.

high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) (heterogeneous treatment effect by firm technol...

The effect of BDTA on improving CEE is more significant in non-state-owned enterprises.

Heterogeneity analysis (subsample analysis) reported by authors using the 2010–2023 listed manufacturing firm sample, showing stronger BDTA→CEE effects in non-state-owned firms compared to state-owned firms.

high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) (heterogeneous treatment effect by ownership)

BDTA improves CEE of manufacturing enterprises by enhancing internal control quality.

Theoretical channel analysis and empirical mediation/mechanism tests on listed manufacturing firms (2010–2023) showing internal control quality is a mediator in the BDTA→CEE link.

high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) via internal control quality (mediator)

BDTA improves CEE of manufacturing enterprises by fostering green innovation.

Theoretical channel analysis plus empirical mediation/mechanism tests using the same sample (listed Chinese manufacturing firms 2010–2023) that show green innovation mediates the BDTA→CEE relationship.

high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE) via green innovation (mediator)

Big data technology application (BDTA) can improve carbon emission efficiency (CEE) of manufacturing enterprises.

Empirical panel regression analysis on listed companies in China's manufacturing industry from 2010 to 2023; authors report baseline regressions showing a positive relationship between BDTA and CEE.

high positive Big data technology application and carbon emission efficien... carbon emission efficiency (CEE)
