Evidence (4560 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	378	106	59	455	1007
Governance & Regulation	379	176	116	58	739
Research Productivity	240	96	34	294	668
Organizational Efficiency	370	82	63	35	553
Technology Adoption Rate	296	118	66	29	513
Firm Productivity	277	34	68	10	394
AI Safety & Ethics	117	177	44	24	364
Output Quality	244	61	23	26	354
Market Structure	107	123	85	14	334
Decision Quality	168	74	37	19	301
Fiscal & Macroeconomic	75	52	32	21	187
Employment Level	70	32	74	8	186
Skill Acquisition	89	32	39	9	169
Firm Revenue	96	34	22	—	152
Innovation Output	106	12	21	11	151
Consumer Welfare	70	30	37	7	144
Regulatory Compliance	52	61	13	3	129
Inequality Measures	24	68	31	4	127
Task Allocation	75	11	29	6	121
Training Effectiveness	55	12	12	16	96
Error Rate	42	48	6	—	96
Worker Satisfaction	45	32	11	6	94
Task Completion Time	78	5	4	2	89
Wages & Compensation	46	13	19	5	83
Team Performance	44	9	15	7	76
Hiring & Recruitment	39	4	6	3	52
Automation Exposure	18	17	9	5	50
Job Displacement	5	31	12	—	48
Social Protection	21	10	6	2	39
Developer Productivity	29	3	3	1	36
Worker Turnover	10	12	—	3	25
Skill Obsolescence	3	19	2	—	24
Creative Output	15	5	3	1	24
Labor Share of Income	10	4	9	—	23

Productivity

CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.

Authors' claim about potential use-cases and research enabled by the dataset; forward-looking/qualitative statement.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... support for various research directions (capability to enable research)

CUA-Suite provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.

Dataset/benchmark description in paper: UI-Vision benchmark and GroundCUA counts (56,000 screenshots, >3,600,000 UI element annotations).

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... size and scope of GroundCUA (annotated screenshots and UI element annotations) a...

Continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks (unlike sparse datasets that capture only final click coordinates).

Argument made in paper contrasting continuous video to sparse screenshots/final click coordinates; conceptual/logical claim about information content and transformability.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... information content and transformability of continuous video vs. sparse data

VideoCUA provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.

Dataset description and counts reported in paper: ~10,000 tasks, 87 applications, 30 fps, ~55 hours, ~6,000,000 frames, plus annotation modalities.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... size and modality coverage of the VideoCUA dataset (tasks, hours, frames, annota...
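
As a quick consistency check on the reported VideoCUA figures, the frame count implied by 30 fps over roughly 55 hours can be computed directly. The sketch below uses only the numbers quoted in the claim; the function names are illustrative, and the per-task average assumes the roughly 10,000 tasks account for all of the roughly 55 hours, which the excerpt does not state explicitly.

```python
def implied_frames(fps: int, hours: float) -> int:
    """Frames implied by a constant frame rate over a recording duration."""
    return round(fps * hours * 3600)


def mean_task_seconds(hours: float, tasks: int) -> float:
    """Average recording length per task, assuming tasks span the whole corpus."""
    return hours * 3600 / tasks


frames = implied_frames(30, 55)
avg_seconds = mean_task_seconds(55, 10_000)

print(frames)       # 5940000, in line with "approximately 6 million frames"
print(avg_seconds)  # about 19.8 seconds per task under the stated assumption
```

The implied total of 5.94 million frames agrees with the rounded figure in the claim.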

Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents.

Cites/references recent literature (stated in abstract) asserting the importance of continuous video over sparse screenshots.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... importance of continuous video vs. sparse screenshots for scaling CUAs

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows.

Statement in paper's introduction/abstract; conceptual claim based on prior literature and motivation for the work.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... promise/ability to automate complex desktop workflows

Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities.

Author statement referencing extensive offline evaluations showing these capabilities; no metrics, datasets, or sample sizes provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query recognition and user profiling performance

OneSearch-V2 introduces a behavior preference alignment optimization system which mitigates reward hacking arising from the single conversion metric and addresses personal preference via direct user feedback.

Methodological description of an optimization/feedback component in the paper; no empirical quantification of mitigation or user-feedback effects provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... mitigation of reward hacking from single-metric optimization and alignment with ...

OneSearch-V2 contains a reasoning-internalized self-distillation training pipeline that uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning.

Methodological description of the training pipeline in the paper; no direct quantitative evidence or ablation results given in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... ability to infer latent user intent beyond behavior logs

OneSearch-V2 includes a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference.

Methodological description of the proposed module in the paper; no standalone evaluation numbers for this module provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query understanding capability (depth of understanding vs. shallow semantic matc...

OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.

Author claim in the paper stating mitigation of these issues and no added inference/latency costs; no quantitative measures, benchmarks, or latency numbers provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... information bubbles and long-tail sparsity (and inference/serving latency)

Manual evaluation confirms a +1.37% gain in query-item relevance.

Reported manual evaluation metric in the paper; no sample size or annotation protocol provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query-item relevance

Manual evaluation confirms gains in search experience quality, with +1.65% in page good rate.

Reported manual evaluation metric in the paper; no sample size or annotation protocol provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... page good rate

OneSearch-V2 increases order volume by +2.11% in online A/B tests.

Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... order volume

OneSearch-V2 increases buyer conversion rate by +3.05% in online A/B tests.

Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... buyer conversion rate

OneSearch-V2 increases item CTR by +3.98% in online A/B tests.

Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... item CTR

OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits.

Author assertion describing OneSearch as industrial-scale and commercially/operationally beneficial; no supporting numerical evidence or sample size reported in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... commercial and operational benefits

Generative Retrieval (GR) offers advantages over multi-stage cascaded architectures such as end-to-end joint optimization and high computational efficiency.

Statement in paper positioning GR as a promising paradigm and listing these advantages; no quantitative study or sample size reported in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... computational efficiency and ability to perform end-to-end joint optimization

Automation in Japanese manufacturing increased even during periods of slow productivity growth.

Empirical finding from applying the framework to industry-level data in Japanese manufacturing; comparison of inferred automation trends with observed productivity growth periods (exact sample/time not provided in the summary).

high positive The macroeconomics of automation trend in automation versus productivity growth (automation increased despite slo...

Applying the framework to Japanese manufacturing industries shows that automation increased through capital deepening.

Empirical application of the theoretical framework to Japanese manufacturing industries (industry-level analysis); estimation/inference using industry macro observables. (Paper states result; exact sample size/time span not provided in the summary.)

high positive The macroeconomics of automation increase in automation (share of tasks by capital) attributable to capital deepe...

The model provides a transparent mapping from standard macroeconomic observables (capital-labor ratio, output per worker, elasticity of substitution) into the degree of automation, allowing automation to be measured without relying on technology-specific indicators.

Theoretical mapping derived from the CES structure that links observable macro variables to the endogenous degree of automation; methodological claim about inference procedure.

high positive The macroeconomics of automation degree of automation inferred from macro observables
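
To illustrate what a mapping from macro observables to a degree of automation could look like, here is a minimal sketch. It is not the paper's derivation: it assumes a textbook normalized CES in per-worker terms, y = (a * k^rho + (1 - a))^(1/rho) with rho = (sigma - 1) / sigma, unit productivity terms, and the distribution parameter a read as the automation share. All function and variable names are hypothetical.

```python
def ces_output_per_worker(k: float, a: float, sigma: float) -> float:
    """Normalized CES output per worker (assumed form, not the paper's)."""
    if sigma <= 0 or sigma == 1.0:
        raise ValueError("sigma must be positive and different from 1")
    rho = (sigma - 1.0) / sigma
    return (a * k**rho + (1.0 - a)) ** (1.0 / rho)


def implied_automation_share(k: float, y: float, sigma: float) -> float:
    """Invert the assumed CES for the share a, given k = K/L and y = Y/L."""
    if sigma <= 0 or sigma == 1.0:
        raise ValueError("sigma must be positive and different from 1")
    if k <= 0 or y <= 0 or k == 1.0:
        raise ValueError("need k > 0, y > 0 and k != 1 to identify the share")
    rho = (sigma - 1.0) / sigma
    # From y^rho = a * k^rho + (1 - a), solve for a.
    return (y**rho - 1.0) / (k**rho - 1.0)


# Round trip with made-up inputs: generate y from a known share, then recover it.
true_share = 0.35
k_example, sigma_example = 4.0, 0.6
y_example = ces_output_per_worker(k_example, true_share, sigma_example)
recovered = implied_automation_share(k_example, y_example, sigma_example)
print(round(recovered, 6))
```

The inputs in the round trip are invented for illustration only. The point is that, under these assumptions, three observables pin down the share without any technology-specific indicator.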

Aggregating task-level decisions generates a CES production function in which the economy-wide degree of automation emerges endogenously.

Analytical derivation in the paper: aggregation of task-level adoption decisions yields a CES aggregate production function with endogenous automation parameter.

high positive The macroeconomics of automation form of aggregate production function / emergence of economy-wide automation par...

The degree of automation is defined as the share of tasks performed by capital rather than labor.

Explicit model definition provided in the paper (conceptual/theoretical definition).

high positive The macroeconomics of automation share of tasks performed by capital

The degree of automation in the aggregate economy emerges endogenously as an equilibrium outcome and can be inferred from standard macroeconomic data.

Theoretical development in a task-based production framework with endogenous technology adoption; mapping from model to observable macro variables (capital-labor ratio, output per worker, elasticity of substitution).

high positive The macroeconomics of automation degree of automation (economy-wide share of tasks performed by capital)

These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.

Conclusion drawn by the authors based on their implementation, token reduction, and reported accuracy/latency-related claims; generalization to large-scale production is asserted but not supported by detailed production deployment metrics in the excerpt.

high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... feasibility of production-grade text-to-SQL (precision and latency)

The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy).

Reported empirical evaluation comparing the authors' system to a prompt-engineered baseline (Gemini Flash 2.0) with explicit performance percentages for execution success and semantic accuracy; no sample size, test set composition, statistical significance, or evaluation protocol provided in the excerpt.

high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... execution success rate; semantic accuracy

The approach replaces costly external API calls with efficient local inference.

System design claim: the model is self-hosted and performs local inference instead of using external API-based LLM calls; no cost accounting or latency benchmarks provided in the excerpt.

high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... use of external API calls vs local inference (cost/efficiency implication)

Internalizing the schema reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100.

Reported measurement comparing input token counts before and after applying their approach (explicit numerical baseline and resulting counts provided); no sample size or distribution of token counts reported.

high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... input token count
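
The "over 99%" figure follows from the two token counts in the claim. A minimal sketch of the arithmetic, treating "17k" as 17,000 and using 100 as the upper bound on the reduced prompt (both are readings of the rounded figures, not exact measurements):

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# 100 tokens is an upper bound, so this is a lower bound on the reduction.
reduction = percent_reduction(17_000, 100)
print(round(reduction, 2))  # 99.41
```

Because the reduced prompt is described as fewer than 100 tokens, the actual reduction is at least this value.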

A novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts.

Methodological description (two-phase supervised fine-tuning) and claim that this internalization removes reliance on long-context prompts; no detailed experimental protocol or sample size provided in the excerpt.

high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... need for long-context prompts / model internalization of schema

We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11 that answers user queries about cricket statistics.

Stated implementation detail in the paper describing the model architecture and deployment target (CriQ conversational bot). No experimental sample size reported for this statement.

high positive Schema on the Inside: A Two-Phase Fine-Tuning Method for Hig... model specification and deployment

Equilibria of the extended model, which combines quality heterogeneity with reinforcement dynamics, also show increasing concentration consistent with power-law-like distributions (i.e., winner-take-most or superstar effects).

Theoretical model combining quality heterogeneity and reinforcement dynamics that yields equilibrium distributions with heavy tails; argument and formalization presented in the paper; no empirical testing reported.

high positive The Economics of Builder Saturation in Digital Markets market concentration / distribution of returns (power-law-like)

Even as the number of producers increases and average attention per producer falls, total output expands (production scales elastically).

Same formal theoretical model (analytical result): production scales elastically in the model despite finite attention; no empirical validation provided.

high positive The Economics of Builder Saturation in Digital Markets total market output

If you can prove the value and the effort behind API token spending (agent memory), you can resell it.

Normative/operational claim within the paper's proposal; presented as an implication of verifiable provenance and market layering, with no empirical proof or transactional data.

high positive Infrastructure for Valuable, Tradable, and Verifiable Agent ... resellability of artifacts derived from API token spending

Enabling timely memory transfer reduces repeated exploration.

Argument in the paper asserting that shared/tradable memory decreases redundant exploration; no experimental or observational data provided.

high positive Infrastructure for Valuable, Tradable, and Verifiable Agent ... frequency/amount of repeated exploration by agents

Together, clawgang and meowtrade transform one-shot API token spending into reusable and tradable assets.

High-level systems argument in the paper; no empirical measurements of reuse or tradability presented.

high positive Infrastructure for Valuable, Tradable, and Verifiable Agent ... conversion of one-shot API calls into reusable/tradable assets

Meowtrade is a market layer for listing, transferring, and governing certified memory artifacts.

Design proposal described in the paper; no pilot deployment, user adoption metrics, or experimental data provided.

high positive Infrastructure for Valuable, Tradable, and Verifiable Agent ... existence/functionality of a market layer for certified memory artifacts

Clawgang binds memory to verifiable computational provenance.

System/design claim describing the proposed mechanism (clawgang) in the paper; no implementation results or empirical validation reported.

high positive Infrastructure for Valuable, Tradable, and Verifiable Agent ... ability to cryptographically or procedurally link memories to provenance

Agent memory can serve as an economic commodity in the agent economy, if buyers can verify that it is authentic, effort-backed, and produced in a compatible execution context.

Conceptual argument in the paper's proposal; no empirical evaluation, sample size, or experiments reported.

high positive Infrastructure for Valuable, Tradable, and Verifiable Agent ... feasibility of agent memory becoming a tradable commodity

Economic theory can be used to generate structured synthetic data that improves foundation-model predictions when the theory implies observable patterns in the data.

General conclusion drawn from the paper's experimental findings: improvement in model predictions after fine-tuning on theory-derived synthetic data.

high positive GARP-EFM: Improving Foundation Models with Revealed Preferen... improvement in foundation-model prediction accuracy when using theory-generated ...

Fine-tuning on GARP-consistent synthetic data substantially improves prediction relative to zero-shot Chronos-2 at all forecast horizons we study.

Empirical results comparing fine-tuned Chronos-2 to zero-shot Chronos-2 across multiple forecast horizons on the authors' experimental panel (no numeric metrics or sample sizes given in the excerpt).

high positive GARP-EFM: Improving Foundation Models with Revealed Preferen... forecast prediction accuracy across forecast horizons

The fine-tuned model serves as a rationality-constrained forecasting prior: it learns price-quantity relations from GARP-consistent synthetic histories and then uses those relations to predict the choices of real consumers.

Empirical approach described in paper: model fine-tuned on synthetic GARP-consistent histories and then evaluated on real consumer choice data (supports claim that model transfers learned relations to predicting real choices).

high positive GARP-EFM: Improving Foundation Models with Revealed Preferen... model's ability to predict real consumer choices (use of learned price-quantity ...

GARP is a simple condition to check that allows us to generate time series from a large class of utilities efficiently.

Methodological argument in the paper: authors use GARP as a constructive condition to generate synthetic time series from many utility functions (no numeric efficiency metrics provided in the excerpt).

high positive GARP-EFM: Improving Foundation Models with Revealed Preferen... feasibility/efficiency of generating synthetic time series from utility classes
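
For readers unfamiliar with the condition, the standard textbook GARP check can be sketched as follows. The excerpt does not describe how the authors generate or filter their synthetic series, so this only shows the check itself; function names and example data are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Expenditure on bundle b at prices a."""
    return sum(x * y for x, y in zip(a, b))


def satisfies_garp(prices: List[Sequence[float]],
                   bundles: List[Sequence[float]]) -> bool:
    """Return True if the price/bundle observations satisfy GARP."""
    n = len(bundles)
    # Direct revealed preference: t R s if bundle s was affordable
    # at prices t when bundle t was chosen.
    rel = [[dot(prices[t], bundles[t]) >= dot(prices[t], bundles[s])
            for s in range(n)] for t in range(n)]
    # Transitive closure (Warshall's algorithm).
    for k in range(n):
        for i in range(n):
            if rel[i][k]:
                for j in range(n):
                    if rel[k][j]:
                        rel[i][j] = True
    # Violation: t is revealed preferred to s, yet t was strictly
    # cheaper than s at the prices where s was chosen.
    for t in range(n):
        for s in range(n):
            if rel[t][s] and dot(prices[s], bundles[s]) > dot(prices[s], bundles[t]):
                return False
    return True


# Made-up two-good examples with integer prices and quantities.
consistent = satisfies_garp([(1, 2), (2, 1)], [(4, 1), (1, 4)])
violating = satisfies_garp([(1, 2), (2, 1)], [(1, 4), (4, 1)])
print(consistent, violating)  # True False
```

In the violating example each bundle was affordable when the other was chosen and strictly cheaper, so no utility function with locally non-satiated preferences can rationalize both choices.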

Teaching foundation models basic economic logic improves how they predict demand, as evaluated on an experimental panel.

Reported experimental results in the paper: fine-tuning models on synthetic, economics-consistent data and evaluating on an experimental panel of consumer demand (no numeric sample size or metrics provided in the excerpt).

high positive GARP-EFM: Improving Foundation Models with Revealed Preferen... prediction accuracy of consumer demand

AI adoption and the associated improved governance lead to higher total factor productivity (TFP).

Empirical analysis showing a positive association between firm-level AI application index and measures of total factor productivity in the 2010–2023 Chinese A-share panel.

high positive The risk-mitigation effects of artificial intelligence adopt... total factor productivity (TFP)

AI adoption and the associated improved governance lead to a lower cost of debt financing for firms.

Empirical tests linking firm-level AI application and governance improvements to measures of debt financing costs (e.g., interest rates on debt, financing spreads) in the Chinese A-share firm sample.

high positive The risk-mitigation effects of artificial intelligence adopt... cost of debt financing (interest rate/spread measures)

The governance risk-mitigation effects of AI operate through enhancing external monitoring.

Mechanism analyses showing that AI adoption is associated with measures of stronger external monitoring (e.g., analyst coverage, media scrutiny, regulator activity) in the firm-year panel, linking that channel to reduced misconduct.

high positive The risk-mitigation effects of artificial intelligence adopt... external monitoring intensity (analyst coverage, media/regulatory scrutiny proxi...

The governance risk-mitigation effects of AI operate through strengthening internal control capacity.

Mechanism analyses showing that higher AI application is associated with improved internal control measures (as reported by firms or regulatory/financial-control indicators) in the dataset of Chinese A-share firms.

high positive The risk-mitigation effects of artificial intelligence adopt... internal control capacity (corporate internal control metrics)

The governance risk-mitigation effects of AI operate through lowering agency costs.

Mechanism analyses reported by authors linking AI adoption to reductions in measures interpreted as agency costs (e.g., agency-cost proxies, corporate governance metrics) in the same firm-year panel.

high positive The risk-mitigation effects of artificial intelligence adopt... agency costs (proxied by governance/financial measures)

AI application significantly reduces the monetary amount of penalties associated with executive misconduct.

Regression analyses on monetary penalty data for Chinese A-share firms (2010–2023) showing a statistically significant negative relationship between firm AI application index and penalty amounts.

high positive The risk-mitigation effects of artificial intelligence adopt... monetary amount of penalties for executive misconduct

AI application significantly reduces the frequency (number) of violations by executives.

Empirical frequency/regression analyses on the firm-year panel of Chinese A-share firms using the AI application index; authors report robust reductions in the number/frequency of violations conditional on AI adoption.

high positive The risk-mitigation effects of artificial intelligence adopt... frequency (count) of executive violations
