Evidence (8066 claims)

Claims by topic:

- Adoption: 5586 claims
- Productivity: 4857 claims
- Governance: 4381 claims
- Human-AI Collaboration: 3417 claims
- Labor Markets: 2685 claims
- Innovation: 2581 claims
- Org Design: 2499 claims
- Skills & Training: 2031 claims
- Inequality: 1382 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 417 | 113 | 67 | 480 | 1091 |
| Governance & Regulation | 419 | 202 | 124 | 64 | 823 |
| Research Productivity | 261 | 100 | 34 | 303 | 703 |
| Organizational Efficiency | 406 | 96 | 71 | 40 | 616 |
| Technology Adoption Rate | 323 | 128 | 74 | 38 | 568 |
| Firm Productivity | 307 | 38 | 70 | 12 | 432 |
| Output Quality | 260 | 71 | 27 | 29 | 387 |
| AI Safety & Ethics | 118 | 179 | 45 | 24 | 368 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 75 | 37 | 19 | 312 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 74 | 34 | 78 | 9 | 197 |
| Skill Acquisition | 98 | 36 | 40 | 9 | 183 |
| Innovation Output | 121 | 12 | 24 | 13 | 171 |
| Firm Revenue | 98 | 35 | 24 | — | 157 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 87 | 16 | 34 | 7 | 144 |
| Inequality Measures | 25 | 76 | 32 | 5 | 138 |
| Regulatory Compliance | 54 | 61 | 13 | 3 | 131 |
| Task Completion Time | 89 | 7 | 4 | 3 | 103 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 33 | 11 | 7 | 98 |
| Wages & Compensation | 54 | 15 | 20 | 5 | 94 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 27 | 26 | 10 | 6 | 72 |
| Job Displacement | 6 | 39 | 13 | — | 58 |
| Hiring & Recruitment | 40 | 4 | 6 | 3 | 53 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 11 | 6 | 2 | 41 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 6 | 9 | — | 27 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Humans can compensate for monotonic miscalibration (overconfidence and underconfidence) through repeated experience.
Behavioral experiment results showing participants adapted successfully in overconfidence and underconfidence conditions (N = 200, 50 trials).
Robust learning occurred across all calibration conditions (standard, overconfidence, underconfidence, reverse) with participants improving accuracy, discrimination, and calibration.
Behavioral experiment (N = 200) reporting consistent learning improvements across the four experimental conditions over 50 trials.
Participants significantly improved their calibration alignment (alignment between their confidence predictions and actual AI correctness) over 50 trials.
Behavioral experiment (N = 200) reporting improvements in calibration alignment metrics across trials.
Participants significantly improved their discrimination (ability to distinguish correct vs. incorrect AI outputs) over 50 trials.
Behavioral experiment (N = 200) reporting improved discrimination metrics across repeated trials.
Participants significantly improved their prediction accuracy of the AI's correctness over 50 trials.
Behavioral experiment (N = 200), longitudinal measurement across 50 trials reporting statistically significant improvement in accuracy.
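The three quantities above (accuracy, discrimination, calibration alignment) can be sketched as per-trial-block metrics. This is a hedged illustration: the function name and the specific metric definitions (absolute calibration gap, mean-confidence discrimination) are common choices assumed here, not necessarily the ones used in the experiment.

```python
def trial_metrics(confidences, predictions, ai_correct):
    """Metrics for a participant predicting an AI's correctness over a block of trials.

    confidences: participant's stated probability that the AI is correct (0..1)
    predictions: participant's binary prediction of AI correctness (0/1)
    ai_correct:  whether the AI actually was correct on each trial (0/1)
    """
    n = len(ai_correct)
    # Prediction accuracy: fraction of trials where the participant
    # correctly anticipated whether the AI would be right.
    accuracy = sum(p == c for p, c in zip(predictions, ai_correct)) / n
    # Calibration alignment: mean absolute gap between stated confidence
    # and actual AI correctness (lower is better).
    calibration_gap = sum(abs(conf - c) for conf, c in zip(confidences, ai_correct)) / n
    # Discrimination: mean confidence on trials where the AI was correct
    # minus mean confidence where it was wrong (higher is better).
    correct_conf = [conf for conf, c in zip(confidences, ai_correct) if c]
    wrong_conf = [conf for conf, c in zip(confidences, ai_correct) if not c]
    discrimination = (sum(correct_conf) / len(correct_conf)
                      - sum(wrong_conf) / len(wrong_conf))
    return accuracy, calibration_gap, discrimination
```

Improvement "over 50 trials" would then show up as accuracy and discrimination rising, and the calibration gap falling, across successive blocks.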
All data and models are publicly released.
Statement in abstract asserting public release of datasets and models.
CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.
Authors' claim about potential use-cases and research enabled by the dataset; forward-looking/qualitative statement.
CUA-Suite provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
Dataset/benchmark description in paper: UI-Vision benchmark and GroundCUA counts (56,000 screenshots, >3,600,000 UI element annotations).
Continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks (unlike sparse datasets that capture only final click coordinates).
Argument made in paper contrasting continuous video to sparse screenshots/final click coordinates; conceptual/logical claim about information content and transformability.
VideoCUA provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.
Dataset description and counts reported in paper: ~10,000 tasks, 87 applications, 30 fps, ~55 hours, ~6,000,000 frames, plus annotation modalities.
Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents.
Cites/references recent literature (stated in abstract) asserting the importance of continuous video over sparse screenshots.
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows.
Statement in paper's introduction/abstract; conceptual claim based on prior literature and motivation for the work.
The framework is designed for direct application to engineering processes for which operational event logs are available.
Statement of intended applicability in the paper and demonstration on a large enterprise procurement workflow (BPI 2019 log).
The same quantities that delimit statistically credible autonomy (blind masses, escalation gate, m(s), etc.) also determine expected oversight burden (the framework includes an expected oversight-cost identity over the workflow visitation measure).
Theoretical identity and discussion in the paper plus demonstration on the empirical workflow showing how the introduced quantities relate to expected oversight costs.
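One plausible illustrative form of such an identity (purely a reading of this summary; the paper's actual definitions are not given in the excerpt): with workflow visitation measure $\mu$, per-step oversight cost $c(s)$, and an entropy gate that escalates state $s$ to a human whenever $H(\hat{\pi}(\cdot\mid s))$ exceeds a threshold $\tau_H$, the expected per-step oversight cost would read

```latex
\mathbb{E}[\text{oversight cost}]
  \;=\; \int_{\mathcal{S}} c(s)\,
  \mathbf{1}\!\left\{ H\!\big(\hat{\pi}(\cdot\mid s)\big) > \tau_H \right\}
  \, \mu(\mathrm{d}s).
```

Under a form like this, the same quantities that gate autonomy (the escalation threshold, the policy estimate) mechanically determine how much of the visitation mass is routed to human oversight.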
On the held-out split, m(s) = max_a \hat{\pi}(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average.
Empirical evaluation on the paper's held-out test split (chronological 20%); reported average discrepancy between the maximum predicted action probability and realized autonomous-step accuracy.
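A minimal sketch of one way to reproduce this comparison, assuming the policy is estimated by relative frequency from (state, action) pairs; the function names and the mean-absolute-gap averaging are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict

def fit_policy(train_pairs):
    """Estimate pi_hat(a|s) from (state, action) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for s, a in train_pairs:
        counts[s][a] += 1
    policy = {}
    for s, c in counts.items():
        total = sum(c.values())
        policy[s] = {a: n / total for a, n in c.items()}
    return policy

def mean_m_gap(policy, test_pairs):
    """Mean absolute gap between m(s) = max_a pi_hat(a|s) and the realized
    accuracy of acting autonomously via argmax on a held-out split."""
    per_state = defaultdict(lambda: [0, 0])  # state -> [hits, visits]
    for s, a in test_pairs:
        if s not in policy:
            continue  # unseen state: no autonomous action to score
        best = max(policy[s], key=policy[s].get)
        per_state[s][0] += (best == a)
        per_state[s][1] += 1
    gaps = [abs(max(policy[s].values()) - hits / visits)
            for s, (hits, visits) in per_state.items()]
    return sum(gaps) / len(gaps)
```

On the paper's held-out split, the analogous average gap is reported as 3.4 percentage points.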
Refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668.
Empirical report in the paper showing state-space expansion when additional contextual variables are included in state definition (numbers 42 and 668 stated).
We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process.
Empirical instantiation described in the paper using the BPI 2019 purchase-to-pay event log; dataset statistics (cases, events, distinct actions) and an 80/20 chronological train/test split are reported.
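A chronological 80/20 split of an event log can be sketched as below. The excerpt does not state whether the paper cuts at the event level or the case level, so this illustrative version sorts by timestamp and cuts at the event level.

```python
def chronological_split(events, frac_train=0.8):
    """Split an event log chronologically: the earliest frac_train of events
    form the training period, the remainder the held-out test period.

    Each event is a (timestamp, case_id, action) tuple.
    """
    ordered = sorted(events, key=lambda e: e[0])
    cut = int(len(ordered) * frac_train)
    return ordered[:cut], ordered[cut:]
```

Unlike a random split, a chronological cut preserves the deployment-relevant asymmetry: the agent is evaluated only on workflow behavior that occurs after everything it was fit on.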
We develop a measure-theoretic Markov framework for agentic AI in organizations, whose core quantities are state blind-spot mass B_n(\tau), state-action blind mass B^{SA}_{\pi,n}(\tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure.
Theoretical development presented in the paper (definition and derivation of the measure-theoretic Markov framework and associated quantities).
Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities.
Author statement referencing extensive offline evaluations showing these capabilities; no metrics, datasets, or sample sizes provided in the excerpt.
OneSearch-V2 introduces a behavior preference alignment optimization system which mitigates reward hacking arising from the single conversion metric and addresses personal preference via direct user feedback.
Methodological description of an optimization/feedback component in the paper; no empirical quantification of mitigation or user-feedback effects provided in the excerpt.
OneSearch-V2 contains a reasoning-internalized self-distillation training pipeline that uncovers users' latent yet precise e-commerce intentions beyond log-fitting through implicit in-context learning.
Methodological description of the training pipeline in the paper; no direct quantitative evidence or ablation results given in the excerpt.
OneSearch-V2 includes a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference.
Methodological description of the proposed module in the paper; no standalone evaluation numbers for this module provided in the excerpt.
OneSearch-V2 effectively mitigates common search system issues such as information bubbles and long-tail sparsity, without incurring additional inference costs or serving latency.
Author claim in the paper stating mitigation of these issues and no added inference/latency costs; no quantitative measures, benchmarks, or latency numbers provided in the excerpt.
Manual evaluation confirms gains in query-item relevance, with a +1.37% improvement.
Reported manual evaluation metric in the paper; no sample size or annotation protocol provided in the excerpt.
Manual evaluation confirms gains in search experience quality, with +1.65% in page good rate.
Reported manual evaluation metric in the paper; no sample size or annotation protocol provided in the excerpt.
OneSearch-V2 increases order volume by +2.11% in online A/B tests.
Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.
OneSearch-V2 increases buyer conversion rate by +3.05% in online A/B tests.
Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.
OneSearch-V2 increases item CTR by +3.98% in online A/B tests.
Reported online A/B test result in the paper; no sample size, test duration, or statistical significance reported in the excerpt.
OneSearch, as a representative industrial-scale deployed generative search framework, has brought significant commercial and operational benefits.
Author assertion describing OneSearch as industrial-scale and commercially/operationally beneficial; no supporting numerical evidence or sample size reported in the excerpt.
Generative Retrieval (GR) offers advantages over multi-stage cascaded architectures such as end-to-end joint optimization and high computational efficiency.
Statement in paper positioning GR as a promising paradigm and listing these advantages; no quantitative study or sample size reported in the excerpt.
The framework aims to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
Stated aims and intended impact in paper; aspirational/conceptual rather than empirically demonstrated in excerpt.
Operationalizing evaluation through interaction traces rather than model properties or self-reported trust enables deployment-relevant assessment of calibration, error recovery, and governance.
Methodological claim/proposed approach in paper; presented as enabling assessment but no empirical evaluation reported in excerpt.
The taxonomy and metrics are connected to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration.
Conceptual mapping described in paper; no empirical tests or sample reported in excerpt.
We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time.
Explicit methodological claim in paper announcing a taxonomy; described as a contribution rather than empirically tested in excerpt.
This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness.
Methodological contribution presented in paper; conceptual framework proposed (no empirical validation reported in excerpt).
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making.
Statement in paper (conceptual/observational claim); no empirical sample or method provided in excerpt.
Late disclosure of AI involvement improved affective engagement for AI-enhanced content.
Reported experimental result in the abstract from the two online studies (study 1: n = 325; study 2: n = 371) manipulating disclosure timing (early vs. late).
Automation in Japanese manufacturing increased even during periods of slow productivity growth.
Empirical finding from applying the framework to industry-level data in Japanese manufacturing; comparison of inferred automation trends with observed productivity growth periods (exact sample/time not provided in the summary).
Applying the framework to Japanese manufacturing industries shows that automation increased through capital deepening.
Empirical application of the theoretical framework to Japanese manufacturing industries (industry-level analysis); estimation/inference using industry macro observables. (Paper states result; exact sample size/time span not provided in the summary.)
The model provides a transparent mapping from standard macroeconomic observables (capital-labor ratio, output per worker, elasticity of substitution) into the degree of automation, allowing automation to be measured without relying on technology-specific indicators.
Theoretical mapping derived from the CES structure that links observable macro variables to the endogenous degree of automation; methodological claim about inference procedure.
Aggregating task-level decisions generates a CES production function in which the economy-wide degree of automation emerges endogenously.
Analytical derivation in the paper: aggregation of task-level adoption decisions yields a CES aggregate production function with endogenous automation parameter.
The degree of automation is defined as the share of tasks performed by capital rather than labor.
Explicit model definition provided in the paper (conceptual/theoretical definition).
The degree of automation in the aggregate economy emerges endogenously as an equilibrium outcome and can be inferred from standard macroeconomic data.
Theoretical development in a task-based production framework with endogenous technology adoption; mapping from model to observable macro variables (capital-labor ratio, output per worker, elasticity of substitution).
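As a stylized illustration of such a mapping (not the paper's exact derivation): suppose the aggregate takes a CES form in which $\alpha$ plays the role of the degree of automation and $\sigma$ is the elasticity of substitution,

```latex
Y = \Big[\alpha\, K^{\frac{\sigma-1}{\sigma}} + (1-\alpha)\, L^{\frac{\sigma-1}{\sigma}}\Big]^{\frac{\sigma}{\sigma-1}}.
```

Dividing by $L$ and writing $y = Y/L$, $k = K/L$ gives $y^{\frac{\sigma-1}{\sigma}} = \alpha\, k^{\frac{\sigma-1}{\sigma}} + (1-\alpha)$, so

```latex
\alpha \;=\; \frac{y^{\frac{\sigma-1}{\sigma}} - 1}{k^{\frac{\sigma-1}{\sigma}} - 1},
```

i.e. under these illustrative assumptions the degree of automation is recoverable from the capital-labor ratio, output per worker, and the elasticity of substitution alone, with no technology-specific indicator.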
The results of this regional research outline a multi-dimensional policy roadmap that examines the region's current capabilities and the hurdles it faces in catching up with the AI revolution from a governance and policy perspective, presented as a practical framework for public sector leaders.
Report summary claiming that the study's results produce a comprehensive roadmap and practical framework (content description).
This executive report provides a roadmap for establishing an AI governance infrastructure through a set of strategic policy recommendations across seven key pillars.
Document assertion describing the content and structure of the report (authors' deliverable).
The reality of limited AI governance capacity calls for a series of policy interventions at both local and regional levels to empower the AI ecosystem in the Arab region.
Authors' policy recommendation derived from the regional study and synthesis of findings.
A governance model linking 'trustworthy AI' practices to competitive advantage yields reduced uncertainty, faster deployment cycles, and higher stakeholder trust.
Central claim of the paper tying the proposed AIGSF to business benefits; supported by conceptual linkage and illustrative examples rather than quantified empirical evidence or controlled evaluation.
Case illustrations across hiring, credit, consumer services, and generative AI draw lessons on controls such as model documentation, algorithmic audits, impact assessments, and human-in-the-loop oversight.
Paper includes qualitative case illustrations in the listed domains to demonstrate governance controls; these are presented as examples and lessons rather than as systematic empirical studies (no sample sizes reported).
The paper develops an AI Governance Strategic Framework (AIGSF) and an implementation roadmap that connect ethical accountability, regulatory readiness, cybersecurity resilience, and performance outcomes.
Paper contribution described as an integrative conceptual framework and roadmap; supported by theoretical grounding and illustrative cases rather than empirical validation; no sample size provided.
AI governance should be treated as a strategic governance function—anchored in board oversight and enterprise risk management—rather than a narrow technical or compliance task.
Central normative recommendation and thesis of the paper; derived from an integrative conceptual framework grounded in corporate governance theory, ERM, and emerging regulation. No empirical testing or sample reported.