Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Retraining can be better understood as approximate Bayesian inference under computational constraints.

Theoretical argument and decision-theoretic framing presented in the paper (conceptual/mathematical derivation rather than empirical testing).

high positive Retraining as Approximate Bayesian Inference conceptual framing of retraining

The analysis was pre-registered and code and data are publicly available.

Authors' statement in the abstract/paper declaring pre-registration and public release of code and data.

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... research transparency (pre-registration and public code/data)

The meta-d' framework reveals which models 'know what they don't know' versus which merely appear well-calibrated due to criterion placement — a distinction with direct implications for model selection, deployment, and human-AI collaboration.

Interpretation and implications drawn from empirical results showing dissociations between calibration metrics and metacognitive measures (meta-d', M-ratio, criterion shifts); argument that this distinction informs practical decisions about model use.

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... distinction between true metacognitive capacity and apparent calibration driven ...

We applied this framework to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials.

Experimental methods reported in the paper listing the four model variants and total trial count (224,000 factual QA trials).

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... empirical evaluation of models' Type-1 and Type-2 metrics across factual QA tria...

We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio.

Methodological contribution described in the paper: specification of a Type-2 SDT framework and use of meta-d' and M-ratio as measurement constructs.

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... decomposition of Type-1 vs Type-2 capacities using meta-d' and M-ratio
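
As a rough illustration of the quantities named in this entry, the sketch below computes Type-1 sensitivity d' from a hit rate and a false-alarm rate and forms the M-ratio as meta-d' divided by d'. Estimating meta-d' itself requires fitting a Type-2 SDT model to confidence ratings, which is not shown; the meta-d' value and the rates used here are hypothetical placeholders, not figures from the paper.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Type-1 sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf  # inverse standard-normal CDF
    return z(hit_rate) - z(false_alarm_rate)


def m_ratio(meta_d: float, d: float) -> float:
    """Metacognitive efficiency: M-ratio = meta-d' / d'."""
    if d == 0:
        raise ValueError("d' is zero, so the M-ratio is undefined")
    return meta_d / d


# Hypothetical rates and a hypothetical fitted meta-d' (assumed, not from the paper).
d = d_prime(hit_rate=0.80, false_alarm_rate=0.20)
meta_d = 0.9 * d
print(f"d' = {d:.3f}, M-ratio = {m_ratio(meta_d, d):.2f}")
```

An M-ratio near 1 would mean confidence judgments use nearly all the information available to the Type-1 decision; values well below 1 indicate metacognitive inefficiency.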

The best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search.

Analysis comparing the origin of each best final design with its ILP ranking, reported across the 12-kernel benchmark set.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... origin/ranking of best designs relative to ILP candidates

Larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×.

Per-benchmark empirical results reported for streamcluster and kmeans in the evaluation.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... execution/performance speedup for specific benchmarks

Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline.

Empirical evaluation across the reported benchmark set comparing performance with 1 agent versus 10 agents; mean speedup stated in the results.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... execution/performance speedup relative to baseline

We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS.

Experimental setup described in the paper reporting evaluation on 12 kernels drawn from HLS-Eval and Rodinia-HLS, using Claude Code (Opus 4.5/4.6) and AMD Vitis HLS.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... evaluation dataset and toolchain used

In Stage 2, the pipeline launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition.

Method section describing Stage 2 which runs multiple expert agents exploring cross-function optimizations on top ILP solutions.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... description of Stage 2 expert-agent exploration of cross-function optimizations

In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint.

Method section describing Stage 1 decomposition, per-sub-kernel optimization and ILP assembly under an area constraint.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... description of Stage 1 decomposition and ILP-based assembly
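
The assembly step described in this entry can be sketched as a choose-one-variant-per-sub-kernel problem under an area budget. The version below enumerates all combinations instead of calling an ILP solver, and it assumes total latency is the sum of sub-kernel latencies; both choices are simplifications for illustration, and the sub-kernel names and numbers are hypothetical.

```python
from itertools import product

# Hypothetical candidates per sub-kernel: (latency_cycles, area_units).
candidates = {
    "load": [(100, 10), (60, 25), (40, 50)],
    "compute": [(400, 30), (150, 80), (90, 160)],
    "store": [(80, 10), (50, 30)],
}


def assemble(candidates, area_budget):
    """Pick one variant per sub-kernel, minimizing summed latency within the area budget.

    Returns (latency, area, {sub_kernel: variant_index}) or None if nothing fits.
    """
    names = list(candidates)
    best = None
    for choice in product(*(range(len(candidates[n])) for n in names)):
        latency = sum(candidates[n][i][0] for n, i in zip(names, choice))
        area = sum(candidates[n][i][1] for n, i in zip(names, choice))
        if area <= area_budget and (best is None or latency < best[0]):
            best = (latency, area, dict(zip(names, choice)))
    return best


print(assemble(candidates, area_budget=150))
```

Enumeration grows exponentially with the number of sub-kernels, which is one reason a real flow would hand the same objective and constraint to an ILP solver.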

We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents.

Method description in the paper describing the design and implementation of the two-stage 'agent factory' pipeline.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... existence and design of the two-stage agent factory pipeline

Deployment validation across 43 classrooms demonstrated an 18x efficiency gain in the assessment workflow.

Field deployment described in the paper: system was validated across 43 classrooms and an efficiency gain of 18x in the assessment workflow is reported.

high positive When AI Meets Early Childhood Education: Large Language Mode... efficiency of the assessment workflow (time/resources per assessment)

Interaction2Eval achieves up to 88% agreement with human expert judgments.

Reported evaluation results comparing Interaction2Eval outputs to human expert annotations (rubric-based judgments) on the dataset.

high positive When AI Meets Early Childhood Education: Large Language Mode... agreement between AI-generated assessments and human expert judgments

Interaction2Eval, an LLM-based framework, addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, rubric-based reasoning).

Methodological description in the paper: a specialized LLM-based pipeline designed to handle listed domain challenges; presented as the approach used to extract structured quality indicators.

high positive When AI Meets Early Childhood Education: Large Language Mode... capability to handle domain-specific technical challenges in automated assessmen...

TEPE-TCI-370h is the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations.

Authors' dataset construction and description: 370 hours of recorded interactions from 105 classrooms, annotated with ECQRS-EC and SSTEW rubrics as reported in the paper.

high positive When AI Meets Early Childhood Education: Large Language Mode... availability of a large-scale annotated dataset for preschool teacher-child inte...

The dataset provides a reproducible and scalable foundation for research on technological diffusion, regional digitalisation, and industry-level transformation, and can be readily extended to future years or adapted to other countries.

Text asserts reproducibility, scalability, and extendability of the dataset and methods for future years and other countries.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

By providing indicators for two benchmark years, the dataset supports the study of how AI adoption evolves across the Spanish business landscape.

Text highlights the availability of indicators for 2023 and 2025 and claims this supports temporal study of adoption evolution.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

This multi-dimensional structure enables users to explore territorial patterns, sectoral differences, and size-related disparities in the uptake of AI.

Text claims that the dataset's dimensions make it possible to explore spatial (territorial), sectoral, and size-related patterns in AI uptake.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

For each province–sector–size combination, the dataset reports whether firms adopt AI, whether they apply it internally, whether it is embedded in their offerings, and how many firms have valid website content.

Text explicitly lists the reported indicators at the province–sector–size aggregation level (adoption, internal use, embedded in offerings, count of valid website content).

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate
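
To make the cell structure in this entry concrete, the sketch below rolls firm-level flags up to province, sector, and size cells. The field names, the sample records, and the rule that a firm counts as an adopter when either flag is set are assumptions for illustration; the dataset's actual schema is not given in this listing.

```python
from collections import defaultdict

# Hypothetical firm-level records (field names are assumed).
firms = [
    {"province": "Madrid", "sector": "ICT", "size": "small",
     "valid_site": True, "internal": True, "offering": False},
    {"province": "Madrid", "sector": "ICT", "size": "small",
     "valid_site": True, "internal": False, "offering": False},
    {"province": "Madrid", "sector": "ICT", "size": "small",
     "valid_site": False, "internal": False, "offering": False},
    {"province": "Sevilla", "sector": "Retail", "size": "large",
     "valid_site": True, "internal": True, "offering": True},
]


def aggregate(firms):
    """Count valid sites, adopters, internal use, and embedded use per cell."""
    cells = defaultdict(
        lambda: {"valid_sites": 0, "adopters": 0, "internal": 0, "offering": 0}
    )
    for f in firms:
        if not f["valid_site"]:
            continue  # firms without usable website content are not counted
        cell = cells[(f["province"], f["sector"], f["size"])]
        cell["valid_sites"] += 1
        cell["internal"] += int(f["internal"])
        cell["offering"] += int(f["offering"])
        cell["adopters"] += int(f["internal"] or f["offering"])
    return dict(cells)


for key, counts in aggregate(firms).items():
    print(key, counts)
```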

The dataset offers a detailed portrait of AI adoption across regions (NUTS 3), industries, and firm size categories.

Text claims multi-dimensional reporting by region (NUTS 3), industry, and firm size categories in the dataset.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The pipeline identifies explicit evidence of AI use both in firms' internal processes and embedded in their products or services.

Text states the structured rubric is used to identify explicit evidence of AI use in internal processes and in products/services.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The paper uses a systemic pipeline based on large language models (LLMs) to segment website text, semantically filter it, and evaluate it with a structured rubric.

Text describes methodological pipeline components (LLM-based segmentation, semantic filtering, structured rubric evaluation).

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... other

The dataset results in 225,628 firm-year observations.

Text explicitly reports 225,628 firm-year observations derived from the dataset across the two benchmark years.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The paper introduces a nationwide dataset that maps how 112,814 Spanish firms communicate and implement artificial intelligence (AI) on their corporate websites in 2023 and 2025.

Text states dataset coverage and firm count (112,814 firms) and benchmark years (2023 and 2025).

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate
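
The firm count in this entry and the firm-year count reported for the same dataset are arithmetically consistent with every firm being observed in both benchmark years. The check below only verifies the arithmetic; that the panel is balanced is an inference from the numbers, not a statement from the paper.

```python
firms = 112_814        # firms reported for the dataset
benchmark_years = 2    # 2023 and 2025

firm_years = firms * benchmark_years
print(firm_years)      # 225628, matching the reported firm-year observations
```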

These results provide a mechanistic account of how humans adapt their trust in AI confidence signals through experience.

Combined behavioral evidence (N = 200) and computational modeling (LLO + Rescorla–Wagner) presented in the paper.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... mechanistic explanation of trust adaptation to AI confidence signals

The model indicates that humans adapt by updating two components: baseline trust and confidence sensitivity, and they use asymmetric learning rates that prioritize the most informative errors.

Parameter recovery / model-fitting results reported in the paper showing updates to baseline trust and sensitivity parameters and asymmetric learning-rate estimates.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... latent learning parameters (baseline trust, confidence sensitivity, asymmetric l...

A computational model using a linear-in-log-odds (LLO) transformation combined with a Rescorla–Wagner learning rule explains the observed learning dynamics.

Modeling analysis reported in the paper fitting an LLO + Rescorla–Wagner model to participants' behavioral data (N = 200).

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... model fit to behavioral learning dynamics
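
A minimal sketch of how a linear-in-log-odds mapping can be combined with a Rescorla-Wagner style update is given below. The paper's exact parameterization is not reproduced in this listing, so the update rule, the parameter names, and the choice to let the learning rate depend on the sign of the prediction error are all assumptions made for illustration.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TrustLearner:
    """Predicts P(AI correct) from stated AI confidence via an LLO mapping."""

    def __init__(self, baseline=0.0, sensitivity=1.0, lr_pos=0.05, lr_neg=0.15):
        self.baseline = baseline        # intercept in log-odds space
        self.sensitivity = sensitivity  # slope applied to logit(AI confidence)
        self.lr_pos = lr_pos            # learning rate when the AI did better than expected
        self.lr_neg = lr_neg            # learning rate when the AI did worse than expected

    def predict(self, ai_confidence: float) -> float:
        return sigmoid(self.baseline + self.sensitivity * logit(ai_confidence))

    def update(self, ai_confidence: float, ai_correct: bool) -> float:
        """Delta-rule update of both parameters; returns the prediction error."""
        error = (1.0 if ai_correct else 0.0) - self.predict(ai_confidence)
        lr = self.lr_pos if error > 0 else self.lr_neg
        self.baseline += lr * error
        self.sensitivity += lr * error * logit(ai_confidence)
        return error


# Hypothetical overconfident AI: it states 0.9 but is correct on 3 of every 5 trials.
learner = TrustLearner()
before = learner.predict(0.9)
for trial in range(200):
    learner.update(0.9, ai_correct=(trial % 5 < 3))
print(f"predicted correctness at stated 0.9: {before:.2f} -> {learner.predict(0.9):.2f}")
```

With unequal learning rates the learned prediction need not settle at the true accuracy; the sketch only shows the direction of adaptation, namely that trust in a stated 0.9 drops after repeated overconfident errors.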

Humans can compensate for monotonic miscalibration (overconfidence and underconfidence) through repeated experience.

Behavioral experiment results showing participants adapted successfully in overconfidence and underconfidence conditions (N = 200, 50 trials).

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... compensation for monotonic miscalibration (ability to adjust to over/underconfid...

Robust learning occurred across all calibration conditions (standard, overconfidence, underconfidence, reverse) with participants improving accuracy, discrimination, and calibration.

Behavioral experiment (N = 200) reporting consistent learning improvements across the four experimental conditions over 50 trials.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... learning (improvements in accuracy, discrimination, calibration) across conditio...

Participants significantly improved their calibration alignment (alignment between their confidence predictions and actual AI correctness) over 50 trials.

Behavioral experiment (N = 200) reporting improvements in calibration alignment metrics across trials.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... calibration alignment (match between predicted confidence and AI correctness)

Participants significantly improved their discrimination (ability to distinguish correct vs. incorrect AI outputs) over 50 trials.

Behavioral experiment (N = 200) reporting improved discrimination metrics across repeated trials.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... discrimination (ability to separate correct from incorrect AI outputs)

Participants significantly improved their prediction accuracy of the AI's correctness over 50 trials.

Behavioral experiment (N = 200), longitudinal measurement across 50 trials reporting statistically significant improvement in accuracy.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... accuracy (participants' correctness in predicting AI correctness)

All data and models are publicly released.

Statement in abstract asserting public release of datasets and models.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... public availability of data and models

CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.

Authors' claim about potential use-cases and research enabled by the dataset; forward-looking/qualitative statement.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... support for various research directions (capability to enable research)

CUA-Suite provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.

Dataset/benchmark description in paper: UI-Vision benchmark and GroundCUA counts (56,000 screenshots, >3,600,000 UI element annotations).

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... size and scope of GroundCUA (annotated screenshots and UI element annotations) a...

Continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks (unlike sparse datasets that capture only final click coordinates).

Argument made in paper contrasting continuous video to sparse screenshots/final click coordinates; conceptual/logical claim about information content and transformability.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... information content and transformability of continuous video vs. sparse data

VideoCUA provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.

Dataset description and counts reported in paper: ~10,000 tasks, 87 applications, 30 fps, ~55 hours, ~6,000,000 frames, plus annotation modalities.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... size and modality coverage of the VideoCUA dataset (tasks, hours, frames, annota...
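
The reported duration, frame rate, and frame count in this entry can be checked against each other with simple arithmetic, using only the approximate figures stated above.

```python
hours = 55             # approximate total duration reported
fps = 30               # recording frame rate reported

frames = hours * 3600 * fps
print(frames)          # 5940000, consistent with "approximately 6 million frames"
```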

Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents.

The abstract cites recent literature asserting the importance of continuous video over sparse screenshots.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... importance of continuous video vs. sparse screenshots for scaling CUAs

Computer-use agents (CUAs) hold great promise for automating complex desktop workflows.

Statement in paper's introduction/abstract; conceptual claim based on prior literature and motivation for the work.

high positive CUA-Suite: Massive Human-annotated Video Demonstrations for ... promise/ability to automate complex desktop workflows

The framework is designed for direct application to engineering processes for which operational event logs are available.

Statement of intended applicability in the paper and demonstration on a large enterprise procurement workflow (BPI 2019 log).

high positive The Stochastic Gap: A Markovian Framework for Pre-Deployment... adoptability / applicability to engineering processes

The same quantities that delimit statistically credible autonomy (blind masses, escalation gate, m(s), etc.) also determine expected oversight burden (the framework includes an expected oversight-cost identity over the workflow visitation measure).

Theoretical identity and discussion in the paper plus demonstration on the empirical workflow showing how the introduced quantities relate to expected oversight costs.

high positive The Stochastic Gap: A Markovian Framework for Pre-Deployment... expected oversight burden / oversight cost

On the held-out split, m(s) = max_a \hat{\pi}(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average.

Empirical evaluation on the paper's held-out test split (chronological 20%); reported average discrepancy between the maximum predicted action probability and realized autonomous-step accuracy.

high positive The Stochastic Gap: A Markovian Framework for Pre-Deployment... accuracy of autonomous step selection (realized autonomous step accuracy)
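
To illustrate the comparison in this entry, the sketch below estimates an empirical policy from logged (state, action) pairs, computes m(s) as the largest estimated action probability, and compares it with the held-out accuracy of always taking the most frequent action. The state and action names and the toy counts are hypothetical, and the paper's averaging over states (for example, weighting by visitation) is not reproduced here.

```python
from collections import Counter, defaultdict


def fit_policy(pairs):
    """Empirical policy: relative frequency of each action per state."""
    counts = defaultdict(Counter)
    for state, action in pairs:
        counts[state][action] += 1
    return {
        s: {a: n / sum(c.values()) for a, n in c.items()}
        for s, c in counts.items()
    }


def max_confidence(policy, state):
    """m(s): the largest estimated action probability in the given state."""
    return max(policy[state].values())


def autonomous_accuracy(policy, held_out):
    """Share of held-out steps (in seen states) where the argmax action matches the log."""
    scored = [(s, a) for s, a in held_out if s in policy]
    if not scored:
        raise ValueError("no held-out steps fall in states seen during fitting")
    hits = sum(max(policy[s], key=policy[s].get) == a for s, a in scored)
    return hits / len(scored)


# Hypothetical training and held-out logs for a single state.
train = [("invoice_received", "approve")] * 8 + [("invoice_received", "reject")] * 2
test = [("invoice_received", "approve")] * 7 + [("invoice_received", "reject")] * 3

policy = fit_policy(train)
m = max_confidence(policy, "invoice_received")
acc = autonomous_accuracy(policy, test)
print(f"m(s) = {m:.2f}, held-out accuracy = {acc:.2f}, gap = {abs(m - acc):.2f}")
```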

Refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668.

Empirical report in the paper showing state-space expansion when additional contextual variables are included in state definition (numbers 42 and 668 stated).

high positive The Stochastic Gap: A Markovian Framework for Pre-Deployment... other

We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process.

Empirical instantiation described in the paper using the BPI 2019 purchase-to-pay event log; dataset statistics (cases, events, distinct actions) and an 80/20 chronological train/test split are reported.

high positive The Stochastic Gap: A Markovian Framework for Pre-Deployment... other

We develop a measure-theoretic Markov framework for agentic AI in organizations, whose core quantities are state blind-spot mass B_n(\tau), state-action blind mass B^{SA}_{\pi,n}(\tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure.

Theoretical development presented in the paper (definition and derivation of the measure-theoretic Markov framework and associated quantities).

high positive The Stochastic Gap: A Markovian Framework for Pre-Deployment... other
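
One plausible reading of an entropy-based escalation gate is sketched below: compute the Shannon entropy of the estimated action distribution in a state and hand the step to a human when it exceeds a threshold. The threshold value, the use of natural-log entropy, and the example distributions are assumptions; the paper's exact gate definition is not given in this listing.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of an action distribution given as {action: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def should_escalate(dist, threshold):
    """Escalate to a human when the action distribution is too uncertain."""
    return entropy(dist) > threshold


# Hypothetical action distributions for two workflow states.
confident = {"approve": 0.95, "reject": 0.05}
uncertain = {"approve": 0.50, "reject": 0.50}

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(entropy(dist), 3), should_escalate(dist, threshold=0.5))
```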

Extensive offline evaluations demonstrate OneSearch-V2's strong query recognition and user profiling capabilities.

Author statement referencing extensive offline evaluations showing these capabilities; no metrics, datasets, or sample sizes provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query recognition and user profiling performance

OneSearch-V2 introduces a behavior preference alignment optimization system which mitigates reward hacking arising from the single conversion metric and addresses personal preference via direct user feedback.

Methodological description of an optimization/feedback component in the paper; no empirical quantification of mitigation or user-feedback effects provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... mitigation of reward hacking from single-metric optimization and alignment with ...

OneSearch-V2 contains a reasoning-internalized self-distillation training pipeline that uncovers users' potential yet precise e-commerce intentions beyond log-fitting through implicit in-context learning.

Methodological description of the training pipeline in the paper; no direct quantitative evidence or ablation results given in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... ability to infer latent user intent beyond behavior logs

OneSearch-V2 includes a thought-augmented complex query understanding module that enables deep query understanding and overcomes the shallow semantic matching limitations of direct inference.

Methodological description of the proposed module in the paper; no standalone evaluation numbers for this module provided in the excerpt.

high positive OneSearch-V2: The Latent Reasoning Enhanced Self-distillatio... query understanding capability (depth of understanding vs. shallow semantic matc...
