Evidence (3103 claims)

Claim counts by category:

| Category | Claims |
|---|---|
| Adoption | 5267 |
| Productivity | 4560 |
| Governance | 4137 |
| Human-AI Collaboration | 3103 |
| Labor Markets | 2506 |
| Innovation | 2354 |
| Org Design | 2340 |
| Skills & Training | 1945 |
| Inequality | 1322 |
Evidence Matrix
Claim counts by outcome category and direction of finding. Where a row's total exceeds the sum of the four listed columns, the remainder reflects claims whose direction falls outside these categories.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 378 | 106 | 59 | 455 | 1007 |
| Governance & Regulation | 379 | 176 | 116 | 58 | 739 |
| Research Productivity | 240 | 96 | 34 | 294 | 668 |
| Organizational Efficiency | 370 | 82 | 63 | 35 | 553 |
| Technology Adoption Rate | 296 | 118 | 66 | 29 | 513 |
| Firm Productivity | 277 | 34 | 68 | 10 | 394 |
| AI Safety & Ethics | 117 | 177 | 44 | 24 | 364 |
| Output Quality | 244 | 61 | 23 | 26 | 354 |
| Market Structure | 107 | 123 | 85 | 14 | 334 |
| Decision Quality | 168 | 74 | 37 | 19 | 301 |
| Fiscal & Macroeconomic | 75 | 52 | 32 | 21 | 187 |
| Employment Level | 70 | 32 | 74 | 8 | 186 |
| Skill Acquisition | 89 | 32 | 39 | 9 | 169 |
| Firm Revenue | 96 | 34 | 22 | — | 152 |
| Innovation Output | 106 | 12 | 21 | 11 | 151 |
| Consumer Welfare | 70 | 30 | 37 | 7 | 144 |
| Regulatory Compliance | 52 | 61 | 13 | 3 | 129 |
| Inequality Measures | 24 | 68 | 31 | 4 | 127 |
| Task Allocation | 75 | 11 | 29 | 6 | 121 |
| Training Effectiveness | 55 | 12 | 12 | 16 | 96 |
| Error Rate | 42 | 48 | 6 | — | 96 |
| Worker Satisfaction | 45 | 32 | 11 | 6 | 94 |
| Task Completion Time | 78 | 5 | 4 | 2 | 89 |
| Wages & Compensation | 46 | 13 | 19 | 5 | 83 |
| Team Performance | 44 | 9 | 15 | 7 | 76 |
| Hiring & Recruitment | 39 | 4 | 6 | 3 | 52 |
| Automation Exposure | 18 | 17 | 9 | 5 | 50 |
| Job Displacement | 5 | 31 | 12 | — | 48 |
| Social Protection | 21 | 10 | 6 | 2 | 39 |
| Developer Productivity | 29 | 3 | 3 | 1 | 36 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
| Skill Obsolescence | 3 | 19 | 2 | — | 24 |
| Creative Output | 15 | 5 | 3 | 1 | 24 |
| Labor Share of Income | 10 | 4 | 9 | — | 23 |
Human-AI Collaboration
The analysis was pre-registered and code and data are publicly available.
Authors' statement in the abstract/paper declaring pre-registration and public release of code and data.
The meta-d' framework reveals which models 'know what they don't know' versus which merely appear well-calibrated due to criterion placement — a distinction with direct implications for model selection, deployment, and human-AI collaboration.
Interpretation and implications drawn from empirical results showing dissociations between calibration metrics and metacognitive measures (meta-d', M-ratio, criterion shifts); argument that this distinction informs practical decisions about model use.
We applied this framework to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials.
Experimental methods reported in the paper listing the four model variants and total trial count (224,000 factual QA trials).
We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio.
Methodological contribution described in the paper: specification of a Type-2 SDT framework and use of meta-d' and M-ratio as measurement constructs.
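To make these constructs concrete, the standard Type-2 SDT quantities (not necessarily the paper's exact fitting procedure) are:

$$
d' = \Phi^{-1}(\mathrm{HR}) - \Phi^{-1}(\mathrm{FAR}), \qquad \text{M-ratio} = \frac{\text{meta-}d'}{d'}
$$

where HR and FAR are the type-1 hit and false-alarm rates, $\Phi^{-1}$ is the inverse standard-normal CDF, and meta-$d'$ is the type-1 sensitivity that would, under the same SDT model, reproduce the observed confidence-conditional (type-2) hit and false-alarm rates; meta-$d'$ is typically fit by maximum likelihood. An M-ratio of 1 means confidence reports carry as much information as the first-order decisions; values below 1 indicate metacognitive inefficiency even when calibration looks good due to criterion placement.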
Deployment validation across 43 classrooms demonstrated an 18x efficiency gain in the assessment workflow.
Field deployment described in the paper: system was validated across 43 classrooms and an efficiency gain of 18x in the assessment workflow is reported.
Interaction2Eval achieves up to 88% agreement with human expert judgments.
Reported evaluation results comparing Interaction2Eval outputs to human expert annotations (rubric-based judgments) on the dataset.
Interaction2Eval, an LLM-based framework, addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, rubric-based reasoning).
Methodological description in the paper: a specialized LLM-based pipeline designed to handle listed domain challenges; presented as the approach used to extract structured quality indicators.
TEPE-TCI-370h is the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations.
Authors' dataset construction and description: 370 hours of recorded interactions from 105 classrooms, annotated with ECQRS-EC and SSTEW rubrics as reported in the paper.
These results provide a mechanistic account of how humans adapt their trust in AI confidence signals through experience.
Combined behavioral evidence (N = 200) and computational modeling (LLO + Rescorla–Wagner) presented in the paper.
The model indicates that humans adapt by updating two components, baseline trust and confidence sensitivity, using asymmetric learning rates that prioritize the most informative errors.
Parameter recovery / model-fitting results reported in the paper showing updates to baseline trust and sensitivity parameters and asymmetric learning-rate estimates.
A computational model using a linear-in-log-odds (LLO) transformation combined with a Rescorla–Wagner learning rule explains the observed learning dynamics.
Modeling analysis reported in the paper fitting an LLO + Rescorla–Wagner model to participants' behavioral data (N = 200).
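A minimal sketch of an LLO + Rescorla–Wagner learner, assuming the paper's general construction (the function names, the gradient step, and the asymmetry rule keyed to thresholded prediction errors are illustrative assumptions, not the authors' exact parameterization):

```python
import math

def logit(p):
    """Log-odds; assumes 0 < p < 1 (clamp stated confidences first)."""
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def llo(conf, gamma, p0):
    """Linear-in-log-odds (LLO) transform of a stated AI confidence.
    gamma: confidence sensitivity (slope in log-odds space).
    p0: baseline trust (the anchor the curve pivots around)."""
    return inv_logit(gamma * logit(conf) + (1 - gamma) * logit(p0))

def rw_update(gamma, p0, conf, ai_correct, lr_small=0.05, lr_large=0.20):
    """One Rescorla-Wagner-style trial update with asymmetric learning
    rates: the larger rate applies when the thresholded prediction was
    wrong, i.e. on the most informative errors (illustrative rule)."""
    pred = llo(conf, gamma, p0)
    delta = (1.0 if ai_correct else 0.0) - pred        # prediction error
    lr = lr_large if (pred >= 0.5) != ai_correct else lr_small
    # chain rule through inv_logit: d(pred)/d(gamma)
    grad_gamma = pred * (1 - pred) * (logit(conf) - logit(p0))
    gamma = max(gamma + lr * delta * grad_gamma, 0.0)
    # simplified baseline-trust update, taken directly in probability space
    p0 = min(max(p0 + lr * delta, 1e-3), 1 - 1e-3)
    return gamma, p0
```

Calling `rw_update` once per trial with the AI's stated confidence and its observed correctness moves both parameters; the larger rate on misclassified trials is one way to encode "prioritize the most informative errors."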
Humans can compensate for monotonic miscalibration (overconfidence and underconfidence) through repeated experience.
Behavioral experiment results showing participants adapted successfully in overconfidence and underconfidence conditions (N = 200, 50 trials).
Robust learning occurred across all calibration conditions (standard, overconfidence, underconfidence, reverse) with participants improving accuracy, discrimination, and calibration.
Behavioral experiment (N = 200) reporting consistent learning improvements across the four experimental conditions over 50 trials.
Participants significantly improved their calibration alignment (alignment between their confidence predictions and actual AI correctness) over 50 trials.
Behavioral experiment (N = 200) reporting improvements in calibration alignment metrics across trials.
Participants significantly improved their discrimination (ability to distinguish correct vs. incorrect AI outputs) over 50 trials.
Behavioral experiment (N = 200) reporting improved discrimination metrics across repeated trials.
Participants significantly improved their prediction accuracy of the AI's correctness over 50 trials.
Behavioral experiment (N = 200), longitudinal measurement across 50 trials reporting statistically significant improvement in accuracy.
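The excerpt does not specify the paper's exact metrics, so as a sketch, here are common stand-ins for the three reported measures (thresholded accuracy, rank AUC for discrimination, and the Brier score for calibration alignment):

```python
import numpy as np

def block_metrics(pred_conf, ai_correct):
    """Per-block stand-ins for the three reported measures.
    pred_conf: participant's predicted P(AI correct) per trial, in [0, 1].
    ai_correct: 1 if the AI answer was actually correct, else 0."""
    pred_conf = np.asarray(pred_conf, dtype=float)
    ai_correct = np.asarray(ai_correct, dtype=int)

    # Prediction accuracy: thresholded prediction vs. actual correctness.
    accuracy = np.mean((pred_conf >= 0.5) == ai_correct)

    # Discrimination: rank AUC, i.e. the probability that a correct-AI
    # trial drew a higher predicted confidence than an incorrect one
    # (ties ignored for brevity).
    pos = pred_conf[ai_correct == 1]
    neg = pred_conf[ai_correct == 0]
    auc = np.mean(pos[:, None] > neg[None, :]) if len(pos) and len(neg) else np.nan

    # Calibration alignment: Brier score (lower is better).
    brier = np.mean((pred_conf - ai_correct) ** 2)
    return accuracy, auc, brier
```

Computing these per block of trials and comparing early against late blocks is one way to quantify the improvements reported above.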
All data and models are publicly released.
Statement in abstract asserting public release of datasets and models.
CUA-Suite's rich multimodal corpus supports emerging research directions including generalist screen parsing, continuous spatial control, video-based reward modeling, and visual world models.
Authors' claim about potential use-cases and research enabled by the dataset; forward-looking/qualitative statement.
CUA-Suite provides two complementary resources: UI-Vision, a rigorous benchmark for evaluating grounding and planning capabilities in CUAs, and GroundCUA, a large-scale grounding dataset with 56K annotated screenshots and over 3.6 million UI element annotations.
Dataset/benchmark description in paper: UI-Vision benchmark and GroundCUA counts (56,000 screenshots, >3,600,000 UI element annotations).
Continuous video streams preserve the full temporal dynamics of human interaction, forming a superset of information that can be losslessly transformed into the formats required by existing agent frameworks (unlike sparse datasets that capture only final click coordinates).
Argument made in paper contrasting continuous video to sparse screenshots/final click coordinates; conceptual/logical claim about information content and transformability.
VideoCUA provides approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, kinematic cursor traces, and multi-layered reasoning annotations, totaling approximately 55 hours and 6 million frames of expert video.
Dataset description and counts reported in paper: ~10,000 tasks, 87 applications, 30 fps, ~55 hours, ~6,000,000 frames, plus annotation modalities.
Recent work emphasizes that continuous video, not sparse screenshots, is the critical missing ingredient for scaling these agents.
Cites/references recent literature (stated in abstract) asserting the importance of continuous video over sparse screenshots.
Computer-use agents (CUAs) hold great promise for automating complex desktop workflows.
Statement in paper's introduction/abstract; conceptual claim based on prior literature and motivation for the work.
The framework is designed for direct application to engineering processes for which operational event logs are available.
Statement of intended applicability in the paper and demonstration on a large enterprise procurement workflow (BPI 2019 log).
The same quantities that delimit statistically credible autonomy (blind masses, the escalation gate, $m(s)$, etc.) also determine expected oversight burden (the framework includes an expected oversight-cost identity over the workflow visitation measure).
Theoretical identity and discussion in the paper plus demonstration on the empirical workflow showing how the introduced quantities relate to expected oversight costs.
On the held-out split, $m(s) = \max_a \hat{\pi}(a|s)$ tracks realized autonomous-step accuracy within 3.4 percentage points on average.
Empirical evaluation on the paper's held-out test split (chronological 20%); reported average discrepancy between the maximum predicted action probability and realized autonomous-step accuracy.
Refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668.
Empirical report in the paper showing state-space expansion when additional contextual variables are included in state definition (numbers 42 and 668 stated).
We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process.
Empirical instantiation described in the paper using the BPI 2019 purchase-to-pay event log; dataset statistics (cases, events, distinct actions) and an 80/20 chronological train/test split are reported.
We develop a measure-theoretic Markov framework for agentic AI in organizations, whose core quantities are the state blind-spot mass $B_n(\tau)$, the state-action blind mass $B^{SA}_{\pi,n}(\tau)$, an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure.
Theoretical development presented in the paper (definition and derivation of the measure-theoretic Markov framework and associated quantities).
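A simplified, log-driven instance of these quantities, as a sketch (the paper's definitions are measure-theoretic; the empirical estimator, the threshold semantics, and the reading of blind-spot mass as held-out visitation on unsupported states are assumptions here):

```python
import math
from collections import Counter, defaultdict

def fit_policy(train_events):
    """Estimate the empirical action distribution pi_hat(a|s) from
    (state, action) pairs in a training event log."""
    counts = defaultdict(Counter)
    for s, a in train_events:
        counts[s][a] += 1
    policy = {}
    for s, cs in counts.items():
        total = sum(cs.values())
        policy[s] = {a: c / total for a, c in cs.items()}
    return policy

def m(policy, s):
    """m(s) = max_a pi_hat(a|s): confidence in the most likely action."""
    return max(policy[s].values()) if s in policy else 0.0

def entropy_gate(policy, s, h_max):
    """Entropy-based human-in-the-loop gate: escalate when the action
    distribution at s is too uncertain, or s was never seen in training."""
    if s not in policy:
        return True  # blind spot: always escalate
    h = -sum(p * math.log(p) for p in policy[s].values())
    return h > h_max

def blind_mass(policy, test_states):
    """One reading of blind-spot mass: the share of held-out visitation
    that falls on states with no training support."""
    return sum(s not in policy for s in test_states) / len(test_states)
```

On a log like BPI 2019, `fit_policy` would be trained on the chronological 80% split and the gate and blind-mass estimates evaluated on the held-out 20%.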
The framework aims to support more comparable benchmarks and cumulative research on human-AI readiness, advancing safer and more accountable human-AI collaboration.
Stated aims and intended impact in paper; aspirational/conceptual rather than empirically demonstrated in excerpt.
Operationalizing evaluation through interaction traces rather than model properties or self-reported trust enables deployment-relevant assessment of calibration, error recovery, and governance.
Methodological claim/proposed approach in paper; presented as enabling assessment but no empirical evaluation reported in excerpt.
The taxonomy and metrics are connected to the Understand-Control-Improve (U-C-I) lifecycle of human-AI onboarding and collaboration.
Conceptual mapping described in paper; no empirical tests or sample reported in excerpt.
We introduce a four-part taxonomy of evaluation metrics spanning outcomes, reliance behavior, safety signals, and learning over time.
Explicit methodological claim in paper announcing a taxonomy; described as a contribution rather than empirically tested in excerpt.
This paper proposes a measurement framework for evaluating human-AI decision-making centered on team readiness.
Methodological contribution presented in paper; conceptual framework proposed (no empirical validation reported in excerpt).
Artificial intelligence (AI) systems are deployed as collaborators in human decision-making.
Statement in paper (conceptual/observational claim); no empirical sample or method provided in excerpt.
A governance model linking 'trustworthy AI' practices to competitive advantage yields reduced uncertainty, faster deployment cycles, and higher stakeholder trust.
Central claim of the paper tying the proposed AIGSF to business benefits; supported by conceptual linkage and illustrative examples rather than quantified empirical evidence or controlled evaluation.
Case illustrations across hiring, credit, consumer services, and generative AI draw lessons on controls such as model documentation, algorithmic audits, impact assessments, and human-in-the-loop oversight.
Paper includes qualitative case illustrations in the listed domains to demonstrate governance controls; these are presented as examples and lessons rather than as systematic empirical studies (no sample sizes reported).
The paper develops an AI Governance Strategic Framework (AIGSF) and an implementation roadmap that connect ethical accountability, regulatory readiness, cybersecurity resilience, and performance outcomes.
Paper contribution described as an integrative conceptual framework and roadmap; supported by theoretical grounding and illustrative cases rather than empirical validation; no sample size provided.
AI governance should be treated as a strategic governance function—anchored in board oversight and enterprise risk management—rather than a narrow technical or compliance task.
Central normative recommendation and thesis of the paper; derived from an integrative conceptual framework grounded in corporate governance theory, ERM, and emerging regulation. No empirical testing or sample reported.
AI has moved from a peripheral digital capability to a central driver of corporate strategy, reshaping decision-making, customer engagement, operations, and risk exposure.
Statement presented in the paper's introduction and motivation; supported by integrative conceptual design and literature grounding (theory and descriptive citations). No empirical sample or quantitative analysis reported.
A policy of 20% mandatory practice preserves 92% more capability than the simulation baseline (baseline includes a 5% background AI-failure rate).
Simulation comparing baseline (5% background AI-failure rate) to a counterfactual with 20% mandatory practice; reported 92% relative preservation of capability.
The model predicts that periodic AI failures improve human capability 2.7-fold (relative improvement reported in simulations).
Simulation experiments comparing scenarios with/without periodic AI failures; reported fold-change in capability of 2.7×.
Validated against 15 countries' PISA data (102 data points), the model achieves $R^2 = 0.946$ with 3 parameters and attains the lowest BIC among the compared specifications.
Empirical validation using PISA dataset covering 15 countries and 102 data points; reported fit statistics (R^2, number of parameters, BIC).
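For reference, the Bayesian information criterion used in that comparison penalizes model likelihood by parameter count:

$$
\mathrm{BIC} = k \ln n - 2 \ln \hat{L}
$$

with $k = 3$ parameters and $n = 102$ observations here; a lower BIC indicates a better fit-complexity trade-off.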
The model was calibrated to four domains: education, medicine, navigation, and aviation.
Model calibration procedures applied separately to four named domains reported in the paper.
We present a two-variable dynamical systems model coupling capability (H) and delegation (D), grounded in three axioms: learning requires capability, learning requires practice, and disuse causes forgetting.
Model specification and theoretical construction described in the paper (two-variable dynamical system; three axioms).
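A minimal Euler-integration sketch of the coupled H-D system (the functional forms, parameter values, and delegation rule below are illustrative assumptions consistent with the three axioms, not the paper's calibrated model):

```python
import numpy as np

def simulate(T=500, dt=0.1, alpha=0.30, beta=0.08,
             practice_floor=0.0, failure_period=None):
    """Euler integration of a capability (H) / delegation (D) system.
    Axioms encoded: the learning term requires both capability and
    practice; the forgetting term grows with disuse, i.e. delegation."""
    H, D = 0.8, 0.5
    traj = []
    for step in range(T):
        if failure_period and step > 0 and step % failure_period == 0:
            D = 0.0                              # periodic AI failure forces practice
        practice = max(1.0 - D, practice_floor)  # e.g. 0.2 = 20% mandatory practice
        dH = alpha * H * practice * (1.0 - H) - beta * D * H
        dD = 0.05 * (H - D)                      # delegation drifts toward capability (assumed)
        H = float(np.clip(H + dt * dH, 0.0, 1.0))
        D = float(np.clip(D + dt * dD, 0.0, 1.0))
        traj.append(H)
    return np.array(traj)

# Compare final capability with and without a 20% mandatory-practice floor:
h_base = simulate()[-1]
h_policy = simulate(practice_floor=0.2)[-1]
```

Under such dynamics, a practice floor or periodic forced practice limits how much the forgetting term erodes H, which is the qualitative mechanism behind the mandatory-practice and periodic-failure results above.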
Legal professionals, courts, and regulators should replace the outdated 'black box' mental model with verification protocols based on how these systems actually fail.
Policy recommendation stated in the abstract based on the paper's analysis; no trial or deployment evidence of such protocols provided in the excerpt.
The adoption of generative AI across commercial and legal professions offers dramatic efficiency gains.
Asserted in the paper's introduction/abstract; no empirical data, sample, or quantitative study reported in the excerpt.
Applying our framework to product listings on Etsy, we find that following ChatGPT's release, listings have significantly more machine-usable information about product selection, consistent with systematic mechanudging.
Empirical analysis of Etsy product listings comparing measures of machine-usable information about product selection before and after ChatGPT's release; the abstract reports a significant increase, but sample size and exact estimates are not provided in the excerpt.
The paper provides recommendations for designing strategic indicators to drive adoption, foster innovation, and objectively assess whether digital tools are delivering top-line impact.
Descriptive claim about the content of the perspective article (the authors state they provide these recommendations); the excerpt itself summarizes this contribution.
The shift from expert-driven computer-aided drug design (CADD) to semiautonomous AI necessitates a new framework of impact-oriented KPIs.
Stated by the EFMC community authors as a normative conclusion in the perspective piece; based on the characterisation of a technological shift rather than on presented empirical tests in the excerpt.