Evidence (8066 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	417	113	67	480	1091
Governance & Regulation	419	202	124	64	823
Research Productivity	261	100	34	303	703
Organizational Efficiency	406	96	71	40	616
Technology Adoption Rate	323	128	74	38	568
Firm Productivity	307	38	70	12	432
Output Quality	260	71	27	29	387
AI Safety & Ethics	118	179	45	24	368
Market Structure	107	128	85	14	339
Decision Quality	177	75	37	19	312
Fiscal & Macroeconomic	89	58	33	22	209
Employment Level	74	34	78	9	197
Skill Acquisition	98	36	40	9	183
Innovation Output	121	12	24	13	171
Firm Revenue	98	35	24	—	157
Consumer Welfare	73	31	37	7	148
Task Allocation	87	16	34	7	144
Inequality Measures	25	76	32	5	138
Regulatory Compliance	54	61	13	3	131
Task Completion Time	89	7	4	3	103
Error Rate	44	51	6	—	101
Training Effectiveness	58	12	12	16	99
Worker Satisfaction	47	33	11	7	98
Wages & Compensation	54	15	20	5	94
Team Performance	47	12	15	7	82
Automation Exposure	27	26	10	6	72
Job Displacement	6	39	13	—	58
Hiring & Recruitment	40	4	6	3	53
Developer Productivity	34	4	3	1	42
Social Protection	22	11	6	2	41
Creative Output	16	7	5	1	29
Labor Share of Income	12	6	9	—	27
Skill Obsolescence	3	20	2	—	25
Worker Turnover	10	12	—	3	25

Grok attracts users primarily for its content policy.

Survey items asking users for reasons they use each platform; reported attribution of content policy as primary reason for Grok (overall N=388).

high positive Beyond Benchmarks: How Users Evaluate AI Chat Assistants reported adoption reason for Grok (content policy)

DeepSeek attracts users primarily through word-of-mouth.

Survey items asking users for reasons they use each platform; reported attribution of word-of-mouth as primary reason for DeepSeek (overall N=388).

high positive Beyond Benchmarks: How Users Evaluate AI Chat Assistants reported adoption reason for DeepSeek (word-of-mouth)

Claude attracts users primarily for answer quality.

Survey items asking users for reasons they use each platform; reported attribution of answer quality as primary reason for Claude (overall N=388).

high positive Beyond Benchmarks: How Users Evaluate AI Chat Assistants reported adoption reason for Claude (answer quality)

ChatGPT attracts users primarily for its interface.

Survey items asking users for reasons they use each platform; reported attribution of interface as primary reason for ChatGPT (overall N=388).

high positive Beyond Benchmarks: How Users Evaluate AI Chat Assistants reported adoption reason for ChatGPT (interface)

Over 80% of users use two or more platforms (i.e., multi-platform usage is common).

Survey self-reports aggregated across respondents (paper reports 'over 80%'); overall sample N=388.

high positive Beyond Benchmarks: How Users Evaluate AI Chat Assistants number/proportion of users using multiple platforms

We conducted a cross-platform survey of 388 active AI chat users comparing satisfaction, adoption drivers, use case performance, and qualitative frustrations across seven major platforms: ChatGPT, Claude, Gemini, DeepSeek, Grok, Mistral, and Llama.

Cross-sectional online survey described in the paper; sample size reported as 388 users; seven named platforms explicitly listed.

high positive Beyond Benchmarks: How Users Evaluate AI Chat Assistants survey sample and platform coverage

The authors call for shifting evaluation and assurance from tool qualification toward workflow qualification to achieve trustworthy Physical AI.

Normative recommendation based on the paper's theoretical analysis (policy/recommendation; no empirical sample reported).

high positive The Competence Shadow: Theory and Bounds of AI Assistance in... governance_and_regulation

The paper derives non-degradation conditions that characterize shadow-resistant workflows for AI-assisted safety analysis.

Analytic derivations and formal criteria presented in the paper (theoretical result; no empirical validation/sample size reported).

high positive The Competence Shadow: Theory and Bounds of AI Assistance in... output_quality

The paper formalizes four canonical human–AI collaboration structures and derives closed-form performance bounds for them.

Theoretical/mathematical derivations and models in the paper (no empirical verification/sample size reported).

high positive The Competence Shadow: Theory and Bounds of AI Assistance in... task_allocation

A five-dimensional competence framework captures safety competence via domain knowledge, standards expertise, operational experience, contextual understanding, and judgment.

Theoretical contribution: paper defines and formalizes a five-dimension framework (no empirical validation/sample size reported).

high positive The Competence Shadow: Theory and Bounds of AI Assistance in... skill_acquisition

Robustness tests confirm that the core conclusions about IRs improving urban energy resilience and the identified mechanisms/moderators are highly reliable.

Multiple robustness checks reported by the authors (unspecified in the abstract) applied to the DML estimates on the 280-city panel (2009–2023).

high positive Does the Application of Industrial Robots Enhance Urban Ener... robustness of estimated effects on urban energy resilience

Science expenditure (SE) positively moderates the promoting effect of IRs on urban energy resilience; the interaction term coefficient is significantly positive.

Moderation analysis reported in the paper using interaction terms between IRs and science expenditure in the DML framework on the 280-city panel (2009–2023); reported statistically significant positive interaction coefficient.

high positive Does the Application of Industrial Robots Enhance Urban Ener... urban energy resilience (moderation by science expenditure)

Environmental regulation (ER) positively moderates the promoting effect of IRs on urban energy resilience; the interaction term coefficient is significantly positive.

Moderation analysis reported in the paper using interaction terms between IRs and environmental regulation in the DML framework on the 280-city panel (2009–2023); reported statistically significant positive interaction coefficient.

high positive Does the Application of Industrial Robots Enhance Urban Ener... urban energy resilience (moderation by environmental regulation)

Green technology innovation is a main mediating path through which IRs improve urban energy resilience.

Mediation/transmission mechanism analysis reported in the paper based on the DML approach applied to the 280-city panel (2009–2023).

high positive Does the Application of Industrial Robots Enhance Urban Ener... urban energy resilience (mediated by green technology innovation)

Industrial structure upgrading is a main mediating path through which IRs improve urban energy resilience.

Mediation/transmission mechanism analysis reported in the paper based on the same DML framework and the 280-city panel (2009–2023).

high positive Does the Application of Industrial Robots Enhance Urban Ener... urban energy resilience (mediated by industrial structure upgrading)

Industrial robots (IRs) significantly promote the improvement of urban energy resilience (UER).

Empirical analysis using Double Machine Learning (DML) on a panel of 280 prefecture-level and above Chinese cities from 2009 to 2023; various robustness tests reported.

high positive Does the Application of Industrial Robots Enhance Urban Ener... urban energy resilience

To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available.

Statement in paper that testing protocols and materials are documented and released publicly (paper claims to provide materials).

high positive Evaluating Language Models for Harmful Manipulation availability of testing protocols and materials

We assess an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India).

Reported sample size and study design details stated in abstract: N = 10,101; three domains and three locales specified.

high positive Evaluating Language Models for Harmful Manipulation sample composition and scale of the empirical study

This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies.

Paper describes a proposed evaluation framework (methodological contribution); claimed in abstract/introduction as new contribution. No numeric sample required for the claim itself.

high positive Evaluating Language Models for Harmful Manipulation existence of an evaluation framework for harmful AI manipulation

The result is evidence-based triggers that replace calendar schedules and make governance auditable.

Claimed outcome of applying the decision-theoretic framework in the paper (argumentative; no empirical deployment or case-study evidence reported in the summary).

high positive Retraining as Approximate Bayesian Inference retraining trigger design and governance auditability

The paper provides a decision-theoretic framework for retraining policies.

Explicit claim about the paper's contribution; the article presents a framework (conceptual/methodological exposition).

high positive Retraining as Approximate Bayesian Inference existence of a prescriptive framework for retraining policies

The retraining decision is a cost minimization problem with a threshold that falls out of your loss function.

Decision-theoretic derivation presented in the paper (analytical/theoretical reasoning; no empirical validation reported).

high positive Retraining as Approximate Bayesian Inference formalization of retraining decision rule (cost-minimization/threshold)
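
As a minimal sketch of how a threshold can fall out of a loss function, assume a two-action loss: retraining always costs `retrain_cost`, while keeping the current model costs `stale_cost` only if drift has occurred. These names and the two-action loss are illustrative assumptions, not the paper's formulation.

```python
def retrain_threshold(retrain_cost: float, stale_cost: float) -> float:
    """Drift probability above which retraining has the lower expected cost.

    Assumed loss (hypothetical): retraining costs `retrain_cost` regardless of
    drift; keeping the model costs `stale_cost` only when drift has occurred.
    """
    if stale_cost <= 0:
        raise ValueError("stale_cost must be positive")
    return min(1.0, retrain_cost / stale_cost)


def should_retrain(p_drift: float, retrain_cost: float, stale_cost: float) -> bool:
    """Compare the expected costs of the two actions and pick the cheaper one."""
    expected_cost_keep = p_drift * stale_cost
    expected_cost_retrain = retrain_cost
    return expected_cost_keep > expected_cost_retrain
```

Under this assumed loss, the trigger is the posterior drift probability crossing `retrain_cost / stale_cost`. The trigger then depends on evidence rather than on a calendar schedule.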

Retraining can be better understood as approximate Bayesian inference under computational constraints.

Theoretical argument and decision-theoretic framing presented in the paper (conceptual/mathematical derivation rather than empirical testing).

high positive Retraining as Approximate Bayesian Inference conceptual framing of retraining

The analysis was pre-registered and code and data are publicly available.

Authors' statement in the abstract/paper declaring pre-registration and public release of code and data.

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... research transparency (pre-registration and public code/data)

The meta-d' framework reveals which models 'know what they don't know' versus which merely appear well-calibrated due to criterion placement — a distinction with direct implications for model selection, deployment, and human-AI collaboration.

Interpretation and implications drawn from empirical results showing dissociations between calibration metrics and metacognitive measures (meta-d', M-ratio, criterion shifts); argument that this distinction informs practical decisions about model use.

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... distinction between true metacognitive capacity and apparent calibration driven ...

We applied this framework to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials.

Experimental methods reported in the paper listing the four model variants and total trial count (224,000 factual QA trials).

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... empirical evaluation of models' Type-1 and Type-2 metrics across factual QA tria...

We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio.

Methodological contribution described in the paper: specification of a Type-2 SDT framework and use of meta-d' and M-ratio as measurement constructs.

high positive Do LLMs Know What They Know? Measuring Metacognitive Efficie... decomposition of Type-1 vs Type-2 capacities using meta-d' and M-ratio
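
The two quantities named above can be related with a short sketch. It computes the standard Type-1 sensitivity d' from hit and false-alarm rates, and the M-ratio as meta-d' divided by d'. Estimating meta-d' itself requires fitting a Type-2 SDT model to confidence ratings, which is not shown; here it is taken as a given input. Function names are hypothetical.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Type-1 sensitivity: z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


def m_ratio(meta_d: float, d: float) -> float:
    """Metacognitive efficiency: meta-d' expressed as a fraction of d'.

    `meta_d` is assumed to come from a separate Type-2 model fit (not shown).
    """
    if d == 0:
        raise ValueError("d' is zero; M-ratio is undefined")
    return meta_d / d
```

An M-ratio near 1 means confidence judgments use about as much information as the Type-1 decisions do; values below 1 indicate metacognitive inefficiency.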

The best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search.

Analysis comparing the origins of the best final designs with their ILP rankings, reported across the 12-kernel benchmark set.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... origin/ranking of best designs relative to ILP candidates

Larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×.

Per-benchmark empirical results reported for streamcluster and kmeans in the evaluation.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... execution/performance speedup for specific benchmarks

Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline.

Empirical evaluation across the reported benchmark set comparing performance with 1 agent versus 10 agents; mean speedup stated in the results.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... execution/performance speedup relative to baseline

We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS.

Experimental setup described in the paper reporting evaluation on 12 kernels drawn from HLS-Eval and Rodinia-HLS, using Claude Code (Opus 4.5/4.6) and AMD Vitis HLS.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... evaluation dataset and toolchain used

In Stage 2, the pipeline launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition.

Method section describing Stage 2 which runs multiple expert agents exploring cross-function optimizations on top ILP solutions.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... description of Stage 2 expert-agent exploration of cross-function optimizations

In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint.

Method section describing Stage 1 decomposition, per-sub-kernel optimization and ILP assembly under an area constraint.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... description of Stage 1 decomposition and ILP-based assembly
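
The assembly step in Stage 1 can be illustrated with a small sketch. It is a brute-force stand-in for an ILP solver, and it assumes that each sub-kernel offers a few candidate configurations with a known latency and area, and that latency and area add up across sub-kernels. The additive cost model, the function name, and the data layout are assumptions for illustration, not the paper's formulation.

```python
from itertools import product


def best_assembly(candidates, area_budget):
    """Pick one (latency, area) option per sub-kernel, minimizing total latency.

    `candidates` is a list with one entry per sub-kernel; each entry is a list
    of (latency, area) tuples. Returns (latency, area, choice_indices), or None
    if no combination fits within `area_budget`.
    """
    best = None
    # Enumerate every combination of one option index per sub-kernel.
    for choice in product(*(range(len(opts)) for opts in candidates)):
        latency = sum(candidates[k][i][0] for k, i in enumerate(choice))
        area = sum(candidates[k][i][1] for k, i in enumerate(choice))
        if area <= area_budget and (best is None or latency < best[0]):
            best = (latency, area, choice)
    return best


# Two sub-kernels, each with a small/slow and a large/fast option.
options = [[(100, 10), (40, 30)], [(80, 5), (30, 25)]]
print(best_assembly(options, area_budget=40))
```

Exhaustive enumeration only works for a handful of sub-kernels; an ILP solver handles the same one-choice-per-group constraint at scale.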

We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents.

Method description in the paper describing the design and implementation of the two-stage 'agent factory' pipeline.

high positive Agent Factories for High Level Synthesis: How Far Can Genera... existence and design of the two-stage agent factory pipeline

Deployment validation across 43 classrooms demonstrated an 18x efficiency gain in the assessment workflow.

Field deployment described in the paper: system was validated across 43 classrooms and an efficiency gain of 18x in the assessment workflow is reported.

high positive When AI Meets Early Childhood Education: Large Language Mode... efficiency of the assessment workflow (time/resources per assessment)

Interaction2Eval achieves up to 88% agreement with human expert judgments.

Reported evaluation results comparing Interaction2Eval outputs to human expert annotations (rubric-based judgments) on the dataset.

high positive When AI Meets Early Childhood Education: Large Language Mode... agreement between AI-generated assessments and human expert judgments

Interaction2Eval, an LLM-based framework, addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, rubric-based reasoning).

Methodological description in the paper: a specialized LLM-based pipeline designed to handle listed domain challenges; presented as the approach used to extract structured quality indicators.

high positive When AI Meets Early Childhood Education: Large Language Mode... capability to handle domain-specific technical challenges in automated assessmen...

TEPE-TCI-370h is the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations.

Authors' dataset construction and description: 370 hours of recorded interactions from 105 classrooms, annotated with ECQRS-EC and SSTEW rubrics as reported in the paper.

high positive When AI Meets Early Childhood Education: Large Language Mode... availability of a large-scale annotated dataset for preschool teacher-child inte...

The dataset provides a reproducible and scalable foundation for research on technological diffusion, regional digitalisation, and industry-level transformation, and can be readily extended to future years or adapted to other countries.

Text asserts reproducibility, scalability, and extendability of the dataset and methods for future years and other countries.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

By providing indicators for two benchmark years, the dataset supports the study of how AI adoption evolves across the Spanish business landscape.

Text highlights the availability of indicators for 2023 and 2025 and claims this supports temporal study of adoption evolution.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

This multi-dimensional structure enables users to explore territorial patterns, sectoral differences, and size-related disparities in the uptake of AI.

Text claims that the dataset's dimensions make it possible to explore spatial (territorial), sectoral, and size-related patterns in AI uptake.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

For each province–sector–size combination, the dataset reports whether firms adopt AI, whether they apply it internally, whether it is embedded in their offerings, and how many firms have valid website content.

Text explicitly lists the reported indicators at the province–sector–size aggregation level (adoption, internal use, embedded in offerings, count of valid website content).

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The dataset offers a detailed portrait of AI adoption across regions (NUTS 3), industries, and firm size categories.

Text claims multi-dimensional reporting by region (NUTS 3), industry, and firm size categories in the dataset.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The pipeline identifies explicit evidence of AI use both in firms' internal processes and embedded in their products or services.

Text states the structured rubric is used to identify explicit evidence of AI use in internal processes and in products/services.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The paper uses a systemic pipeline based on large language models (LLMs) to segment website text, semantically filter it, and evaluate it with a structured rubric.

Text describes methodological pipeline components (LLM-based segmentation, semantic filtering, structured rubric evaluation).

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... other

The dataset results in 225,628 firm-year observations.

Text explicitly reports 225,628 firm-year observations derived from the dataset across the two benchmark years.

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

The paper introduces a nationwide dataset that maps how 112,814 Spanish firms communicate and implement artificial intelligence (AI) on their corporate websites in 2023 and 2025.

Text states dataset coverage and firm count (112,814 firms) and benchmark years (2023 and 2025).

high positive AI adoption in Spain (2023–2025): A web-derived dataset base... adoption_rate

These results provide a mechanistic account of how humans adapt their trust in AI confidence signals through experience.

Combined behavioral evidence (N = 200) and computational modeling (LLO + Rescorla–Wagner) presented in the paper.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... mechanistic explanation of trust adaptation to AI confidence signals

The model indicates that humans adapt by updating two components: baseline trust and confidence sensitivity, and they use asymmetric learning rates that prioritize the most informative errors.

Parameter recovery / model-fitting results reported in the paper showing updates to baseline trust and sensitivity parameters and asymmetric learning-rate estimates.

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... latent learning parameters (baseline trust, confidence sensitivity, asymmetric l...

A computational model using a linear-in-log-odds (LLO) transformation combined with a Rescorla–Wagner learning rule explains the observed learning dynamics.

Modeling analysis reported in the paper fitting an LLO + Rescorla–Wagner model to participants' behavioral data (N = 200).

high positive Learning to Trust: How Humans Mentally Recalibrate AI Confid... model fit to behavioral learning dynamics
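
To make the model structure concrete, here is a minimal sketch. It assumes the LLO transformation takes the form logit(perceived) = sensitivity × logit(stated) + baseline, and that both parameters are updated by a delta rule in the Rescorla–Wagner style, with separate learning rates for positive and negative prediction errors. The exact parameterization, the update for the sensitivity term, and the learning-rate values are assumptions, not the paper's fitted model.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))


def perceived_probability(stated: float, baseline: float, sensitivity: float) -> float:
    """Linear-in-log-odds mapping from the AI's stated confidence (0 < stated < 1)."""
    return sigmoid(sensitivity * logit(stated) + baseline)


def update(stated, correct, baseline, sensitivity, lr_pos=0.1, lr_neg=0.3):
    """One delta-rule step after observing whether the AI was correct.

    lr_pos and lr_neg are hypothetical asymmetric learning rates, applied to
    positive and negative prediction errors respectively.
    """
    outcome = 1.0 if correct else 0.0
    error = outcome - perceived_probability(stated, baseline, sensitivity)
    lr = lr_pos if error >= 0 else lr_neg
    baseline += lr * error                     # shifts overall trust
    sensitivity += lr * error * logit(stated)  # rescales reliance on confidence
    return baseline, sensitivity
```

With `baseline = 0` and `sensitivity = 1` the mapping leaves the stated confidence unchanged. A confident but wrong AI answer then lowers both parameters.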
