Evidence (8486 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	440	117	68	507	1148
Governance & Regulation	458	216	125	67	883
Research Productivity	270	101	34	303	713
Organizational Efficiency	441	105	76	43	669
Technology Adoption Rate	346	130	76	45	602
Firm Productivity	322	38	72	13	450
Output Quality	272	75	27	30	404
AI Safety & Ethics	122	188	46	27	385
Market Structure	119	134	86	14	358
Decision Quality	182	79	41	20	326
Fiscal & Macroeconomic	95	58	34	22	216
Employment Level	78	37	80	9	206
Skill Acquisition	102	37	41	9	189
Innovation Output	124	12	26	13	176
Firm Revenue	99	37	24	—	160
Consumer Welfare	77	38	37	7	159
Task Allocation	93	17	36	8	156
Inequality Measures	29	81	33	6	149
Regulatory Compliance	54	61	13	3	131
Task Completion Time	92	8	4	3	107
Error Rate	45	53	6	—	104
Worker Satisfaction	48	36	12	8	104
Training Effectiveness	59	13	12	16	101
Wages & Compensation	56	16	20	5	97
Team Performance	50	13	15	8	87
Automation Exposure	28	29	12	7	79
Job Displacement	7	45	13	—	65
Hiring & Recruitment	40	4	7	3	54
Developer Productivity	38	4	4	3	49
Social Protection	22	12	7	2	43
Creative Output	17	8	6	1	32
Skill Obsolescence	3	25	2	—	30
Labor Share of Income	12	7	10	—	29
Worker Turnover	10	12	—	3	25

Topics about AI identity, consciousness, and memory comprised 9.7% of topical niches but attracted 20.1% of posting volume, indicating disproportionate attention to introspection.

Topic modeling identified topical niches and tagged self-referential themes (AI identity, consciousness, memory); the share of topical niches (9.7%) was then compared with the share of posting volume (20.1%) in the 23-day Moltbook dataset (47,241 agents; 361,605 posts).

high positive What Do AI Agents Talk About? Emergent Communication Structu... share (%) of topical niches vs share (%) of posting volume for self-referential ...
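
The "disproportionate attention" in the entry above can be expressed as a simple ratio of the two reported shares. The sketch below is only illustrative arithmetic on the figures quoted in the claim; the function name `overrepresentation` is hypothetical and does not come from the paper.

```python
def overrepresentation(volume_share_pct: float, niche_share_pct: float) -> float:
    """Ratio of posting-volume share to topical-niche share (1.0 = proportional)."""
    if niche_share_pct <= 0:
        raise ValueError("niche share must be positive")
    return volume_share_pct / niche_share_pct


# Shares as quoted in the claim: 20.1% of posting volume vs 9.7% of niches.
ratio = overrepresentation(20.1, 9.7)
print(f"over-representation ratio: {ratio:.2f}")  # about 2.07
```

A ratio near 2 means these themes drew roughly twice the posting volume that their share of niches alone would suggest.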

Moltbook activity over 23 days included 47,241 unique agents, 361,605 posts, and ~2.8 million comments.

Full dataset of Moltbook activity collected over a 23-day period; counts of unique agent IDs, posts, and comments as reported in the paper.

high positive What Do AI Agents Talk About? Emergent Communication Structu... counts of unique agents, posts, and comments

Practitioners adopt methodological adaptations — including adaptive/longitudinal designs, versioning/documentation, stratification/moderation analyses, robustness checks, mixed methods, deployment-stage monitoring, and pre-analysis plans — to mitigate validity threats.

Reported mitigation strategies aggregated from the 16 semi-structured interviews and described in the paper's 'Practitioner solutions' section.

high positive RCTs & Human Uplift Studies: Methodological Challenges and P... use and types of methodological adaptations employed by practitioners

A hybrid architecture where cross-domain integrators encapsulate complex subgraphs into well-structured “resource slices” reduces price volatility (approximately 70–75%) without losing throughput.

Ablation experiments comparing baseline decentralised market vs hybrid integrator architecture across simulation configurations (subset of the 1,620 runs, multiple random seeds per configuration). The paper reports ~70–75% reduction in measured price volatility metrics for hybrid vs non-hybrid cases while throughput remained statistically indistinguishable.

high positive Real-Time AI Service Economy: A Framework for Agentic Comput... percentage reduction in price volatility (~70–75%); system throughput (value/thr...

Agents detected up to 65% of vulnerabilities in some experimental settings.

Reported detection rate maxima from the study's experiments on certain model/scaffold/task combinations.

high positive Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... vulnerability_detection_rate (peak_value_reported = ~65%)

The authors constructed a contamination-free dataset of 22 real-world smart-contract security incidents that postdate every evaluated model's release.

Curation procedure described in the methods: 22 incidents selected to occur after all model release dates to prevent leakage.

high positive Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... contamination_free_dataset_size (22 incidents)

This study expanded the evaluation matrix to 26 agent configurations spanning four model families and three scaffolding approaches.

Methods reported in this study specifying 26 agent configurations, four model families, and three scaffolds.

high positive Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... evaluation_matrix_size (agent_configurations; model_families; scaffolds)

EVMbench (OpenAI, Paradigm, OtterSec) reported agents detecting up to 45.6% of vulnerabilities and achieving exploitation on 72.2% of a curated subset.

Reported metrics from the original EVMbench paper/benchmark (as summarized in this study).

high positive Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contra... vulnerability_detection_rate; exploitation_success_rate (on curated subset)

Under NFD, agents are initialized with minimal scaffolding and grown through structured conversational interaction with domain practitioners, with the Knowledge Crystallization Cycle consolidating tacit dialogue into structured, reusable knowledge assets.

Architectural specification and operational formalism in the paper; supported by a detailed case study (iterative co-development with financial analysts, logged interaction transcripts and produced artifacts). Sample size for the case study is not specified.

high positive Nurture-First Agent Development: Building Domain-Expert AI A... amount and structure of crystallized knowledge/assets produced from interactions

Label changes across rounds concentrate on statements judged ambiguous, making statement ambiguity the main driver of label changes.

Participants provided a labeling rationale and self-reported uncertainty for each of the 30 statements per round; analyses showed higher change rates for statements with higher self-reported uncertainty or ambiguous wording.

high positive Exploring Indicators of Developers' Sentiment Perceptions in... frequency of label changes per statement and its association with self-reported ...

The penalized framework induces centroid estimation and dataset-specific shrinkage whose strength is controlled by a penalty parameter, enabling tunable information sharing.

Method formulation in the paper: penalized likelihood with KL term; derivation showing centroid estimated from pooled datasets and penalty parameter governing shrinkage magnitude; discussion of tuning.

high positive Redefining shared information: a heterogeneity-adaptive fram... centroid estimate and degree of shrinkage (dependence on penalty parameter)

The KL-penalized estimators achieve provably lower mean squared error (MSE) than dataset-specific maximum likelihood estimators.

Non-asymptotic and/or asymptotic analyses provided in the paper that compare MSE of KL-penalized estimators to MLEs (mathematical proofs/sketches in theoretical section).

high positive Redefining shared information: a heterogeneity-adaptive fram... mean squared error of parameter estimates (MSE)

The KL-based shrinkage estimators adapt to the true degree of shared information across datasets (i.e., they automatically perform partial pooling when appropriate).

Theoretical characterization of the estimator's dependence on the penalty strength and centroid, plus simulation studies varying degree/structure of heterogeneity to show adaptive behavior.

high positive Redefining shared information: a heterogeneity-adaptive fram... amount of shrinkage / effective pooling as a function of heterogeneity (adaptive...

A KL-divergence penalty that shrinks dataset-specific distributions toward a learned centroid yields simple closed-form estimators for linear models.

Methodological development in the paper: formulation of a penalized likelihood/objective using KL divergence; algebraic derivations producing closed-form solutions for the centroid and shrunken dataset estimates (closed forms presented in the paper).

high positive Redefining shared information: a heterogeneity-adaptive fram... analytic form of the estimator (existence of closed-form solutions for centroid ...
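
The four entries above describe shrinkage toward a learned centroid with strength set by a penalty parameter. As a minimal sketch of that idea, assume a toy model that is not taken from the paper: dataset k holds n_k observations from a Gaussian with unknown mean and a common known variance, so the KL divergence between a dataset-specific distribution and the centroid reduces to a squared difference of means. Under that assumption the penalized objective has the closed form coded below; the function name `kl_shrinkage_means` and the parameter `lam` are hypothetical.

```python
import numpy as np


def kl_shrinkage_means(xbar, n, lam):
    """Shrink per-dataset sample means toward a learned centroid.

    Toy assumption: dataset k has n[k] draws from N(theta_k, s2) with shared,
    known s2, so KL(N(theta_k, s2) || N(mu, s2)) = (theta_k - mu)^2 / (2 * s2).
    Minimizing sum_k [n_k (xbar_k - theta_k)^2 + lam (theta_k - mu)^2]
    over theta and mu gives the closed forms below (for lam > 0).
    """
    xbar = np.asarray(xbar, dtype=float)
    n = np.asarray(n, dtype=float)
    w = n / (n + lam)                      # weight kept on each dataset's own mean
    mu = np.sum(w * xbar) / np.sum(w)      # centroid: w-weighted average of the means
    theta = w * xbar + (1.0 - w) * mu      # shrunken dataset-specific estimates
    return mu, theta


mu, theta = kl_shrinkage_means(xbar=[1.0, 3.0], n=[10, 10], lam=10.0)
print(mu, theta)
```

With `lam = 0` each estimate equals its own sample mean (no pooling); as `lam` grows, every estimate approaches the pooled mean, which is the tunable information sharing the entries refer to.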

The learned adaptive policy outperformed a fixed-wrench baseline by an average of 10.9% across five material setups.

Empirical evaluation: comparison between learned adaptive policy and a fixed-wrench policy on five different material setups; the paper reports an average improvement of ~10.9% (the exact performance metric formulation and per-setup statistics are not provided in the summary).

high positive Learning Adaptive Force Control for Contact-Rich Sample Scra... aggregate task performance (reported as average percent improvement over baselin...

Integrating AI (notably ML and NLP) meaningfully automates routine software engineering tasks across requirements management, code generation, testing, and maintenance.

Systematic literature review of prior AI-for-SE work combined with an empirical survey of software engineering professionals reporting usage and examples of tool-supported automation; sample size for the survey not specified in the summary.

high positive Artificial Intelligence as a Catalyst for Innovation in Soft... degree of task automation (e.g., frequency or share of routine tasks automated)

Coordination-Risk Cues—task-conditioned priors on disagreement/tie rates—capture coordination difficulty across tasks.

Method description: disagreement/tie rates computed per cluster from pairwise preference comparisons to form priors indicating coordination risk. Data source: Chatbot Arena pairwise comparisons; tie/disagreement rate computation described but numeric values not provided here.

high positive Task-Aware Delegation Cues for LLM Agents tie/disagreement rate per task cluster (coordination difficulty prior)

Capability Profiles—task-conditioned win-rate maps—can be computed per cluster to summarize agent strengths.

Method description: win-rate maps derived by computing agent win rates conditional on task clusters from the Chatbot Arena pairwise comparisons. Implementation reported in paper; no numeric summary of win-rate differences provided here.

high positive Task-Aware Delegation Cues for LLM Agents agent win-rate per task cluster
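
The two cue types above (per-cluster win rates and per-cluster tie rates) can be sketched from pairwise comparison records. The record layout `(cluster, model_a, model_b, outcome)` and the function name `cluster_cues` are assumptions for illustration; the summary does not give the paper's actual data schema or how it treats ties when computing win rates.

```python
from collections import defaultdict


def cluster_cues(comparisons):
    """Compute per-cluster win rates and tie rates from pairwise comparisons.

    comparisons: iterable of (cluster, model_a, model_b, outcome),
    where outcome is "a", "b", or "tie" (hypothetical schema).
    """
    wins = defaultdict(lambda: defaultdict(int))
    games = defaultdict(lambda: defaultdict(int))
    ties = defaultdict(int)
    totals = defaultdict(int)
    for cluster, a, b, outcome in comparisons:
        totals[cluster] += 1
        games[cluster][a] += 1
        games[cluster][b] += 1
        if outcome == "tie":
            ties[cluster] += 1
        elif outcome == "a":
            wins[cluster][a] += 1
        elif outcome == "b":
            wins[cluster][b] += 1
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
    # Capability profile: wins divided by comparisons played (ties count as non-wins).
    capability = {
        c: {m: wins[c][m] / g for m, g in models.items()}
        for c, models in games.items()
    }
    # Coordination-risk cue: share of comparisons in the cluster that ended in a tie.
    tie_rate = {c: ties[c] / totals[c] for c in totals}
    return capability, tie_rate


records = [
    ("code", "m1", "m2", "a"),
    ("code", "m1", "m2", "tie"),
    ("math", "m1", "m2", "b"),
]
capability, tie_rate = cluster_cues(records)
print(capability, tie_rate)
```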

Semantic clustering on Chatbot Arena pairwise comparisons induces an interpretable task taxonomy (taxonomy induction).

Methodological claim: authors applied semantic clustering to tasks/queries from Chatbot Arena pairwise preference data to produce clusters described as interpretable. Data source: Chatbot Arena pairwise comparisons; specific clustering algorithm and hyperparameters not specified here.

high positive Task-Aware Delegation Cues for LLM Agents interpretable task clusters (taxonomy)

A speculative WikiRAT instantiation on Wikipedia illustrates RATs' design and potential uses.

The paper presents WikiRAT as a speculative prototype/illustration; no large-scale deployment or user study of WikiRAT is reported.

high positive Chasing RATs: Tracing Reading for and as Creative Activity existence of a prototype illustration (WikiRAT)

RATs record sequences of interaction: traversal (what is read and in what order), association (links and connections the reader forms), and reflection (annotations, notes, time spent), producing inspectable, shareable trajectories.

Design specification within the paper and description of data types RATs would collect (ordered page/navigation logs, hyperlinks followed, time-on-page, annotations, saved excerpts, tags, notes). This is a definitional claim about the proposed system rather than empirical measurement.

high positive Chasing RATs: Tracing Reading for and as Creative Activity captured interaction traces (traversal, association, reflection) as data

An autoencoder-based ODE emulator that maps parameter values to latent trajectories can flexibly generate different solution paths conditioned on parameters.

Architecture and experiments: authors present a novel encoder/decoder ODE emulator that learns latent representation of trajectories and maps parameter vectors to latent trajectories; empirical examples provided (details not in summary).

high positive MCMC Informed Neural Emulators for Uncertainty Quantificatio... ability to reconstruct/generate ODE solution trajectories conditioned on paramet...

A quantile emulator trained conditional on MCMC parameter draws can produce conditional quantile predictions without training a Bayesian neural network.

Method and empirical demonstration: paper describes and implements a quantile emulator (network trained to predict conditional quantiles across parameter draws).

high positive MCMC Informed Neural Emulators for Uncertainty Quantificatio... accuracy of predicted conditional quantiles
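
A deterministic network can be trained to predict conditional quantiles by minimizing the pinball (quantile) loss. Whether the paper uses exactly this loss is an assumption; it is the standard choice for quantile regression. The sketch below defines the loss and checks, on a constant predictor, that its minimizer sits at the requested empirical quantile.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


# Constant-predictor check on the values 1..100: the loss at q = 0.9
# should be smallest near the 90th percentile of the data.
y = np.arange(1, 101, dtype=float)
losses = [pinball_loss(y, np.full_like(y, c), 0.9) for c in y]
best_constant = y[int(np.argmin(losses))]
print(best_constant)
```

In an emulator setting, `y_pred` would be the network output for a given input and parameter draw rather than a constant.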

The method is architecture-agnostic: uncertainty handling via parameter samples allows use of any deterministic neural-network architecture (e.g., quantile regressors, autoencoders) without specialized Bayesian layers.

Conceptual argument and demonstrations: authors implement a quantile emulator and an autoencoder-based ODE emulator as examples, showing the same uncertainty treatment applies to different network types.

high positive MCMC Informed Neural Emulators for Uncertainty Quantificatio... applicability across network architectures (demonstrated via example implementat...

By sampling training parameter vectors from a calibrated posterior (via MCMC), the surrogate avoids training on unphysical or implausible parameter configurations.

Design choice described in methods: MCMC sampling is used to draw parameter samples from the model-parameter distribution/posterior, thereby focusing training data on plausible regions; no experiments provided here quantify frequency of unphysical samples under alternative schemes.

high positive MCMC Informed Neural Emulators for Uncertainty Quantificatio... proportion of training samples that fall in implausible/unphysical parameter reg...

Dataset and code (CFD, CFM, CFR) are publicly released.

Repository link provided in the summary (https://github.com/ZhengyaoFang/CFM) and paper states public release of dataset and code.

high positive Too Vivid to Be Real? Benchmarking and Calibrating Generativ... public availability of dataset and code

The Color Fidelity Dataset (CFD) is a large-scale dataset of over 1.3 million images containing both real photographs and synthetic T2I outputs, organized with ordered levels of color realism to support objective evaluation.

Dataset construction described in paper and repository: size stated as >1.3M images; contains a mixture of real photos and synthetic images annotated/organized with ordered realism labels enabling relative judgments of color fidelity.

high positive Too Vivid to Be Real? Benchmarking and Calibrating Generativ... dataset size and composition; presence of ordered color-realism labels enabling ...

The surrogate loop (build/update GP → select acquisition target → inner optimization → propose evaluation → evaluate with true model → update surrogate) can be parameterized so that inner objective and acquisition encode whether one seeks minima, saddles, or double-ended transitions.

Detailed methodological description in the paper of the six-step loop and how inner objectives/acquisition are changed to represent different search tasks; supported by example implementations in code.

high positive Bayesian Optimization with Gaussian Processes to Accelerate ... flexibility of the surrogate loop to represent multiple search objectives (quali...

The accompanying Rust code implements the same six-step surrogate loop across all applications, demonstrating practical reproducibility of the framework.

Authors state that pedagogical Rust code is provided showing the exact same loop running all applications; code repository accompanies the paper.

high positive Bayesian Optimization with Gaussian Processes to Accelerate ... availability and content of provided implementation (existence of code that runs...

An adaptive trust radius constrains surrogate-guided steps to regions where the surrogate is reliable (trust-region control).

Methodological description of adaptive trust-radius control in the surrogate loop; used in experiments demonstrating improved reliability of steps proposed by the surrogate.

high positive Bayesian Optimization with Gaussian Processes to Accelerate ... step sizes accepted by surrogate-guided proposals and resulting reliability (ste...

Acquisition criteria (active learning) drive which points are evaluated next; different acquisition functions implement the different search tasks (minimization, single-point saddles, double-ended searches).

Method section describing task-specific acquisition functions and their role in selecting evaluation points; implemented in the Rust code and used in experiments reported in the paper.

high positive Bayesian Optimization with Gaussian Processes to Accelerate ... selection of next-evaluation points and resulting search efficiency (algorithmic...

A unified Bayesian optimization framework—implemented as a six-step surrogate loop—handles minimization, single-point saddle searches, and double-ended saddle searches by changing only the inner optimization target and acquisition criterion.

Methodological description in the paper: presentation of a six-step surrogate loop (build/update GP → select acquisition target → inner optimization on surrogate → propose evaluation points → evaluate with true model → update surrogate) parameterized so inner objective and acquisition encode different tasks; accompanied by pedagogical Rust code implementing the same loop for all tasks.

high positive Bayesian Optimization with Gaussian Processes to Accelerate ... ability to run minimization and saddle-search algorithms within a single surroga...
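
The six-step loop described in the entries above can be sketched for the minimization case only. This is a Python illustration, not the paper's Rust code, and every concrete choice in it is an assumption: the RBF kernel and its length scale, the lower-confidence-bound acquisition, the grid-based inner optimizer, and the trust-radius update rule. Saddle searches would need a different inner objective and acquisition, which are not shown.

```python
import numpy as np


def rbf(a, b, length=0.5):
    """Squared-exponential kernel between two 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)


def gp_posterior(x_train, y_train, x_query, jitter=1e-6):
    """GP posterior mean and standard deviation with a constant prior mean."""
    K = rbf(x_train, x_train) + jitter * np.eye(len(x_train))
    Ks = rbf(x_query, x_train)
    prior_mean = y_train.mean()
    mean = prior_mean + Ks @ np.linalg.solve(K, y_train - prior_mean)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def surrogate_minimize(f, x_init, n_iter=15, radius=0.5, kappa=2.0):
    """Minimize a 1-D function f with a GP surrogate and a trust radius."""
    x_train = np.array(x_init, dtype=float)
    y_train = np.array([f(x) for x in x_train], dtype=float)
    for _ in range(n_iter):
        best = x_train[np.argmin(y_train)]
        # Steps 1-2: build the GP and pick the acquisition (lower confidence bound),
        # restricted to candidates inside the trust region around the best point.
        cand = np.linspace(best - radius, best + radius, 201)
        mean, std = gp_posterior(x_train, y_train, cand)
        acq = mean - kappa * std
        # Steps 3-4: inner optimization on the surrogate, then propose a point.
        x_new = cand[np.argmin(acq)]
        if np.min(np.abs(x_train - x_new)) < 1e-9:
            radius *= 0.5  # proposal already evaluated: tighten and retry
            continue
        # Step 5: evaluate the true model.
        y_new = f(x_new)
        # Trust-radius control: grow after an improvement, shrink otherwise.
        if y_new < y_train.min():
            radius = min(radius * 1.5, 2.0)
        else:
            radius = max(radius * 0.5, 1e-3)
        # Step 6: update the surrogate's training data.
        x_train = np.append(x_train, x_new)
        y_train = np.append(y_train, y_new)
    i = int(np.argmin(y_train))
    return float(x_train[i]), float(y_train[i])


x_best, y_best = surrogate_minimize(lambda x: (x - 1.0) ** 2, [-1.0, 0.0, 2.5])
print(x_best, y_best)
```

Only `acq` and the inner search would change between tasks in this layout, which mirrors the entries' point that the loop itself stays fixed.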

The set of loss functions for which classical evaluation is possible includes expectation-based losses, kernel/MMD-like objectives, and other standard generative-model criteria (a broad loss-function scope).

Theoretical coverage and examples in the paper enumerating loss families (expectations, MMD, certain divergences) and showing how the classical-approximation results apply to each. The claim is supported by derivations and examples provided in the text.

high positive Universality of Classically Trainable, Quantum-Deployed Boso... scope of loss functions for which classical evaluation/approximation is feasible

A wide class of loss functions (including expectation-based losses and kernel/MMD-style objectives) and their gradients can be evaluated or efficiently approximated on a classical computer for BSBMs using recent classical-approximation results for expectation values in linear optics.

Theoretical argument in the paper leveraging recent classical-approximation results for expectation values in linear optics; covers expectation-based losses and kernel/MMD-like divergences and provides constructions/complexity statements showing efficient classical evaluation/approximation of these losses and, in many cases, their gradients. (The claim is based on proofs/derivations rather than empirical data.)

high positive Universality of Classically Trainable, Quantum-Deployed Boso... classical computability/approximation of loss values and gradients (time/complex...

PRF design decomposes into two independent dimensions: feedback source (where feedback text comes from) and feedback model (how that feedback is used to refine the query).

Paper's conceptual framing and controlled experiments that isolate and vary these two factors independently.

high positive A Systematic Study of Pseudo-Relevance Feedback with LLMs PRF design components (feedback source vs. feedback model)

The paper proposes specific operational and market recommendations: firms should invest in middleware and co-design partnerships; policymakers should fund shared QCSC infrastructure and workforce programs; researchers should prioritize interoperable middleware, scheduling models, and economic experiments on access-pricing.

Explicit recommendations section synthesizing prior architectural and economic analysis; prescriptive assertions based on conceptual arguments rather than experimental validation.

high positive Reference Architecture of a Quantum-Centric Supercomputer adoption of recommended investments/policies and their effect on access, standar...

Middleware standardization and interoperable APIs reduce switching costs and foster competition; lack of standards risks vendor lock-in and higher long-run costs.

Economic and systems-design argument drawing on well-understood effects of standardization in software ecosystems; no empirical QCSC-standardization case studies provided.

high positive Reference Architecture of a Quantum-Centric Supercomputer switching costs, level of competition, interoperability across QCSC offerings

QCSC reference architecture elements — e.g., QPU integration patterns, low-latency interconnects, orchestration and scheduling middleware, unified programming environments, data staging strategies — are required components to address current friction.

System decomposition and interface requirements derived from use-case analysis; proposed architecture components listed and motivated; no experimental validation.

high positive Reference Architecture of a Quantum-Centric Supercomputer presence/absence of specific architecture components and their theorized effect ...

The GNN provides greater stability (robustness over time and across conditions) than the MLP, with marked gains at low elevation angles where propagation is most variable.

Evaluation metrics in the experiments included stability/robustness over time and across elevation-angle conditions; reported performance shows larger relative gains for the GNN at low elevation angles.

high positive Federated Learning-driven Beam Management in LEO 6G Non-Terr... stability/robustness of beam predictions across time and elevation angles (espec...

A Graph Neural Network (GNN) model significantly outperforms a Multi-Layer Perceptron (MLP) baseline in beam prediction accuracy.

Supervised comparison reported in the paper between an MLP baseline and a GNN on realistic channel and beamforming data, evaluated with beam prediction accuracy metrics.

high positive Federated Learning-driven Beam Management in LEO 6G Non-Terr... beam prediction accuracy

A strictly non-reciprocal interaction bias (directional/asymmetric effects between competitors) is necessary to suppress local fluctuations and produce a robust absorbing (permanent monopoly) state.

Theoretical analysis of absorbing states and stability conditions in the model, with supporting numerical simulations comparing symmetric versus non-reciprocal interaction rules (simulation counts unspecified). Results are internal to the model framework.

high positive Macroscopic Dominance from Microscopic Extremes: Symmetry Br... existence/probability of an absorbing (stable monopoly) state under symmetric vs...

Early advantage in discovering resources (transient superiority) is governed by extreme-value statistics of first-passage times: rare, fast discoveries determine which population gets early footholds.

Analytic derivation applying extreme-value theory to first-passage times in the paper's stochastic, spatially-structured population model; supported by numerical simulations of stochastic realizations (simulation details unspecified). This is a theoretical/computational result (no empirical data).

high positive Macroscopic Dominance from Microscopic Extremes: Symmetry Br... probability distribution of earliest discovery / identity of population achievin...

Weighted-FSD provides a tunable knob to encode risk aversion/preferences by selecting quantile-weighting functions.

Theoretical correspondence between quantile weights and risk measures (SRMs) described in the paper; conceptual demonstration that different weightings produce different risk profiles.

high positive Safe RLHF Beyond Expectation: Stochastic Dominance for Unive... risk profile as measured by SRMs or weighted quantile-based metrics

Introducing quantile-weighted FSD (weighted-FSD) provably controls broad classes of Spectral Risk Measures (SRMs): improving weighted-FSD implies guaranteed improvements in the associated SRM.

Formal theoretical result/proof presented in the paper linking weighted quantile dominance to monotonic improvement in corresponding SRMs.

high positive Safe RLHF Beyond Expectation: Stochastic Dominance for Unive... Spectral Risk Measures (SRMs) computed from cost distributions

RAD operationalizes FSD by comparing the learned policy’s empirical rollout cost distribution to a reference policy’s distribution using Optimal Transport (OT) with entropic regularization and Sinkhorn iterations.

Methodological description in the paper: entropically regularized OT objective and Sinkhorn iterations used to compare empirical distributions and produce a differentiable loss.

high positive Safe RLHF Beyond Expectation: Stochastic Dominance for Unive... computable alignment loss (OT-based distance), differentiability of training obj...
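
Entropically regularized OT between two empirical cost samples can be sketched with plain Sinkhorn iterations. The version below assumes uniform sample weights and a squared-difference ground cost, and it uses NumPy rather than an autodiff framework, so it illustrates the computation but is not itself a differentiable training loss. The function name and the defaults `eps` and `n_iter` are hypothetical.

```python
import numpy as np


def sinkhorn_cost(x, y, eps=0.1, n_iter=200):
    """Entropic OT transport cost between two 1-D samples with uniform weights."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.full(len(x), 1.0 / len(x))        # source marginal
    b = np.full(len(y), 1.0 / len(y))        # target marginal
    C = (x[:, None] - y[None, :]) ** 2       # pairwise ground cost
    K = np.exp(-C / eps)                     # Gibbs kernel (can underflow for tiny eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                      # rescale rows toward marginal a
        v = b / (K.T @ u)                    # rescale columns toward marginal b
    P = u[:, None] * K * v[None, :]          # transport plan
    return float(np.sum(P * C)), P


policy_costs = np.array([0.0, 0.5, 1.0])
same_cost, _ = sinkhorn_cost(policy_costs, policy_costs)
shifted_cost, plan = sinkhorn_cost(policy_costs, policy_costs + 0.5)
print(same_cost, shifted_cost)
```

For small `eps` or large cost ranges, a log-domain implementation is the usual way to avoid underflow in `K`.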

First-Order Stochastic Dominance (FSD) constraints compare whole cost distributions and directly constrain tails, offering stronger guarantees against high-cost (unsafe) outcomes than expected-cost constraints.

Theoretical property of FSD described in the paper; formal argument that FSD constrains the full distribution (CDF) rather than only its mean.

high positive Safe RLHF Beyond Expectation: Stochastic Dominance for Unive... cost distribution (CDF/tails), probability mass in high-cost region
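
The link between FSD and spectral risk measures in the entries above can be illustrated on empirical samples. The sketch assumes costs where lower is better, so a policy dominates a reference when its cost quantile is no larger at every level, and it takes a spectral risk measure to be a weighted average of quantiles with nonnegative weights. The quantile grids, the CVaR-style weight function, and all function names are illustrative assumptions.

```python
import numpy as np


def fsd_dominates(policy_costs, ref_costs, n_levels=99):
    """True if the policy's cost quantiles are <= the reference's at every level."""
    levels = np.linspace(0.01, 0.99, n_levels)
    q_policy = np.quantile(np.asarray(policy_costs, dtype=float), levels)
    q_ref = np.quantile(np.asarray(ref_costs, dtype=float), levels)
    return bool(np.all(q_policy <= q_ref))


def spectral_risk(costs, weight_fn, n_levels=1000):
    """Weighted average of cost quantiles (midpoint rule over quantile levels)."""
    levels = (np.arange(n_levels) + 0.5) / n_levels
    w = weight_fn(levels)
    w = w / w.sum()                          # normalize weights to sum to 1
    return float(np.sum(w * np.quantile(np.asarray(costs, dtype=float), levels)))


def cvar_weights(alpha):
    """Weights that average only the worst (1 - alpha) tail of costs."""
    return lambda u: (u >= alpha).astype(float)


ref = [1.0, 2.0, 3.0, 4.0, 10.0]
policy = [0.5, 1.5, 2.5, 3.5, 5.0]
print(fsd_dominates(policy, ref))
print(spectral_risk(policy, cvar_weights(0.8)), spectral_risk(ref, cvar_weights(0.8)))
```

Because the weights are nonnegative, pointwise quantile dominance carries over to any such weighted average, which is the simplest case of the guarantee described above.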

Policy recommendations include subsidizing complementary investments (data governance, training) rather than technology-only incentives; encouraging standards and interoperability; and funding evaluation studies to measure distributional effects and long-run productivity impacts.

Authors' policy section proposing these interventions based on case findings and broader policy implications.

high positive Optimizing integrated supply planning in logistics: Bridging... adoption of ISP, reduction in switching costs, quality of evaluation evidence, d...

The authors propose a conceptual optimisation framework emphasizing three pillars: digital integration (tech stack & data), collaboration (processes & governance), and continuous improvement (metrics, feedback loops).

Paper presents a conceptual framework derived from cross-case findings; theoretical/conceptual contribution rather than empirical estimation.

high positive Optimizing integrated supply planning in logistics: Bridging... framework components (no direct empirical outcome; intended to improve ISP imple...

Explanations must be tailored to stakeholders (clinicians, regulators, customers) and integrated into decision processes to be useful (human-centered design principle).

Thematic coding of design and HCI literature within the review; draws on empirical studies and design guidance recommending stakeholder-specific explanation formats and integration into decision workflows.

high positive Explainable AI in High-Stakes Domains: Improving Trust, Tran... usefulness / effectiveness of explanations for different stakeholder groups

The forecasting model was deployed with a human-in-the-loop mechanism that triggers on critical forecast deviations.

Pilot description in the paper documenting integration of human-in-the-loop rules for critical deviations during pilot deployment (single-case deployment evidence).

high positive ALGORITHM FOR IMPLEMENTING AI IN THE MANAGEMENT LOOP OF SMES... presence and functioning of human-in-the-loop triggers for forecast deviations
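
A human-in-the-loop trigger of this kind can be sketched as a relative-deviation rule. The 20% threshold, the use of relative rather than absolute deviation, and the function name are all assumptions; the summary does not say how the pilot defined a critical deviation.

```python
def needs_human_review(forecast: float, actual: float, threshold: float = 0.2) -> bool:
    """Flag a forecast for review when its relative deviation exceeds the threshold."""
    if actual == 0:
        # Relative deviation is undefined; flag any nonzero forecast.
        return forecast != 0
    return abs(forecast - actual) / abs(actual) > threshold


print(needs_human_review(130.0, 100.0))  # 30% deviation exceeds the 20% threshold
print(needs_human_review(105.0, 100.0))  # 5% deviation stays below it
```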
