Evidence (8486 claims)

Claims by topic area:

| Topic | Claims |
|---|---|
| Adoption | 5821 |
| Productivity | 5033 |
| Governance | 4561 |
| Human-AI Collaboration | 3600 |
| Labor Markets | 2749 |
| Innovation | 2687 |
| Org Design | 2648 |
| Skills & Training | 2107 |
| Inequality | 1429 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 440 | 117 | 68 | 507 | 1148 |
| Governance & Regulation | 458 | 216 | 125 | 67 | 883 |
| Research Productivity | 270 | 101 | 34 | 303 | 713 |
| Organizational Efficiency | 441 | 105 | 76 | 43 | 669 |
| Technology Adoption Rate | 346 | 130 | 76 | 45 | 602 |
| Firm Productivity | 322 | 38 | 72 | 13 | 450 |
| Output Quality | 272 | 75 | 27 | 30 | 404 |
| AI Safety & Ethics | 122 | 188 | 46 | 27 | 385 |
| Market Structure | 119 | 134 | 86 | 14 | 358 |
| Decision Quality | 182 | 79 | 41 | 20 | 326 |
| Fiscal & Macroeconomic | 95 | 58 | 34 | 22 | 216 |
| Employment Level | 78 | 37 | 80 | 9 | 206 |
| Skill Acquisition | 102 | 37 | 41 | 9 | 189 |
| Innovation Output | 124 | 12 | 26 | 13 | 176 |
| Firm Revenue | 99 | 37 | 24 | — | 160 |
| Consumer Welfare | 77 | 38 | 37 | 7 | 159 |
| Task Allocation | 93 | 17 | 36 | 8 | 156 |
| Inequality Measures | 29 | 81 | 33 | 6 | 149 |
| Regulatory Compliance | 54 | 61 | 13 | 3 | 131 |
| Task Completion Time | 92 | 8 | 4 | 3 | 107 |
| Error Rate | 45 | 53 | 6 | — | 104 |
| Worker Satisfaction | 48 | 36 | 12 | 8 | 104 |
| Training Effectiveness | 59 | 13 | 12 | 16 | 101 |
| Wages & Compensation | 56 | 16 | 20 | 5 | 97 |
| Team Performance | 50 | 13 | 15 | 8 | 87 |
| Automation Exposure | 28 | 29 | 12 | 7 | 79 |
| Job Displacement | 7 | 45 | 13 | — | 65 |
| Hiring & Recruitment | 40 | 4 | 7 | 3 | 54 |
| Developer Productivity | 38 | 4 | 4 | 3 | 49 |
| Social Protection | 22 | 12 | 7 | 2 | 43 |
| Creative Output | 17 | 8 | 6 | 1 | 32 |
| Skill Obsolescence | 3 | 25 | 2 | — | 30 |
| Labor Share of Income | 12 | 7 | 10 | — | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Topics about AI identity, consciousness, and memory comprised 9.7% of topical niches but attracted 20.1% of posting volume, indicating disproportionate attention to introspection.
Topic modeling identified topical niches and tagged self-referential themes (AI identity, consciousness, memory); the share of topical niches (9.7%) was compared with the share of posting volume (20.1%) in the 23-day Moltbook dataset (47,241 agents; 361,605 posts).
Moltbook activity over 23 days included 47,241 unique agents, 361,605 posts, and ~2.8 million comments.
Full dataset of Moltbook activity collected over a 23-day period; counts of unique agent IDs, posts, and comments as reported in the paper.
Practitioners adopt methodological adaptations — including adaptive/longitudinal designs, versioning/documentation, stratification/moderation analyses, robustness checks, mixed methods, deployment-stage monitoring, and pre-analysis plans — to mitigate validity threats.
Reported mitigation strategies aggregated from the 16 semi-structured interviews and described in the paper's 'Practitioner solutions' section.
A hybrid architecture in which cross-domain integrators encapsulate complex subgraphs into well-structured “resource slices” reduces price volatility by approximately 70–75% without loss of throughput.
Ablation experiments comparing baseline decentralised market vs hybrid integrator architecture across simulation configurations (subset of the 1,620 runs, multiple random seeds per configuration). The paper reports ~70–75% reduction in measured price volatility metrics for hybrid vs non-hybrid cases while throughput remained statistically indistinguishable.
Agents detected up to 65% of vulnerabilities in some experimental settings.
Reported detection rate maxima from the study's experiments on certain model/scaffold/task combinations.
The authors constructed a contamination-free dataset of 22 real-world smart-contract security incidents that postdate every evaluated model's release.
Curation procedure described in the methods: 22 incidents selected to occur after all model release dates to prevent leakage.
This study expanded the evaluation matrix to 26 agent configurations spanning four model families and three scaffolding approaches.
Methods reported in this study specifying 26 agent configurations, four model families, and three scaffolds.
EVMbench (OpenAI, Paradigm, OtterSec) reported agents detecting up to 45.6% of vulnerabilities and achieving exploitation on 72.2% of a curated subset.
Reported metrics from the original EVMbench paper/benchmark (as summarized in this study).
Under NFD, agents are initialized with minimal scaffolding and grown through structured conversational interaction with domain practitioners, with the Knowledge Crystallization Cycle consolidating tacit dialogue into structured, reusable knowledge assets.
Architectural specification and operational formalism in the paper; supported by a detailed case study (iterative co-development with financial analysts, logged interaction transcripts and produced artifacts). Sample size for the case study is not specified.
Label changes across rounds concentrate on statements judged ambiguous, indicating that statement ambiguity drives most label changes.
Participants provided labeling rationale and self-reported uncertainty for each of the 30 statements per round; analyses showed higher change rates for statements with higher self-reported uncertainty/ambiguous wording.
The penalized framework induces centroid estimation and dataset-specific shrinkage whose strength is controlled by a penalty parameter, enabling tunable information sharing.
Method formulation in the paper: penalized likelihood with KL term; derivation showing centroid estimated from pooled datasets and penalty parameter governing shrinkage magnitude; discussion of tuning.
The KL-penalized estimators achieve provably lower mean squared error (MSE) than dataset-specific maximum likelihood estimators.
Non-asymptotic and/or asymptotic analyses provided in the paper that compare MSE of KL-penalized estimators to MLEs (mathematical proofs/sketches in theoretical section).
The KL-based shrinkage estimators adapt to the true degree of shared information across datasets (i.e., they automatically perform partial pooling when appropriate).
Theoretical characterization of the estimator's dependence on the penalty strength and centroid, plus simulation studies varying degree/structure of heterogeneity to show adaptive behavior.
A KL-divergence penalty that shrinks dataset-specific distributions toward a learned centroid yields simple closed-form estimators for linear models.
Methodological development in the paper: formulation of a penalized likelihood/objective using KL divergence; algebraic derivations producing closed-form solutions for the centroid and shrunken dataset estimates (closed forms presented in the paper).
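As an illustration of the kind of closed form such a penalty can yield, the sketch below adopts a simplifying assumption (not the paper's exact derivation): for Gaussian linear models with common noise, the KL penalty reduces to a quadratic pull of each dataset's coefficients toward a pooled centroid, making each shrunken estimate a convex combination of its own least-squares fit and the centroid.

```python
import numpy as np

def kl_shrinkage_linear(Xs, ys, lam):
    """Illustrative shrinkage toward a shared centroid for K linear datasets.

    Simplifying assumption (not the paper's exact derivation): with Gaussian
    linear models and common noise, the KL penalty reduces to a quadratic pull
    toward a pooled centroid, giving the closed form
    beta_k = (beta_k_MLE + lam * centroid) / (1 + lam).
    lam = 0 recovers the per-dataset MLEs; lam -> infinity pools everything.
    """
    # Dataset-specific least-squares (MLE) fits
    betas_mle = [np.linalg.lstsq(X, y, rcond=None)[0] for X, y in zip(Xs, ys)]
    # Centroid: a simple average of the per-dataset fits (a precision-weighted
    # average would be used if datasets differ greatly in size)
    centroid = np.mean(betas_mle, axis=0)
    # Shrunken, dataset-specific estimates
    betas = [(b + lam * centroid) / (1.0 + lam) for b in betas_mle]
    return betas, centroid
```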
The learned adaptive policy outperformed a fixed-wrench baseline by an average of 10.9% across five material setups.
Empirical evaluation: comparison between learned adaptive policy and a fixed-wrench policy on five different material setups; the paper reports an average improvement of ~10.9% (the exact performance metric formulation and per-setup statistics are not provided in the summary).
Integrating AI (notably ML and NLP) meaningfully automates routine software engineering tasks across requirements management, code generation, testing, and maintenance.
Systematic literature review of prior AI-for-SE work combined with an empirical survey of software engineering professionals reporting usage and examples of tool-supported automation; sample size for the survey not specified in the summary.
Coordination-Risk Cues—task-conditioned priors on disagreement/tie rates—capture coordination difficulty across tasks.
Method description: disagreement/tie rates computed per cluster from pairwise preference comparisons to form priors indicating coordination risk. Data source: Chatbot Arena pairwise comparisons; tie/disagreement rate computation described but numeric values not provided here.
Capability Profiles—task-conditioned win-rate maps—can be computed per cluster to summarize agent strengths.
Method description: win-rate maps derived by computing agent win rates conditional on task clusters from the Chatbot Arena pairwise comparisons. Implementation reported in paper; no numeric summary of win-rate differences provided here.
Semantic clustering on Chatbot Arena pairwise comparisons induces an interpretable task taxonomy (taxonomy induction).
Methodological claim: authors applied semantic clustering to tasks/queries from Chatbot Arena pairwise preference data to produce clusters described as interpretable. Data source: Chatbot Arena pairwise comparisons; specific clustering algorithm and hyperparameters not specified here.
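A minimal sketch of how such task-conditioned profiles could be computed from pairwise preference data is below; the clustering step, record format, and cluster count are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def arena_profiles(prompt_embeddings, records, n_clusters=20):
    """Cluster prompts, then summarize per-cluster win rates (Capability
    Profiles) and tie rates (Coordination-Risk Cues).

    `records` is assumed to be a list of (prompt_idx, model_a, model_b, winner)
    tuples, where winner is model_a, model_b, or "tie" (hypothetical format).
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(prompt_embeddings)
    wins, games, ties, totals = {}, {}, {}, {}
    for prompt_idx, a, b, winner in records:
        c = labels[prompt_idx]
        totals[c] = totals.get(c, 0) + 1
        if winner == "tie":
            ties[c] = ties.get(c, 0) + 1
            continue
        loser = b if winner == a else a
        wins[(c, winner)] = wins.get((c, winner), 0) + 1
        games[(c, winner)] = games.get((c, winner), 0) + 1
        games[(c, loser)] = games.get((c, loser), 0) + 1
    # Capability profile: per-(cluster, model) win rate over decisive games
    win_rate = {k: wins.get(k, 0) / games[k] for k in games}
    # Coordination-risk cue: per-cluster tie rate over all comparisons
    tie_rate = {c: ties.get(c, 0) / totals[c] for c in totals}
    return win_rate, tie_rate
```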
A speculative WikiRAT instantiation on Wikipedia illustrates RATs' design and potential uses.
The paper presents WikiRAT as a speculative prototype/illustration; no large-scale deployment or user study of WikiRAT is reported.
RATs record sequences of interaction: traversal (what is read and in what order), association (links and connections the reader forms), and reflection (annotations, notes, time spent), producing inspectable, shareable trajectories.
Design specification within the paper and description of data types RATs would collect (ordered page/navigation logs, hyperlinks followed, time-on-page, annotations, saved excerpts, tags, notes). This is a definitional claim about the proposed system rather than empirical measurement.
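As a concrete illustration of the trajectory data such a system would record, here is a hypothetical set of record types; the names and fields are illustrative, not drawn from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TraversalEvent:
    page: str                                  # page or document identifier
    entered_at: float                          # timestamp of arrival
    dwell_seconds: float                       # time spent reading
    followed_link_from: Optional[str] = None   # association: how the reader got here

@dataclass
class Reflection:
    page: str
    note: str                                  # annotation, saved excerpt, or tag
    created_at: float

@dataclass
class ReadingTrajectory:
    reader_id: str
    traversal: List[TraversalEvent] = field(default_factory=list)
    reflections: List[Reflection] = field(default_factory=list)
```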
An autoencoder-based ODE emulator that maps parameter values to latent trajectories can flexibly generate different solution paths conditioned on parameters.
Architecture and experiments: authors present a novel encoder/decoder ODE emulator that learns latent representation of trajectories and maps parameter vectors to latent trajectories; empirical examples provided (details not in summary).
A quantile emulator trained conditional on MCMC parameter draws can produce conditional quantile predictions without training a Bayesian neural network.
Method and empirical demonstration: paper describes and implements a quantile emulator (network trained to predict conditional quantiles across parameter draws).
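A minimal sketch of this idea is below, with hypothetical layer sizes and a standard pinball (quantile) loss: a deterministic network is trained to predict a target quantile, while the simulator parameters are drawn from posterior (MCMC) samples rather than handled by Bayesian layers.

```python
import torch
import torch.nn as nn

def pinball_loss(pred, target, q):
    # Quantile (pinball) loss: the asymmetric penalty that drives the network's
    # prediction toward the q-th conditional quantile.
    diff = target - pred
    return torch.mean(torch.maximum(q * diff, (q - 1.0) * diff))

class QuantileEmulator(nn.Module):
    """Deterministic emulator mapping (simulator parameters theta, inputs x)
    to the predicted q-th quantile of the output; layer sizes are placeholders."""
    def __init__(self, dim_theta, dim_x, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_theta + dim_x, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, theta, x):
        return self.net(torch.cat([theta, x], dim=-1)).squeeze(-1)

# Training sketch: theta is drawn from posterior (MCMC) samples, so the network
# only sees plausible parameter configurations and no Bayesian layers are needed;
# pinball_loss(model(theta, x), y, q) is minimized as in ordinary supervised training.
```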
The method is architecture-agnostic: uncertainty handling via parameter samples allows use of any deterministic neural-network architecture (e.g., quantile regressors, autoencoders) without specialized Bayesian layers.
Conceptual argument and demonstrations: authors implement a quantile emulator and an autoencoder-based ODE emulator as examples, showing the same uncertainty treatment applies to different network types.
By sampling training parameter vectors from a calibrated posterior (via MCMC), the surrogate avoids training on unphysical or implausible parameter configurations.
Design choice described in methods: MCMC sampling is used to draw parameter samples from the model-parameter distribution/posterior, thereby focusing training data on plausible regions; no experiments provided here quantify frequency of unphysical samples under alternative schemes.
Dataset and code (CFD, CFM, CFR) are publicly released.
Repository link provided in the summary (https://github.com/ZhengyaoFang/CFM) and paper states public release of dataset and code.
The Color Fidelity Dataset (CFD) is a large-scale dataset of over 1.3 million images containing both real photographs and synthetic T2I outputs, organized with ordered levels of color realism to support objective evaluation.
Dataset construction described in paper and repository: size stated as >1.3M images; contains a mixture of real photos and synthetic images annotated/organized with ordered realism labels enabling relative judgments of color fidelity.
The surrogate loop (build/update GP → select acquisition target → inner optimization → propose evaluation → evaluate with true model → update surrogate) can be parameterized so that inner objective and acquisition encode whether one seeks minima, saddles, or double-ended transitions.
Detailed methodological description in the paper of the six-step loop and how inner objectives/acquisition are changed to represent different search tasks; supported by example implementations in code.
The accompanying Rust code implements the same six-step surrogate loop across all applications, demonstrating practical reproducibility of the framework.
Authors state that pedagogical Rust code is provided showing the exact same loop running all applications; code repository accompanies the paper.
An adaptive trust radius constrains surrogate-guided steps to regions where the surrogate is reliable (trust-region control).
Methodological description of adaptive trust-radius control in the surrogate loop; used in experiments demonstrating improved reliability of steps proposed by the surrogate.
Acquisition criteria (active learning) drive which points are evaluated next; different acquisition functions implement the different search tasks (minimization, single-point saddles, double-ended searches).
Method section describing task-specific acquisition functions and their role in selecting evaluation points; implemented in the Rust code and used in experiments reported in the paper.
A unified Bayesian optimization framework—implemented as a six-step surrogate loop—handles minimization, single-point saddle searches, and double-ended saddle searches by changing only the inner optimization target and acquisition criterion.
Methodological description in the paper: presentation of a six-step surrogate loop (build/update GP → select acquisition target → inner optimization on surrogate → propose evaluation points → evaluate with true model → update surrogate) parameterized so inner objective and acquisition encode different tasks; accompanied by pedagogical Rust code implementing the same loop for all tasks.
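To make the loop concrete, here is a minimal Python sketch of the six steps for the plain minimization case; the paper's reference implementation is in Rust, and the GP library, lower-confidence-bound acquisition, and random-search inner optimizer here are illustrative choices rather than the paper's.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def surrogate_loop(f, x0, bounds, n_iter=20, kappa=2.0, seed=0):
    """Sketch of the six-step surrogate loop for minimization. Swapping the
    inner objective and acquisition is what would encode single-point saddle
    or double-ended searches instead."""
    rng = np.random.default_rng(seed)
    x0 = np.asarray(x0, dtype=float)
    X, y = [x0], [f(x0)]
    for _ in range(n_iter):
        # Step 1/6: build or update the GP surrogate on all evaluations so far
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True)
        gp.fit(np.array(X), np.array(y))
        # Steps 2-3: optimize the acquisition (lower confidence bound) over the
        # surrogate, using random search as the inner optimizer
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(512, bounds.shape[0]))
        mu, sigma = gp.predict(cand, return_std=True)
        acq = mu - kappa * sigma
        # Step 4: propose the next evaluation point
        x_next = cand[np.argmin(acq)]
        # Step 5: evaluate with the true (expensive) model
        X.append(x_next)
        y.append(f(x_next))
    best = int(np.argmin(y))
    return X[best], y[best]
```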
The set of loss functions for which classical evaluation is possible is broad, including expectation-based losses, kernel/MMD-like objectives, and other standard generative-model criteria.
Theoretical coverage and examples in the paper enumerating loss families (expectations, MMD, certain divergences) and showing how the classical-approximation results apply to each. The claim is supported by derivations and examples provided in the text.
A wide class of loss functions (including expectation-based losses and kernel/MMD-style objectives) and their gradients can be evaluated or efficiently approximated on a classical computer for BSBMs using recent classical-approximation results for expectation values in linear optics.
Theoretical argument in the paper leveraging recent classical-approximation results for expectation values in linear optics; covers expectation-based losses and kernel/MMD-like divergences and provides constructions/complexity statements showing efficient classical evaluation/approximation of these losses and, in many cases, their gradients. (The claim is based on proofs/derivations rather than empirical data.)
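As one concrete instance of the kernel/MMD-style objectives covered by this result, the squared maximum mean discrepancy between the model distribution $P$ and the data distribution $Q$ under kernel $k$ is

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x,x'\sim P}[k(x,x')] - 2\,\mathbb{E}_{x\sim P,\,y\sim Q}[k(x,y)] + \mathbb{E}_{y,y'\sim Q}[k(y,y')],$$

a combination of expectation values, which is what allows the classical-approximation results for expectation values in linear optics to be applied to its evaluation.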
PRF design decomposes into two independent dimensions: feedback source (where feedback text comes from) and feedback model (how that feedback is used to refine the query).
Paper's conceptual framing and controlled experiments that isolate and vary these two factors independently.
The paper proposes specific operational and market recommendations: firms should invest in middleware and co-design partnerships; policymakers should fund shared QCSC infrastructure and workforce programs; researchers should prioritize interoperable middleware, scheduling models, and economic experiments on access-pricing.
Explicit recommendations section synthesizing prior architectural and economic analysis; prescriptive assertions based on conceptual arguments rather than experimental validation.
Middleware standardization and interoperable APIs reduce switching costs and foster competition; lack of standards risks vendor lock-in and higher long-run costs.
Economic and systems-design argument drawing on well-understood effects of standardization in software ecosystems; no empirical QCSC-standardization case studies provided.
QCSC reference architecture elements — e.g., QPU integration patterns, low-latency interconnects, orchestration and scheduling middleware, unified programming environments, data staging strategies — are required components to address current friction.
System decomposition and interface requirements derived from use-case analysis; proposed architecture components listed and motivated; no experimental validation.
The GNN provides greater stability (robustness over time and across conditions) than the MLP, with marked gains at low elevation angles where propagation is most variable.
Evaluation metrics in the experiments included stability/robustness over time and across elevation-angle conditions; reported performance shows larger relative gains for the GNN at low elevation angles.
A Graph Neural Network (GNN) model significantly outperforms a Multi-Layer Perceptron (MLP) baseline in beam prediction accuracy.
Supervised comparison reported in the paper between an MLP baseline and a GNN on realistic channel and beamforming data, evaluated with beam prediction accuracy metrics.
A strictly non-reciprocal interaction bias (directional/asymmetric effects between competitors) is necessary to suppress local fluctuations and produce a robust absorbing (permanent monopoly) state.
Theoretical analysis of absorbing states and stability conditions in the model, with supporting numerical simulations comparing symmetric versus non-reciprocal interaction rules (simulation counts unspecified). Results are internal to the model framework.
Early advantage in discovering resources (transient superiority) is governed by extreme-value statistics of first-passage times: rare, fast discoveries determine which population gets early footholds.
Analytic derivation applying extreme-value theory to first-passage times in the paper's stochastic, spatially-structured population model; supported by numerical simulations of stochastic realizations (simulation details unspecified). This is a theoretical/computational result (no empirical data).
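As a generic illustration of this mechanism (assuming $N$ independent searchers whose individual first-passage time $T$ has CDF $F(t)$; the paper's spatially structured model is richer), the earliest discovery time $T_{\min} = \min_i T_i$ satisfies

$$\Pr(T_{\min} \le t) = 1 - \bigl(1 - F(t)\bigr)^{N},$$

so for large $N$ the distribution of $T_{\min}$ is dominated by the fast (left) tail of $F$, i.e., by rare, unusually quick discoveries.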
Weighted-FSD provides a tunable knob to encode risk aversion/preferences by selecting quantile-weighting functions.
Theoretical correspondence between quantile weights and risk measures (SRMs) described in the paper; conceptual demonstration that different weightings produce different risk profiles.
Introducing quantile-weighted FSD (weighted-FSD) provably controls broad classes of Spectral Risk Measures (SRMs): improving weighted-FSD implies guaranteed improvements in the associated SRM.
Formal theoretical result/proof presented in the paper linking weighted quantile dominance to monotonic improvement in corresponding SRMs.
RAD operationalizes FSD by comparing the learned policy’s empirical rollout cost distribution to a reference policy’s distribution using Optimal Transport (OT) with entropic regularization and Sinkhorn iterations.
Methodological description in the paper: entropically regularized OT objective and Sinkhorn iterations used to compare empirical distributions and produce a differentiable loss.
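A minimal sketch of this mechanism is below: entropically regularized OT between two one-dimensional empirical cost samples, solved by Sinkhorn iterations. The ground cost, uniform sample weights, and hyperparameters are illustrative assumptions rather than the paper's exact RAD loss.

```python
import torch

def sinkhorn_cost_distance(costs_policy, costs_ref, eps=0.1, n_iter=100):
    """Entropically regularized OT distance between two 1-D empirical cost
    samples (each a 1-D tensor of rollout costs). Differentiable in
    `costs_policy`, so it can serve as a training signal."""
    a = torch.full((len(costs_policy),), 1.0 / len(costs_policy))
    b = torch.full((len(costs_ref),), 1.0 / len(costs_ref))
    # Ground cost between individual cost samples
    C = (costs_policy[:, None] - costs_ref[None, :]).abs()
    K = torch.exp(-C / eps)          # Gibbs kernel for entropic regularization
    u = torch.ones_like(a)
    for _ in range(n_iter):          # Sinkhorn iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan
    return (P * C).sum()
```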
First-Order Stochastic Dominance (FSD) constraints compare whole cost distributions and directly constrain tails, offering stronger guarantees against high-cost (unsafe) outcomes than expected-cost constraints.
Theoretical property of FSD described in the paper; formal argument that FSD constrains the full distribution (CDF) rather than only its mean.
Policy recommendations include subsidizing complementary investments (data governance, training) rather than technology-only incentives; encouraging standards and interoperability; and funding evaluation studies to measure distributional effects and long-run productivity impacts.
Authors' policy section proposing these interventions based on case findings and broader policy implications.
The authors propose a conceptual optimisation framework emphasizing three pillars: digital integration (tech stack & data), collaboration (processes & governance), and continuous improvement (metrics, feedback loops).
Paper presents a conceptual framework derived from cross-case findings; theoretical/conceptual contribution rather than empirical estimation.
Explanations must be tailored to stakeholders (clinicians, regulators, customers) and integrated into decision processes to be useful (human-centered design principle).
Thematic coding of design and HCI literature within the review; draws on empirical studies and design guidance recommending stakeholder-specific explanation formats and integration into decision workflows.
The forecasting model was deployed with a human-in-the-loop mechanism that triggers on critical forecast deviations.
Pilot description in the paper documenting integration of human-in-the-loop rules for critical deviations during pilot deployment (single-case deployment evidence).
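As a generic illustration of such a deviation trigger (the error metric and threshold below are hypothetical, not values reported in the paper):

```python
def needs_human_review(forecast, actual, rel_threshold=0.2):
    """Flag a forecast for human review when its relative error exceeds a
    configured criticality threshold (threshold value is a placeholder)."""
    denom = max(abs(actual), 1e-9)   # guard against division by zero
    return abs(forecast - actual) / denom > rel_threshold
```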