Evidence (13827 claims)
| Topic | Claims |
|---|---|
| Adoption | 8454 |
| Productivity | 7544 |
| Governance | 6789 |
| Human-AI Collaboration | 6327 |
| Org Design | 4126 |
| Innovation | 4058 |
| Labor Markets | 3520 |
| Skills & Training | 2924 |
| Inequality | 2057 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
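
As a reading aid, the share of findings in each direction can be computed from a row of the matrix. The sketch below hard-codes three rows copied from the table above; the function name `direction_shares` is illustrative and not part of the source. Shares are taken over the four direction columns, because in some rows the Total column is larger than their sum (for Governance & Regulation the four columns add up to 1515 against a Total of 1539).

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Governance & Regulation": (815, 391, 188, 121, 1539),
    "Firm Productivity": (435, 55, 88, 20, 604),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the four direction columns."""
    counted = positive + negative + mixed + null
    return {
        "positive": positive / counted,
        "negative": negative / counted,
        "mixed": mixed / counted,
        "null": null / counted,
    }


for outcome, (pos, neg, mix, nul, total) in ROWS.items():
    shares = direction_shares(pos, neg, mix, nul)
    # Claims counted in Total but not in any of the four direction columns.
    outside = total - (pos + neg + mix + nul)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, "
        f"{outside} claims outside the four columns"
    )
```

Read this way, Job Displacement is dominated by negative findings (80 of 113), while Firm Productivity is dominated by positive ones (435 of the 598 claims that carry a direction).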
Tri-jurisdictional firms had larger workforces (5,380 ± 1,245).
Reported descriptive statistics in the paper: 'Tri-jurisdictional firms had larger workforces (5,380 ± 1,245)'.
Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
Statement in abstract and provided URL pointing to project artifacts.
Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.
Paper's stated contribution and intended purpose (abstract) and provision of dataset/benchmark artifacts via project website.
To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation.
Method and validation claim in abstract stating use of rubric-based VLM and validation against human annotations.
The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality.
Paper's description of the benchmark's evaluation rubric and intended assessment criteria (abstract).
MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment.
Methodological description in abstract indicating dataset pairing and three-stage evaluation protocol.
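The three-stage protocol described above can be pictured as a gated pipeline. The following is a minimal sketch and not the MUSE implementation: the abstract does not say whether a failed stage stops the evaluation, so short-circuiting is an assumption here, and every check body, the candidate dictionary layout, and the 0.5 threshold are hypothetical placeholders. Only the stage names and the three rubric criteria come from the claims above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class StageResult:
    stage: str
    passed: bool
    detail: str


def code_check(candidate: Dict) -> Tuple[bool, str]:
    # Placeholder: accept the candidate script if Python can compile it.
    try:
        compile(candidate["script"], "<candidate>", "exec")
        return True, "script compiles"
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"


def geometric_check(candidate: Dict) -> Tuple[bool, str]:
    # Placeholder: require a positive reported volume.
    volume = candidate.get("volume", 0.0)
    return volume > 0, f"volume={volume}"


def design_intent_check(candidate: Dict) -> Tuple[bool, str]:
    # Placeholder for rubric scoring: mean of per-criterion scores in [0, 1]
    # compared with a hypothetical pass threshold of 0.5.
    scores = candidate.get("rubric_scores", {})
    if not scores:
        return False, "no rubric scores"
    mean = sum(scores.values()) / len(scores)
    return mean >= 0.5, f"mean rubric score={mean:.2f}"


STAGES: List[Tuple[str, Callable[[Dict], Tuple[bool, str]]]] = [
    ("code check", code_check),
    ("geometric check", geometric_check),
    ("design-intent alignment", design_intent_check),
]


def evaluate(candidate: Dict) -> List[StageResult]:
    """Run the stages in order and stop at the first failure (assumed)."""
    results = []
    for name, check in STAGES:
        passed, detail = check(candidate)
        results.append(StageResult(name, passed, detail))
        if not passed:
            break
    return results


example = {
    "script": "width = 10\nheight = 5",
    "volume": 2.0,
    "rubric_scores": {
        "functionality": 0.8,
        "manufacturability": 0.6,
        "assemblability": 0.7,
    },
}
for result in evaluate(example):
    print(result)
```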
We introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies.
Paper contribution / dataset creation described in abstract; supported by project website and accompanying dataset/code.
By Round 3, equity-aware LLM refinement reduces energy costs by 3.2%.
Empirical results reported in abstract: energy cost reduction of 3.2% after three rounds of LLM-mediated reward refinement (15 experimental runs).
By Round 3, equity-aware LLM refinement improves satisfaction for Elderly Females (+567%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +567%. 15 experimental runs.
By Round 3, equity-aware LLM refinement improves satisfaction for Health Sensitive (+53.8%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +53.8%. 15 experimental runs.
By Round 3, equity-aware LLM refinement improves satisfaction for Mid-aged Females (+28.2%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +28.2%. 15 experimental runs.
By Round 3, equity-aware LLM refinement improves satisfaction for Young Males (+17.6%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +17.6%. 15 experimental runs.
We introduce the Comfort Equity Index (CEI) as a novel feedback signal.
Paper contribution / methodological description introducing CEI (no quantitative validation details reported in abstract).
Multimodal contrastive learning enables generative AI to output images that closely align with text prompts.
Stated as background/technical premise in the paper (based on prior work on multimodal contrastive learning; no experiment details provided in the abstract).
Human-subject experiments further validate the commercial effectiveness of the utility-aware method.
Reported human-subject experiments in the paper that are said to validate commercial effectiveness (details such as sample size, design, and metrics are not provided in the abstract).
In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency.
Empirical evaluation on downstream applications using Amazon and Airbnb datasets / deployments reported in the paper (experiments comparing their method to state-of-the-art models; exact sample sizes and metrics not provided in the abstract).
The effect arises from a shift in the learned image-text representation space toward demand-driven visual cues, which we validate through a theoretical bound on the proposed objective.
Theoretical analysis presented in the paper claiming a bound that links the utility-aware objective to representation shifts toward demand-relevant features.
Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing.
Claim supported in the paper by a theoretical bound and by downstream empirical evaluation (described in the abstract; experiments on marketplace data referenced).
We propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss.
Methodological contribution described in the paper (proposal of a new loss function and framework; supported by method description and theoretical development).
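The claim above does not give the form of the Utility-Aware InfoNCE loss, so the following is only a sketch of one plausible reading: a standard image-to-text InfoNCE loss in which each positive pair is weighted by a demand signal. The function name `weighted_info_nce`, the `demand_weights` argument, and the weighting scheme itself are assumptions for illustration and should not be taken as the paper's formulation. With uniform weights the function reduces to ordinary InfoNCE.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def weighted_info_nce(
    image_emb: List[Sequence[float]],
    text_emb: List[Sequence[float]],
    demand_weights: Sequence[float],
    temperature: float = 0.1,
) -> float:
    """Image-to-text InfoNCE where pair i contributes with weight demand_weights[i].

    Pair i (image_emb[i], text_emb[i]) is the positive; all other texts in the
    batch act as negatives. The weighting is a hypothetical stand-in for a
    demand signal.
    """
    n = len(image_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(image_emb[i], text_emb[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += demand_weights[i] * (log_denominator - logits[i])
    return total / sum(demand_weights)


images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
weights = [1.0, 3.0]  # hypothetical: the second product has higher demand

print(weighted_info_nce(images, texts_aligned, weights))
print(weighted_info_nce(images, texts_swapped, weights))
```

In this toy batch the aligned pairing yields a loss near zero and the swapped pairing a large one, which is the behaviour any InfoNCE variant should show regardless of how the weights are chosen.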
Product images strongly influence consumer decision-making in online marketplaces.
Stated as background motivation in the paper (cites prior literature / widely accepted premise; no specific sample or experiment reported in the excerpt).
The paper provides a conceptual foundation for designing AI systems that model expert sensing over time, positioning cognition as an infrastructural, operational, and professional domain in persistent human-AI systems.
Stated contribution of the paper (conceptual/theoretical contribution rather than empirical evidence).
The Cognitive Operations Research and Training Framework (CORTF) is introduced to support research, education, and workforce development.
Conceptual framework proposed in the paper (no empirical implementation or evaluation presented).
The Cognitive Operations Manager is proposed as a prototype AI-native professional role for coordinating tacit signal modelling, semantic modelling, AI system calibration, expert validation, and ethical governance.
Proposal of a new professional role in the paper (conceptual/visionary; no pilot study, job analysis, or workforce data reported).
Long-term Cognitive Operations are defined as the practices required to maintain and govern such systems, including memory curation, semantic organisation, tacit signal modelling, reasoning calibration, and cognitive governance.
Conceptual taxonomy/definition introduced in the paper (theoretical framing; no empirical validation).
Tacit Signal Infrastructure is introduced as a layer for capturing, structuring, modelling, interpreting, and validating expert tacit signals over time.
Conceptual design/proposal presented in the paper (architectural description; no empirical implementation or evaluation reported).
Next-generation AI systems should move beyond explicit knowledge processing toward the longitudinal modelling of expert tacit sensing.
Normative proposal / recommendation made in the paper as part of a vision; supported by conceptual rationale rather than empirical data.
High-level expertise also depends on tacit sensing: perceiving weak signals, recognising emerging tensions, detecting coherence degradation, and anticipating instability before formal indicators appear.
Conceptual claim grounded in cognitive-science-informed argumentation presented in the paper (no empirical study or sample size reported).
Current generative AI systems are increasingly effective at processing explicit knowledge, including retrieving information, summarising documents, generating explanations, and supporting codified workflows.
Asserted in the paper as a descriptive trend; based on literature synthesis and observations of current generative AI capabilities (no empirical sample or experiment reported in the paper).
Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.
Forward-looking claim about potential applications of the proposed protocol; described conceptually with no experimental validation or deployment case studies.
As nodes specialise toward high-reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms.
Theoretical/analogical claim based on designed routing signals and incentives; presented as expected emergent behaviour without empirical demonstration.
Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self-regulating participation economy.
Mechanism and expected economic dynamics described as part of the protocol design; no experimental or deployment evidence provided to demonstrate the claimed emergent behaviour.
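The earn, spend, and drain mechanics described above can be illustrated with a toy ledger. This is a sketch under stated assumptions, not the SwarmCredit specification: the constants `REWARD`, `SUBMIT_COST`, and `IDLE_DRAIN`, the round structure, and the rule that routing priority follows the credit balance are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical parameters; the source gives no numeric values.
REWARD = 1.0       # credits earned per served task
SUBMIT_COST = 1.0  # credits spent per submitted task
IDLE_DRAIN = 0.1   # credits lost per round without serving


@dataclass
class Node:
    name: str
    credits: float
    served_this_round: bool = False


def serve(node: Node) -> None:
    """Credit a node for serving one task."""
    node.credits += REWARD
    node.served_this_round = True


def submit(node: Node) -> bool:
    """Charge a node for submitting one task; refuse if it cannot pay."""
    if node.credits < SUBMIT_COST:
        return False
    node.credits -= SUBMIT_COST
    return True


def end_round(nodes: List[Node]) -> None:
    """Drain credits from nodes that served nothing, then reset the flags."""
    for node in nodes:
        if not node.served_this_round:
            node.credits = max(0.0, node.credits - IDLE_DRAIN)
        node.served_this_round = False


def routing_priority(nodes: List[Node]) -> List[Node]:
    """Assumed rule: nodes with more credits are preferred by the router."""
    return sorted(nodes, key=lambda n: n.credits, reverse=True)


worker = Node("worker", credits=1.0)
idler = Node("idler", credits=1.0)
for _ in range(3):
    serve(worker)
    end_round([worker, idler])

print([(n.name, round(n.credits, 2)) for n in routing_priority([worker, idler])])
```

After three rounds the contributing node holds more credits and ranks first, while the idle node has drifted below the cost of submitting a task.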
SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute-credit rewards to contributing nodes via a Shapley-value approximation.
Architectural description in the paper; specified components and mechanisms described as part of the proposed system design, without empirical validation.
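Two of the components named above lend themselves to a short sketch: a router that scores nodes with a utility function, and a Shapley-value approximation for attributing credit. The claim does not give the functional form of the utility or the approximation method, so the linear utility, its weights, and the Monte Carlo permutation sampling below are assumptions chosen because they are common defaults. The names `route` and `shapley_estimate` are illustrative.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def route(nodes: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Pick the node with the highest utility (assumed linear form).

    Capability and trust raise the score; load and latency lower it.
    """
    def utility(metrics: Dict[str, float]) -> float:
        return (
            weights["capability"] * metrics["capability"]
            - weights["load"] * metrics["load"]
            - weights["latency"] * metrics["latency"]
            + weights["trust"] * metrics["trust"]
        )

    return max(nodes, key=lambda name: utility(nodes[name]))


def shapley_estimate(
    players: List[str],
    value: Callable[[FrozenSet[str]], float],
    samples: int = 2000,
    seed: int = 0,
) -> Dict[str, float]:
    """Approximate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the players."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(samples):
        order = players[:]
        rng.shuffle(order)
        coalition = set()
        previous = value(frozenset(coalition))
        for player in order:
            coalition.add(player)
            current = value(frozenset(coalition))
            totals[player] += current - previous
            previous = current
    return {p: total / samples for p, total in totals.items()}


# Hypothetical node metrics, each normalised to [0, 1].
nodes = {
    "node_a": {"capability": 0.9, "load": 0.8, "latency": 0.2, "trust": 0.9},
    "node_b": {"capability": 0.8, "load": 0.1, "latency": 0.1, "trust": 0.9},
}
unit_weights = {"capability": 1.0, "load": 1.0, "latency": 1.0, "trust": 1.0}
print(route(nodes, unit_weights))


# Toy value function: a task succeeds (value 1) only if both nodes take part.
def joint_value(coalition: FrozenSet[str]) -> float:
    return 1.0 if {"node_a", "node_b"} <= coalition else 0.0


print(shapley_estimate(["node_a", "node_b"], joint_value))
```

Permutation sampling keeps the cost linear in the number of samples rather than exponential in the number of nodes, which is the usual reason to approximate Shapley values instead of computing them exactly.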
We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self-organise into a compute swarm without any central authority.
Design/proposal presented in the paper; no implementation results or deployment metrics provided.
To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.
Authors' recommendations based on observed shortcomings in human–AI collaboration in the study (no direct experimental test of these interventions reported in the abstract).
Human–AI collaboration performs better than either AI or humans alone.
Comparison of collaborative team performance versus AI-alone and human-alone performance reported from the experiment.
Two non-negotiable design requirements guide the architecture: cognitive-load redistribution (DR1) and bounded autonomy with alignment (DR2).
Design requirements explicitly stated in the paper guiding the HARMONY architecture.
The model introduces 'Orchestration Leverage' as a candidate productivity metric suited to human–agent hybrid systems.
Conceptual proposal within the paper (new metric introduced as part of HARMONY).
We propose HARMONY (Hybrid Agentic Research Model for Organisational New Yield), a four-pillar socio-technical architecture comprising ResOps (Industrialized Execution), the Control Tower (Strategic Visibility and Drift Detection), the Ethics Fabric (Bounded Autonomy by Design), and the Talent Studio (Sciencepreneur Capability).
Design Science Research artifact (proposed operating model described in the paper).
The framework establishes a principled vocabulary for designing enterprise service platforms that manage human and artificial intelligence labor responsibly, transparently, and at scale.
Paper presents the combined constructs (Workforce Unit Abstraction, Hybrid Capacity Model, Governance-bound Autonomy) as a coherent reference model and vocabulary; described as conceptual contribution arising from the design-science approach.
Governance-bound autonomy constrains AI Workforce Unit actions within a five-level, policy-enforced autonomy ladder supported by six mandatory governance controls.
Conceptual governance artifact described in the paper (five-level autonomy ladder + six governance controls); presented as the proposed governance design, not as an empirically tested intervention in the abstract.
The Hybrid Capacity Model extends demand-to-supply planning across heterogeneous workforce pools, resolving a multi-objective allocation problem that simultaneously optimizes cost, quality, and risk constraints.
Described model/algorithmic artifact in the paper (Hybrid Capacity Model) claiming multi-objective optimization; no empirical benchmark or sample size reported in the provided text.
The Workforce Unit Abstraction defines a unified seven-attribute operational schema applicable to both human workers and AI agents, enabling consistent representation across planning, scheduling, and governance systems.
Artifact description from the paper (Workforce Unit Abstraction with seven attributes); presented as a designed schema rather than an empirically validated result in the abstract.
This article introduces three constructs as reusable primitives for hybrid workforce platform design.
Design science research methodology producing an artifact (three constructs); described as the paper's contribution. No empirical evaluation or sample size reported in the abstract.
Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84–4.52), illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
Aggregate/longitudinal result from the simulation after 500 turns: reported cumulative change in in-group trust bias (absolute change +0.014 to +0.100) and reported effect sizes in Cohen's d (0.84–4.52); based on the same experimental setup (6 model families, 20 seeds each).
Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes.
Statistical analysis reported in the paper: per-turn differential between in-group and out-group targeting measured as percentages (5–16 percentage points); significance assessed with Wilcoxon signed-rank tests and Benjamini-Hochberg correction; applied across six model families each with 20 seeds.
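For readers unfamiliar with the two statistics named in these records, the sketch below shows a Benjamini-Hochberg adjustment of a list of p-values and a paired-samples Cohen's d. It is illustrative only: the input numbers are made up, not taken from the paper, and the paper does not state which variant of Cohen's d it used, so the paired form (mean difference divided by the standard deviation of the differences) is an assumption.

```python
import statistics
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def cohens_d_paired(in_group: Sequence[float], out_group: Sequence[float]) -> float:
    """Paired Cohen's d: mean of the differences over their sample SD."""
    differences = [a - b for a, b in zip(in_group, out_group)]
    return statistics.mean(differences) / statistics.stdev(differences)


# Made-up p-values for six hypothetical tests (one per model family).
raw_p = [0.0001, 0.0004, 0.0002, 0.0008, 0.0003, 0.0005]
print(benjamini_hochberg(raw_p))

# Made-up per-seed targeting rates for one hypothetical model.
in_rates = [0.62, 0.58, 0.65, 0.60]
out_rates = [0.50, 0.49, 0.52, 0.51]
print(cohens_d_paired(in_rates, out_rates))
```

The adjustment multiplies each p-value by the number of tests divided by its rank, so with six tests a raw p-value must be well below 0.001 to stay below that threshold after correction.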
When group labels were visible, we observed network assortativity (absent when labels were hidden).
Reported network-level outcomes from the simulation comparing visible vs hidden label conditions across the experimental runs (6 model families, 20 seeds each, 500 turns).
When group labels were visible, we observed action homophily.
Result reported from the simulation comparing visible versus hidden group label conditions across the described experimental runs (6 model families, 20 seeds each, 500 turns).
When group labels were visible, we observed in-group trust bias.
Result reported from the simulation comparing conditions with visible versus hidden group labels; based on interactions of instruction-tuned LLM agents across the reported experimental runs (6 model families, 20 seeds each, 500 turns).
We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each.
Descriptive methods statement from the paper: controlled multi-agent simulation; instruction-tuned LLM agents; 3 experimental conditions (manipulating group label salience and resource scarcity); 6 model families; 20 random seeds per model; 500 turns per simulation run.
Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).
Conceptual framing presented in the paper describing a proposed progression of disciplines.