Evidence (6574 claims)
| Topic | Claims |
|---|---|
| Adoption | 8625 |
| Productivity | 7686 |
| Governance | 6917 |
| Human-AI Collaboration | 6574 |
| Org Design | 4189 |
| Innovation | 4131 |
| Labor Markets | 3588 |
| Skills & Training | 2985 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Human-AI Collaboration
To enable scalable evaluation, we use a rubric-based visual language model (VLM) judge and validate its reliability through human annotation.
Method and validation claim in abstract stating use of rubric-based VLM and validation against human annotations.
The final stage uses design-specific rubrics to assess functionality, manufacturability, and assemblability, moving beyond shape matching toward practical design quality.
Paper's description of the benchmark's evaluation rubric and intended assessment criteria (abstract).
MUSE pairs practical design instances with structured Design Specifications and evaluates generated models through a three-stage protocol: code check, geometric check, and design-intent alignment.
Methodological description in abstract indicating dataset pairing and three-stage evaluation protocol.
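The three-stage protocol described above can be sketched as a simple staged pipeline. This is only an illustration under stated assumptions, not the MUSE implementation: the names `StageResult`, `evaluate`, and `code_check` are hypothetical, the candidate model is assumed to be a Python CAD script, the geometric and design-intent checks are stubs, and stopping at the first failed stage is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A check takes candidate source text and returns (passed, detail).
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class StageResult:
    stage: str
    passed: bool
    detail: str


def code_check(source: str) -> Tuple[bool, str]:
    """Stage 1 (assumed form): does the generated script at least parse?"""
    try:
        compile(source, "<candidate>", "exec")  # parse only, never executed
    except SyntaxError as exc:
        return False, f"syntax error: {exc.msg}"
    return True, "parses"


def geometric_check_stub(source: str) -> Tuple[bool, str]:
    """Stage 2 placeholder: a real check would compare geometry."""
    return True, "not implemented in this sketch"


def intent_check_stub(source: str) -> Tuple[bool, str]:
    """Stage 3 placeholder: a real check would apply design-specific rubrics."""
    return True, "not implemented in this sketch"


def evaluate(source: str, stages: List[Tuple[str, Check]]) -> List[StageResult]:
    """Run stages in order and record each outcome."""
    results = []
    for name, check in stages:
        passed, detail = check(source)
        results.append(StageResult(name, passed, detail))
        if not passed:
            break  # assumption: later stages are skipped after a failure
    return results


STAGES = [
    ("code check", code_check),
    ("geometric check", geometric_check_stub),
    ("design-intent alignment", intent_check_stub),
]

broken = evaluate("box = make_box(10, 20,", STAGES)
valid = evaluate("box = make_box(10, 20, 5)", STAGES)
```

Here `broken` holds a single failed result from the code check, while `valid` holds one result per stage.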
We introduce MUSE, a Text-to-CAD benchmark focused on complex, editable boundary representation (B-Rep) assemblies.
Paper contribution / dataset creation described in abstract; supported by project website and accompanying dataset/code.
By Round 3, equity-aware LLM refinement reduces energy costs by 3.2%.
Empirical results reported in abstract: energy cost reduction of 3.2% after three rounds of LLM-mediated reward refinement (15 experimental runs).
By Round 3, equity-aware LLM refinement improves satisfaction for Elderly Females (+567%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +567%. 15 experimental runs.
By Round 3, equity-aware LLM refinement improves satisfaction for Health Sensitive (+53.8%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +53.8%. 15 experimental runs.
By Round 3, equity-aware LLM refinement improves satisfaction for Mid-aged Females (+28.2%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +28.2%. 15 experimental runs.
By Round 3, equity-aware LLM refinement improves satisfaction for Young Males (+17.6%).
Empirical results reported in abstract following three rounds of LLM-based reward refinement; improvement magnitude given as +17.6%. 15 experimental runs.
We introduce the Comfort Equity Index (CEI) as a novel feedback signal.
Paper contribution / methodological description introducing CEI (no quantitative validation details reported in abstract).
The paper provides a conceptual foundation for designing AI systems that model expert sensing over time, positioning cognition as an infrastructural, operational, and professional domain in persistent human-AI systems.
Stated contribution of the paper (conceptual/theoretical contribution rather than empirical evidence).
The Cognitive Operations Research and Training Framework (CORTF) is introduced to support research, education, and workforce development.
Conceptual framework proposed in the paper (no empirical implementation or evaluation presented).
The Cognitive Operations Manager is proposed as a prototype AI-native professional role for coordinating tacit signal modelling, semantic modelling, AI system calibration, expert validation, and ethical governance.
Proposal of a new professional role in the paper (conceptual/visionary; no pilot study, job analysis, or workforce data reported).
Long-term Cognitive Operations are defined as the practices required to maintain and govern such systems, including memory curation, semantic organisation, tacit signal modelling, reasoning calibration, and cognitive governance.
Conceptual taxonomy/definition introduced in the paper (theoretical framing; no empirical validation).
Tacit Signal Infrastructure is introduced as a layer for capturing, structuring, modelling, interpreting, and validating expert tacit signals over time.
Conceptual design/proposal presented in the paper (architectural description; no empirical implementation or evaluation reported).
Next-generation AI systems should move beyond explicit knowledge processing toward the longitudinal modelling of expert tacit sensing.
Normative proposal / recommendation made in the paper as part of a vision; supported by conceptual rationale rather than empirical data.
High-level expertise also depends on tacit sensing: perceiving weak signals, recognising emerging tensions, detecting coherence degradation, and anticipating instability before formal indicators appear.
Conceptual claim grounded in cognitive-science-informed argumentation presented in the paper (no empirical study or sample size reported).
Current generative AI systems are increasingly effective at processing explicit knowledge, including retrieving information, summarising documents, generating explanations, and supporting codified workflows.
Asserted in the paper as a descriptive trend; based on literature synthesis and observations of current generative AI capabilities (no empirical sample or experiment reported in the paper).
To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.
Authors' recommendations based on observed shortcomings in human–AI collaboration in the study (no direct experimental test of these interventions reported in the abstract).
Human–AI collaboration performs better than either AI or humans alone.
Comparison of collaborative team performance versus AI-alone and human-alone performance reported from the experiment.
Two non-negotiable design requirements guide the architecture: cognitive-load redistribution (DR1) and bounded autonomy with alignment (DR2).
Design requirements explicitly stated in the paper guiding the HARMONY architecture.
The model introduces 'Orchestration Leverage' as a candidate productivity metric suited to human–agent hybrid systems.
Conceptual proposal within the paper (new metric introduced as part of HARMONY).
We propose HARMONY (Hybrid Agentic Research Model for Organisational New Yield), a four-pillar socio-technical architecture comprising ResOps (Industrialized Execution), the Control Tower (Strategic Visibility and Drift Detection), the Ethics Fabric (Bounded Autonomy by Design), and the Talent Studio (Sciencepreneur Capability).
Design Science Research artifact (proposed operating model described in the paper).
The framework establishes a principled vocabulary for designing enterprise service platforms that manage human and artificial intelligence labor responsibly, transparently, and at scale.
Paper presents the combined constructs (Workforce Unit Abstraction, Hybrid Capacity Model, Governance-bound Autonomy) as a coherent reference model and vocabulary; described as conceptual contribution arising from the design-science approach.
Governance-bound autonomy constrains AI Workforce Unit actions within a five-level, policy-enforced autonomy ladder supported by six mandatory governance controls.
Conceptual governance artifact described in the paper (five-level autonomy ladder + six governance controls); presented as the proposed governance design, not as an empirically tested intervention in the abstract.
The Hybrid Capacity Model extends demand-to-supply planning across heterogeneous workforce pools, resolving a multi-objective allocation problem that simultaneously optimizes cost, quality, and risk constraints.
Described model/algorithmic artifact in the paper (Hybrid Capacity Model) claiming multi-objective optimization; no empirical benchmark or sample size reported in the provided text.
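One common way to handle a cost, quality, and risk trade-off like the one described above is weighted-sum scalarization with a risk feasibility filter. The sketch below illustrates that generic technique only. It is not the paper's Hybrid Capacity Model, and the `Unit` fields, the weights, and `pick_unit` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Unit:
    """Hypothetical workforce unit; fields are illustrative, not the paper's schema."""
    name: str
    cost: float     # cost per task, arbitrary units
    quality: float  # expected quality in [0, 1]
    risk: float     # estimated risk in [0, 1]


def pick_unit(units: List[Unit], max_risk: float,
              w_cost: float = 1.0, w_quality: float = 1.0) -> Optional[Unit]:
    """Pick the feasible unit with the lowest weighted score.

    Risk is treated as a hard constraint; cost and quality are folded
    into a single score (lower is better).
    """
    feasible = [u for u in units if u.risk <= max_risk]
    if not feasible:
        return None
    return min(feasible, key=lambda u: w_cost * u.cost - w_quality * u.quality)


# Illustrative values only.
pool = [
    Unit("human_analyst", cost=1.0, quality=0.9, risk=0.1),
    Unit("ai_agent", cost=0.2, quality=0.8, risk=0.4),
]

routine = pick_unit(pool, max_risk=0.5)     # both feasible, cheaper unit wins
sensitive = pick_unit(pool, max_risk=0.2)   # only the low-risk unit is feasible
blocked = pick_unit(pool, max_risk=0.05)    # nothing satisfies the risk cap
```

A weighted sum is the simplest scalarization. It cannot recover every Pareto-optimal trade-off, so a real planner might use a constrained or lexicographic formulation instead.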
The Workforce Unit Abstraction defines a unified seven-attribute operational schema applicable to both human workers and AI agents, enabling consistent representation across planning, scheduling, and governance systems.
Artifact description from the paper (Workforce Unit Abstraction with seven attributes); presented as a designed schema rather than an empirically validated result in the abstract.
This article introduces three constructs as reusable primitives for hybrid workforce platform design.
Design science research methodology producing an artifact (three constructs); described as the paper's contribution. No empirical evaluation or sample size reported in the abstract.
Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).
Conceptual framing presented in the paper describing a proposed progression of disciplines.
A Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio.
Quantitative model reported in the paper: Wright's Law fit on 82 artifacts with reported p-value < 0.01.
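Wright's Law models the cost (or time) of the n-th unit as y_n = a * n^b, which becomes a straight line in log-log space. The sketch below shows one standard way to fit it by ordinary least squares using only the standard library. The data are synthetic, and the assumption that the fitted quantity is time per artifact is ours, since the summary does not say what was measured.

```python
import math
from typing import List, Tuple


def fit_wrights_law(unit_costs: List[float]) -> Tuple[float, float, float]:
    """Fit y_n = a * n**b by least squares on (log n, log y).

    Returns (a, b, progress_ratio), where progress_ratio = 2**b is the
    factor by which unit cost changes each time cumulative output doubles.
    """
    xs = [math.log(n) for n in range(1, len(unit_costs) + 1)]
    ys = [math.log(c) for c in unit_costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b, 2 ** b


# Synthetic example: 82 artifacts following an exact power law (not the paper's data).
hours = [10.0 * n ** -0.3 for n in range(1, 83)]
a, b, progress_ratio = fit_wrights_law(hours)
```

A negative `b` (progress ratio below 1) corresponds to production getting faster as cumulative output grows. Testing whether the slope differs from zero would need a standard error for `b`, which this sketch omits.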
A Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level.
Quantitative test reported in the paper: Cochran-Armitage trend test on 200 interactions across two chat LLMs, reported p-value < 0.01.
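For readers unfamiliar with it, the Cochran-Armitage test checks whether a binary proportion (here, first-pass acceptance) trends across ordered groups (here, prompt-sophistication levels). Below is a minimal standard-library sketch of the usual statistic with a normal approximation. The counts are made up for illustration and are not the paper's data, and equally spaced scores are an assumption.

```python
import math
from typing import List, Optional, Tuple


def cochran_armitage(successes: List[int], totals: List[int],
                     scores: Optional[List[float]] = None) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for a linear trend in proportions."""
    if scores is None:
        scores = list(range(len(totals)))  # assumption: equally spaced levels
    n_all = sum(totals)
    p = sum(successes) / n_all  # pooled success proportion
    # Score-weighted deviation of observed from expected successes.
    t_stat = sum(s * (r - n * p) for s, r, n in zip(scores, successes, totals))
    sum_ns = sum(n * s for n, s in zip(totals, scores))
    sum_ns2 = sum(n * s * s for n, s in zip(totals, scores))
    variance = p * (1 - p) * (sum_ns2 - sum_ns ** 2 / n_all)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Illustrative counts: accepted outputs out of 40 interactions per level.
z_trend, p_trend = cochran_armitage([10, 20, 30], [40, 40, 40])
z_flat, p_flat = cochran_armitage([20, 20, 20], [40, 40, 40])
```

With these made-up counts the acceptance rate rises from 25% to 75% across levels, so the trend statistic is large and positive. The flat case gives z = 0.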
A 5-month formative case study (Nov 2025 to Mar 2026) documents a single practitioner applying Augment Engineering skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists.
Case study reported in the paper describing one practitioner's activities over five months across a 10-component stack in seven domains; sample size = 1 practitioner.
The paper presents a six-phase orchestration methodology and four portability metrics for Augment Engineering.
Stated methodological contribution within the paper (description of methodology and metrics).
Augment Engineering is a discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries.
Definition and conceptual development presented in the paper (methodological contribution).
Prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design) are domain-portable meta-skills: a practitioner who masters them can apply them to any purpose-built AI tool in any domain.
Conceptual claim supported by the paper's argumentation and exemplified by a single-practitioner case study.
The framework has implications for digital health, education, AI personalisation, and personal agency.
Authors' discussion in paper of potential implications across these application domains; presented qualitatively.
The authors list six operational requirements for state-aware systems.
Explicit statement in paper that six operational requirements are listed; descriptive rather than empirically tested in abstract.
The authors derive seven testable predictions from the state-aware framework.
Explicit statement in paper that seven testable predictions are derived from the framework; no individual prediction effects quantified in abstract.
The paper is supported by a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026).
Empirical dataset described in the paper: observational deployment over 24 months, >200,000 consented users, four occupational personas, timeframe given (2023–2026).
The framework is motivated by six strands of established evidence: causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, and computational psychiatry.
Explicit statement in paper describing the literature strands used to motivate the framework.
Taken together, these claims imply that the outcome of a given event is controllable, conditional on the state trajectory at the time of intervention.
Synthesis/implication drawn by authors from the conceptual framework and the six literature strands; argued but not quantified in abstract.
The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent.
Theoretical claim supported by attentional bottleneck literature cited in the paper; presented as part of the conceptual framework.
The weighting vector (state) is dynamic at sub-daily timescales.
Claim motivated by chronobiology and related literature cited in the paper; authors state the sub-daily dynamism as part of their framework.
The relationship between state, decision, and outcome is causal rather than correlational.
Argument grounded in causal inference literature cited by the authors; presented as a core theoretical claim in the paper rather than demonstrated by a specific randomized experiment in the abstract.
A state can be defined as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome.
Explicit definitional claim / framework component introduced by the authors; justified conceptually via multidisciplinary literature cited in the paper.
Human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed.
Theoretical argument in the paper, motivated by the six literature strands; supported in part by the authors' deployed behavioural platform (see separate claim about dataset) but no randomized effect sizes reported in abstract.
This persistent variability is best modelled as a dynamic, time-varying latent state of the person.
Conceptual claim supported by integration of six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) cited in the paper.
Within-person variability persists: the same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts.
Statement motivated by literature review across behavioural sciences; argued in paper as empirical puzzle rather than proven with new statistics in this manuscript.
Agents share successes and failures to reduce redundant exploration during long-running experiments.
The AutoScientists design includes mechanisms for recording and sharing experimental outcomes; the paper asserts that this reduces redundant exploration (a qualitative claim, supported by experimental comparisons).
Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).
Empirical evaluation across all 217 assays in the ProteinGym benchmark; reported aggregate improvement in Spearman correlation versus prior state-of-the-art.
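Spearman correlation, the metric cited above, is the Pearson correlation of the ranks of two variables. The sketch below computes it with the standard library, using average ranks for ties, and averages it over several assays. The helper names and the toy data are illustrative, and simple averaging across assays is an assumption about how the aggregate was formed.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation between two equal-length sequences."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


def mean_spearman(assays: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average per-assay Spearman over (predicted, measured) pairs."""
    return sum(spearman(pred, meas) for pred, meas in assays) / len(assays)


# Toy data: predicted scores vs. measured fitness for two small "assays".
toy_assays = [
    ([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]),
    ([0.1, 0.4, 0.2, 0.9], [1.0, 3.0, 2.0, 4.0]),
]
rho_first = spearman(*toy_assays[0])
rho_mean = mean_spearman(toy_assays)
```

Because only ranks matter, the metric rewards a model for ordering variants correctly even when its raw scores sit on a different scale from the measurements.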