Evidence (13827 claims)
Adoption: 8454 claims
Productivity: 7544 claims
Governance: 6789 claims
Human-AI Collaboration: 6327 claims
Org Design: 4126 claims
Innovation: 4058 claims
Labor Markets: 3520 claims
Skills & Training: 2924 claims
Inequality: 2057 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 749 | 195 | 97 | 889 | 1979 |
| Governance & Regulation | 815 | 391 | 188 | 121 | 1539 |
| Organizational Efficiency | 771 | 189 | 124 | 83 | 1177 |
| Technology Adoption Rate | 624 | 233 | 123 | 96 | 1084 |
| Research Productivity | 410 | 121 | 56 | 331 | 929 |
| Output Quality | 466 | 177 | 59 | 47 | 749 |
| Decision Quality | 320 | 174 | 75 | 42 | 618 |
| Firm Productivity | 435 | 55 | 88 | 20 | 604 |
| AI Safety & Ethics | 214 | 276 | 65 | 33 | 593 |
| Market Structure | 178 | 166 | 122 | 24 | 495 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 165 | 57 | 60 | 17 | 299 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Employment Level | 105 | 51 | 107 | 13 | 278 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 116 | 63 | 42 | 11 | 232 |
| Firm Revenue | 149 | 46 | 26 | 3 | 224 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Task Completion Time | 169 | 29 | 8 | 12 | 219 |
| Worker Satisfaction | 89 | 61 | 20 | 12 | 182 |
| Error Rate | 69 | 91 | 10 | 2 | 172 |
| Regulatory Compliance | 76 | 68 | 14 | 5 | 163 |
| Training Effectiveness | 92 | 19 | 13 | 19 | 145 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Automation Exposure | 51 | 54 | 22 | 12 | 142 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Developer Productivity | 94 | 17 | 14 | 6 | 132 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Skill Obsolescence | 5 | 45 | 6 | 1 | 57 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
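The matrix is easier to compare across rows when counts are converted to shares. The sketch below is one way to do that, not a method used by the source; the function name `direction_shares` and the `unlisted` key are assumptions introduced for illustration. In some rows the four direction columns do not sum to the listed Total (for example, Firm Productivity's four columns sum to 598 against a Total of 604), so the sketch divides by the listed Total and reports the remainder separately.

```python
def direction_shares(positive, negative, mixed, null, total):
    """Convert one matrix row into shares of the listed Total.

    The 'unlisted' share is whatever part of Total is not covered by the
    four direction columns (a hypothetical label, not a source category).
    """
    counts = {"positive": positive, "negative": negative,
              "mixed": mixed, "null": null}
    shares = {name: count / total for name, count in counts.items()}
    shares["unlisted"] = (total - sum(counts.values())) / total
    return shares


# Rows copied from the Evidence Matrix above.
firm_productivity = direction_shares(435, 55, 88, 20, 604)
job_displacement = direction_shares(12, 80, 20, 1, 113)

for label, row in [("Firm Productivity", firm_productivity),
                   ("Job Displacement", job_displacement)]:
    print(label, {name: round(share, 3) for name, share in row.items()})
```

Dividing by the listed Total rather than the column sum keeps the shares comparable with the Total column, at the cost of shares that may not add to one without the remainder term.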
A Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio.
Quantitative model reported in the paper: Wright's Law fit on 82 artifacts with reported p-value < 0.01.
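For readers unfamiliar with Wright's Law, the following is a minimal sketch of how such a fit is commonly done, assuming the usual power-law form y_n = a * n^(-b) for the effort of the n-th unit. The paper's actual data, variables, and fitting procedure are not given here, so the function name `fit_wrights_law` and the synthetic series are assumptions, and the sketch does not compute a p-value.

```python
import math


def fit_wrights_law(unit_times):
    """Fit y_n = a * n**(-b) by ordinary least squares on log-log axes.

    unit_times[i] is the effort for the (i+1)-th artifact (hypothetical data).
    Returns (a, b, learning_rate), where learning_rate = 2**(-b) is the factor
    by which per-unit effort changes each time cumulative output doubles.
    """
    xs = [math.log(n) for n in range(1, len(unit_times) + 1)]
    ys = [math.log(t) for t in unit_times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    b = -slope
    return math.exp(intercept), b, 2 ** (-b)


# Synthetic, noise-free series for 82 units generated with a = 10, b = 0.3.
# These values are illustrative only and are NOT taken from the paper.
synthetic = [10.0 * n ** -0.3 for n in range(1, 83)]
a, b, learning_rate = fit_wrights_law(synthetic)
print(round(a, 3), round(b, 3), round(learning_rate, 3))
```

A positive fitted `b` (learning rate below 1) is what "production acceleration" would look like under this form; testing whether `b` differs from zero would require a significance test on the slope, which is omitted here.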
A Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level.
Quantitative test reported in the paper: Cochran-Armitage trend test on 200 interactions across two chat LLMs, reported p-value < 0.01.
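The Cochran-Armitage test checks for a linear trend in a proportion across ordered groups. Below is a minimal sketch of the standard test statistic with a two-sided normal approximation. The number of sophistication levels, the per-level counts, and the equally spaced scores are all hypothetical; the source reports only n = 200 and p < 0.01.

```python
import math


def cochran_armitage(successes, totals, scores=None):
    """Cochran-Armitage trend test for ordered groups.

    successes[i] of totals[i] trials succeeded in group i. Scores default to
    0, 1, 2, ... (equally spaced). Returns (z, two_sided_p_value) using the
    normal approximation.
    """
    if scores is None:
        scores = list(range(len(totals)))
    n = sum(totals)
    p_bar = sum(successes) / n
    # Score-weighted deviation of observed from expected successes.
    t_stat = sum(s * (r - m * p_bar)
                 for s, r, m in zip(scores, successes, totals))
    s1 = sum(m * s for s, m in zip(scores, totals))
    s2 = sum(m * s * s for s, m in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: four prompt-sophistication levels, 50 interactions each
# (200 in total), with first-pass acceptances rising by level.
accepted = [15, 22, 30, 38]
interactions = [50, 50, 50, 50]
z, p_value = cochran_armitage(accepted, interactions)
print(f"z = {z:.3f}, p = {p_value:.2e}")
```

A positive `z` with a small `p_value` indicates acceptance rising with level under these made-up counts; it says nothing about the paper's actual effect size.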
A 5-month formative case study (Nov 2025 to Mar 2026) documents a single practitioner applying Augment Engineering skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists.
Case study reported in the paper describing one practitioner's activities over five months across a 10-component stack in seven domains; sample size = 1 practitioner.
The paper presents a six-phase orchestration methodology and four portability metrics for Augment Engineering.
Stated methodological contribution within the paper (description of methodology and metrics).
Augment Engineering is a discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries.
Definition and conceptual development presented in the paper (methodological contribution).
Prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design) are domain-portable meta-skills: a practitioner who masters them can apply them to any purpose-built AI tool in any domain.
Conceptual claim supported by the paper's argumentation and exemplified by a single-practitioner case study.
The framework has implications for digital health, education, AI personalisation, and personal agency.
Authors' discussion in paper of potential implications across these application domains; presented qualitatively.
The authors list six operational requirements for state-aware systems.
Explicit statement in paper that six operational requirements are listed; descriptive rather than empirically tested in abstract.
The authors derive seven testable predictions from the state-aware framework.
Explicit statement in paper that seven testable predictions are derived from the framework; no individual prediction effects quantified in abstract.
The paper is supported by a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026).
Empirical dataset described in the paper: observational deployment over 24 months, >200,000 consented users, four occupational personas, timeframe given (2023–2026).
The framework is motivated by six strands of established evidence: causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, and computational psychiatry.
Explicit statement in paper describing the literature strands used to motivate the framework.
Taken together, these claims imply that the outcome of a given event is controllable, conditional on the state trajectory at the time of intervention.
Synthesis/implication drawn by authors from the conceptual framework and the six literature strands; argued but not quantified in abstract.
The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent.
Theoretical claim supported by attentional bottleneck literature cited in the paper; presented as part of the conceptual framework.
The weighting vector (state) is dynamic at sub-daily timescales.
Claim motivated by chronobiology and related literature cited in the paper; authors state the sub-daily dynamism as part of their framework.
The relationship between state, decision, and outcome is causal rather than correlational.
Argument grounded in causal inference literature cited by the authors; presented as a core theoretical claim in the paper rather than demonstrated by a specific randomized experiment in the abstract.
A state can be defined as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome.
Explicit definitional claim / framework component introduced by the authors; justified conceptually via multidisciplinary literature cited in the paper.
Human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed.
Theoretical argument in the paper, motivated by the six literature strands; supported in part by the authors' deployed behavioural platform (see separate claim about dataset) but no randomized effect sizes reported in abstract.
This persistent variability is best modelled as a dynamic, time-varying latent state of the person.
Conceptual claim supported by integration of six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) cited in the paper.
Within-person variability persists: the same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts.
Statement motivated by literature review across behavioural sciences; argued in paper as empirical puzzle rather than proven with new statistics in this manuscript.
Agents share successes and failures to reduce redundant exploration during long-running experiments.
Design of AutoScientists includes mechanisms for recording and sharing experimental outcomes; asserted benefit in paper that this reduces redundant exploration (qualitative and supported by experimental comparisons).
Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).
Empirical evaluation across all 217 assays in the ProteinGym benchmark; reported aggregate improvement in Spearman correlation versus prior state-of-the-art.
On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation.
Empirical evaluation on the ACE2-Spike assay within the ProteinGym benchmark; reported relative improvement in Spearman correlation versus prior state-of-the-art.
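The ProteinGym results above are reported as Spearman correlations between predicted and measured fitness. As a reminder of what that metric measures, here is a minimal sketch of Spearman's rank correlation (Pearson correlation of average ranks, which handles ties). The function names and the toy scores are assumptions; this is not the benchmark's evaluation code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy example: hypothetical measured fitness vs. model scores for five variants.
measured = [0.1, 0.4, 0.5, 0.7, 0.9]
predicted = [0.3, 0.2, 0.8, 0.6, 1.0]
rho = spearman(measured, predicted)
print(round(rho, 3))
```

Because the metric depends only on rank order, a model can improve it without its raw scores moving closer to the measured values.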
On GPT training optimization, AutoScientists continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements).
Empirical comparison of discovered/accepted improvements during GPT training optimization; counts of accepted improvements for AutoScientists (7) versus single-agent approach (0).
On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch.
Empirical training-time comparison between AutoScientists and Autoresearch on GPT training optimization tasks; reported speedup multiplier to reach a validation bits-per-byte target.
On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%.
Empirical evaluation on the BioML-Bench benchmark (24 tasks); reported mean leaderboard percentile and comparative improvement versus the strongest baseline agent.
Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction.
Empirical comparisons reported in paper across multiple benchmark suites and tasks (BioML-Bench, GPT training optimization experiments, ProteinGym).
AutoScientists is a decentralized team of AI agents that interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration.
System design and implementation described in the paper (architecture and agent protocols); qualitative description of agent behaviors and coordination mechanisms; demonstrated in experiments.
We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
Statement of the paper's contributions and contents (methodological description of what the paper includes).
By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation.
Stated objective/claim in the paper about the benchmark's purpose and what it measures (conceptual/goal-oriented statement).
OR-Space defines an Explain task mode, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.
Definition of the Explain task mode provided in the paper (design/specification).
OR-Space defines a Revise task mode, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic.
Definition of the Revise task mode in the benchmark design (descriptive claim in the paper).
OR-Space defines three task modes (Build, Revise, and Explain). In Build, agents construct solver-ready optimization models from heterogeneous artifacts.
Definition of one of the benchmark's task modes as described in the paper (method/design description).
Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files.
Design specification of OR-Space provided in the paper (descriptive claim about benchmark instance structure).
We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation.
Paper presents and names a new benchmark (methodological contribution described directly in the text).
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling.
Statement in the paper asserting an observed trend; likely based on literature/context motivating the work (no empirical sample or quantitative citation provided in the excerpt).
A recommended organizational design for the AI era is the 'resonance protocol enterprise' in which structures are temporary crystallizations, AI governance protects adaptive openness, and legitimacy derives from sustaining recursive renewal.
Normative/proposal in the paper outlining a new organizational design paradigm; presented as conceptual design without empirical pilot or evaluation.
Digital transformation initially enhanced some organizations' adaptability by making information flows more fluid and expanding relational connectivity.
Theoretical claim supported by qualitative interpretation of digital transformation phenomena; no systematic measurement or reported sample.
Organizations capable of rapid relational reconfiguration, customer reconnection, and generative experimentation often proved more resilient during the pandemic.
Illustrative/theoretical interpretation of pandemic cases offered in the paper; no quantified sample or formal empirical evidence reported.
Although AI creates obstacles, it also has the potential to be an important tool for creating innovative opportunities and continued growth if managed with sound practices.
Concluding statement in the paper's abstract presenting a normative/conditional conclusion based on the paper's evaluation and synthesis of evidence (no primary quantified results provided in the supplied text).
AI leads to the creation of new jobs.
The paper explicitly states it examines the creation of new jobs as a ramification of AI (abstract); claim presented qualitatively without reported sample sizes or quantified effect in the provided text.
GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.
Architectural description in the paper; claim about knowledge base acting as ground truth and enabling capability compounding (design-level claim). No quantitative evaluation given in the abstract.
GENESIS is an agentic AI framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base.
System design / implementation claim presented in the paper (description of proposed framework). The abstract does not report empirical evaluation metrics or sample size.
Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes.
Paper's stated comparison/claim (likely based on prior reports or authors' experience); no experimental details or sample size provided in the abstract.
Operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.
Author's forward-looking argument / conjecture about the potential future impact and adoption of operational reasoning paradigms; presented as an argument rather than demonstrated empirically in the excerpt.
The paper presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems.
Author statement about the paper's content and demonstration (explicitly claims an architecture and an example walkthrough); evidence is the paper's own descriptive content.
The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle.
Author claim about the architecture and components of ReasonOps; presented as a proposed integrated lifecycle in the paper (no empirical evaluation reported in excerpt).
ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task.
Author description of the ReasonOps paradigm and its operational stance (conceptual framework described in paper).
This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems.
Declarative claim about the paper's contribution (introduction of a named paradigm); supported by the paper itself (architectural description and example claimed).
Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning.
Author statement citing multiple research directions (theorem proving, autoformalization, symbolic reasoning, tool-augmented LMs); no specific empirical results or quantitative studies provided in excerpt.
Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents.
Author assertion in paper's introduction; conceptual argument referencing recent developments in LLMs (no empirical study or sample size reported in text excerpt).