Evidence (6574 claims)
| Topic | Claims |
|---|---|
| Adoption | 8625 |
| Productivity | 7686 |
| Governance | 6917 |
| Human-AI Collaboration | 6574 |
| Org Design | 4189 |
| Innovation | 4131 |
| Labor Markets | 3588 |
| Skills & Training | 2985 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
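
To compare rows of different sizes, the counts can be converted into shares. The sketch below is illustrative only: it copies three rows from the matrix and divides by the sum of the four direction columns rather than the Total column, because the two do not always agree (Firm Productivity lists a total of 610, while its four columns sum to 604). The function name `direction_shares` is a hypothetical helper, not part of the dashboard.

```python
# Rows copied from the Evidence Matrix: (positive, negative, mixed, null).
ROWS = {
    "Firm Productivity": (439, 57, 88, 20),
    "Job Displacement": (12, 81, 21, 1),
    "Inequality Measures": (44, 123, 50, 6),
}

LABELS = ("positive", "negative", "mixed", "null")


def direction_shares(counts):
    """Return each direction's share of the four listed columns."""
    total = sum(counts)
    if total == 0:
        raise ValueError("row has no claims in the four direction columns")
    return {label: n / total for label, n in zip(LABELS, counts)}


for outcome, counts in ROWS.items():
    shares = direction_shares(counts)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative"
    )
```

Using the column sum as the denominator keeps the four shares adding to one for every row, which the Total column would not guarantee.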
Human-AI Collaboration
On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation.
Empirical evaluation on the ACE2-Spike assay within the ProteinGym benchmark; reported relative improvement in Spearman correlation versus prior state-of-the-art.
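For readers unfamiliar with the metric, Spearman correlation is the Pearson correlation of the ranks of two variables, here predicted and measured fitness. The following is a minimal pure-Python sketch of that definition with average ranks for ties; it is not the ProteinGym evaluation code, and the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

Because only ranks matter, any monotonic rescaling of the predictions leaves the score unchanged.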
On GPT training optimization, AutoScientists continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements).
Empirical comparison of discovered/accepted improvements during GPT training optimization; counts of accepted improvements for AutoScientists (7) versus single-agent approach (0).
On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch.
Empirical training-time comparison between AutoScientists and Autoresearch on GPT training optimization tasks; reported speedup multiplier to reach a validation bits-per-byte target.
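Bits-per-byte is conventionally the summed cross-entropy loss, converted from nats to bits, divided by the number of bytes in the evaluated text. The sketch below shows that conversion and one way a "time to target" speedup could be computed. The excerpt does not say whether the 1.9x figure is measured in steps, wall-clock time, or compute, so the step-based comparison, the helper names, and the curves are all assumptions made for illustration.

```python
import math


def bits_per_byte(total_loss_nats, num_bytes):
    """Convert a summed cross-entropy loss in nats into bits per byte."""
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_loss_nats / (math.log(2) * num_bytes)


def first_step_at_or_below(curve, target):
    """Return the first step whose value is <= target, or None if never reached."""
    for step, value in curve:
        if value <= target:
            return step
    return None


def speedup_to_target(baseline_curve, candidate_curve, target):
    """Ratio of baseline steps to candidate steps needed to reach the target."""
    base = first_step_at_or_below(baseline_curve, target)
    cand = first_step_at_or_below(candidate_curve, target)
    if base is None or cand is None or cand <= 0:
        return None  # the target was not reached, or the step count is unusable
    return base / cand


# Hypothetical validation curves as (step, bits-per-byte) pairs; not from the paper.
baseline = [(100, 1.30), (200, 1.15), (300, 1.05), (400, 0.99)]
candidate = [(100, 1.20), (200, 0.99), (300, 0.95)]
print(speedup_to_target(baseline, candidate, target=1.00))
```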
On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%.
Empirical evaluation on the BioML-Bench benchmark (24 tasks); reported mean leaderboard percentile and comparative improvement versus the strongest baseline agent.
Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction.
Empirical comparisons reported in paper across multiple benchmark suites and tasks (BioML-Bench, GPT training optimization experiments, ProteinGym).
AutoScientists is a decentralized team of AI agents that interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration.
System design and implementation described in the paper (architecture and agent protocols); qualitative description of agent behaviors and coordination mechanisms; demonstrated in experiments.
We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
Statement of the paper's contributions and contents (methodological description of what the paper includes).
By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation.
Stated objective/claim in the paper about the benchmark's purpose and what it measures (conceptual/goal-oriented statement).
OR-Space defines an Explain task mode, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.
Definition of the Explain task mode provided in the paper (design/specification).
OR-Space defines a Revise task mode, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic.
Definition of the Revise task mode in the benchmark design (descriptive claim in the paper).
OR-Space defines three task modes; the first is Build, where agents construct solver-ready optimization models from heterogeneous artifacts.
Definition of one of the benchmark's task modes as described in the paper (method/design description).
Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files.
Design specification of OR-Space provided in the paper (descriptive claim about benchmark instance structure).
We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation.
Paper presents and names a new benchmark (methodological contribution described directly in the text).
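The instance structure and task modes described in the OR-Space claims above can be pictured as a small data model. This is a sketch under stated assumptions: the class names, field names, and file names are hypothetical and do not come from the benchmark; only the artifact categories and the three mode names are taken from the claims.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskMode(Enum):
    """The three task modes named in the claims."""
    BUILD = "build"
    REVISE = "revise"
    EXPLAIN = "explain"


@dataclass
class WorkspaceInstance:
    """Hypothetical container for one executable workspace."""
    task_mode: TaskMode
    business_documents: list[str]
    structured_data: list[str]
    solver_outputs: list[str]
    evaluators: list[str]
    # Code artifacts are described as optional, so they default to empty.
    code_artifacts: list[str] = field(default_factory=list)

    def all_files(self):
        """Flatten every artifact path into a single list."""
        return [
            *self.business_documents,
            *self.structured_data,
            *self.code_artifacts,
            *self.solver_outputs,
            *self.evaluators,
        ]


# Illustrative instance; every path below is made up for the example.
instance = WorkspaceInstance(
    task_mode=TaskMode.REVISE,
    business_documents=["docs/requirements.md"],
    structured_data=["data/demand.csv", "data/capacity.csv"],
    solver_outputs=["runs/last_solve.log"],
    evaluators=["eval/check_feasibility.py"],
    code_artifacts=["model/baseline_model.py"],
)
print(instance.task_mode.value, len(instance.all_files()))
```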
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling.
Statement in the paper asserting an observed trend; likely based on literature/context motivating the work (no empirical sample or quantitative citation provided in the excerpt).
A recommended organizational design for the AI era is the 'resonance protocol enterprise' in which structures are temporary crystallizations, AI governance protects adaptive openness, and legitimacy derives from sustaining recursive renewal.
Normative/proposal in the paper outlining a new organizational design paradigm; presented as conceptual design without empirical pilot or evaluation.
Digital transformation initially improved some organizations' adaptability by making information flows more fluid and expanding relational connectivity.
Theoretical claim supported by qualitative interpretation of digital transformation phenomena; no systematic measurement or reported sample.
Organizations capable of rapid relational reconfiguration, customer reconnection, and generative experimentation often proved more resilient during the pandemic.
Illustrative/theoretical interpretation of pandemic cases offered in the paper; no quantified sample or formal empirical evidence reported.
Although AI creates obstacles, it also has the potential to be an important tool for creating innovative opportunities and continued growth if managed with sound practices.
Concluding statement in the paper's abstract presenting a normative/conditional conclusion based on the paper's evaluation and synthesis of evidence (no primary quantified results provided in the supplied text).
AI leads to the creation of new jobs.
The paper explicitly states it examines the creation of new jobs as a ramification of AI (abstract); claim presented qualitatively without reported sample sizes or quantified effect in the provided text.
GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.
Architectural description in the paper; claim about knowledge base acting as ground truth and enabling capability compounding (design-level claim). No quantitative evaluation given in the abstract.
GENESIS is an agentic AI framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base.
System design / implementation claim presented in the paper (description of proposed framework). The abstract does not report empirical evaluation metrics or sample size.
Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes.
Paper's stated comparison/claim (likely based on prior reports or authors' experience); no experimental details or sample size provided in the abstract.
Operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.
Author's forward-looking argument / conjecture about the potential future impact and adoption of operational reasoning paradigms; presented as an argument rather than demonstrated empirically in the excerpt.
The paper presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems.
Author statement about the paper's content and demonstration (explicitly claims an architecture and an example walkthrough); evidence is the paper's own descriptive content.
The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle.
Author claim about the architecture and components of ReasonOps; presented as a proposed integrated lifecycle in the paper (no empirical evaluation reported in excerpt).
ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task.
Author description of the ReasonOps paradigm and its operational stance (conceptual framework described in paper).
This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems.
Declarative claim about the paper's contribution (introduction of a named paradigm); supported by the paper itself (architectural description and example claimed).
Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning.
Author statement citing multiple research directions (theorem proving, autoformalization, symbolic reasoning, tool-augmented LMs); no specific empirical results or quantitative studies provided in excerpt.
Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents.
Author assertion in paper's introduction; conceptual argument referencing recent developments in LLMs (no empirical study or sample size reported in text excerpt).
Agentic Technical Debt and Stochastic Tax are related but distinct: debt can amplify the tax.
Theoretical relationship asserted in the structural model; the note states that debt can amplify the recurring Stochastic Tax and supports this with model expressions, discussion, and an illustrative simulation.
Combining both levers yields a 502% improvement on single-cell RNA denoising over the initial baseline.
Reported experimental result in the paper comparing SIA to the initial baseline on the single-cell RNA denoising task (denoising metric unspecified in abstract).
Combining both levers yields a 91.9% runtime reduction on GPU kernels over the initial baseline.
Reported experimental result in the paper comparing SIA to the initial baseline on the low-level GPU kernel optimisation task (runtime measured).
Combining both levers yields a 56.6% gain on LawBench (Chinese legal charge classification) over the initial baseline.
Reported experimental result in the paper comparing SIA to the initial baseline on the LawBench task.
Combining both levers (harness updates and weight updates) outperforms scaffold iteration alone on all three benchmarks.
Empirical comparison reported in the paper: experiments across the three domains comparing SIA (combined harness+weight updates) against scaffold-iteration-only baseline.
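The three percentages above are not directly comparable as stated, since one is a runtime reduction and the others are gains. The sketch below converts them to multiplicative factors. It assumes the 502% and 56.6% figures are relative increases in a higher-is-better metric, which the excerpt does not confirm (the denoising metric is unspecified); the function names are illustrative.

```python
def speedup_from_runtime_reduction(reduction_pct):
    """A reduction of r% leaves (100 - r)% of the runtime, so speedup = 1 / (1 - r/100)."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def ratio_from_relative_gain(gain_pct):
    """A relative gain of g% means the new value is (1 + g/100) times the baseline."""
    return 1.0 + gain_pct / 100.0


print(f"GPU kernels: {speedup_from_runtime_reduction(91.9):.2f}x faster")
print(f"RNA denoising: {ratio_from_relative_gain(502):.2f}x baseline (if a relative gain)")
print(f"LawBench: {ratio_from_relative_gain(56.6):.3f}x baseline (if a relative gain)")
```

Under these assumptions, a 91.9% runtime reduction leaves 8.1% of the original runtime, or roughly a 12.3x speedup.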
There exists a data supply chain that runs from individual translators through language service providers (LSPs) and platforms to model developers.
Mapping and descriptive analysis of industry supply chains and intermediary roles provided in the paper; conceptual and empirical examples of flows of translation data from translators to model developers. No numerical sample reported.
Article 30-4 of the Japanese Copyright Act legitimates a mode of use the paper terms 'appropriation without consumption'—i.e., mining works for statistical features rather than reading or experiencing them.
Textual/legal analysis of Article 30-4 of the Japanese Copyright Act and its interpretation; comparative legal reading presented in the paper. No numerical sample reported.
The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of translation data (TM/parallel corpora).
Historical and technical literature review linking MT/NLP methodological advances to the availability and use of parallel corpora and TM; comparative analysis of model development histories described in the paper. No numerical sample reported.
Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation.
Conceptual argument and literature review of machine translation practice (discussion of TM/parallel corpora as supervised training data); examples and descriptive evidence from MT research and industry practice presented in the paper. No numerical sample reported.
EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.
Methodological claim that training is performed offline using recorded agent-to-agent interaction data rather than online interactions; described as part of framework benefits.
Transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments.
Reported transfer experiments in which EmoDistill-trained policies were evaluated on different negotiation domains, with unseen counterparties, and in tournaments between trained agents; results reportedly show generalization. (Exact metrics and sample sizes not provided in the excerpt.)
Ablations show that emotion conditioning is essential.
Ablation experiments reported in the paper removing or altering emotion conditioning, which reportedly degrade performance relative to the full EmoDistill model. (No numeric results provided in the excerpt.)
Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection.
Empirical evaluation across four negotiation domains comparing EmoDistill-trained SLM policies to vanilla SLM/LLM baselines and an ablated IQL-only emotion selector. (Paper reports comparative utility results, but exact sample sizes and numeric effect sizes are not provided in the excerpt.)
EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO).
Description of model architecture and training approach: IQL used as selector; LoRA-based policy trained with SFT and JPO for expression. (Design/implementation claim from methods section.)
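The selection half of this decomposition relies on Implicit Q-Learning, whose value function is usually fit with an asymmetric (expectile) squared loss on the difference between Q(s, a) and V(s). The sketch below shows that standard loss and a greedy choice over per-emotion Q-values. The excerpt gives no hyperparameters or emotion set, so `tau=0.7`, the emotion labels, and the Q-values are assumptions for illustration, not EmoDistill's settings.

```python
def expectile_loss(diff, tau=0.7):
    """Expectile loss |tau - 1(diff < 0)| * diff**2, with diff = Q(s, a) - V(s).

    With tau > 0.5, cases where V underestimates Q are penalized more,
    which pushes V toward an upper expectile of Q over dataset actions.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must be in (0, 1)")
    weight = (1.0 - tau) if diff < 0 else tau
    return weight * diff * diff


def select_emotion(q_values):
    """Greedy selection: return the emotion label with the highest Q-value."""
    if not q_values:
        raise ValueError("q_values must not be empty")
    return max(q_values, key=q_values.get)


# Hypothetical Q-values for one negotiation state; labels are made up.
q_for_state = {"calm": 0.42, "firm": 0.57, "empathetic": 0.51}
print(select_emotion(q_for_state))
print(expectile_loss(2.0), expectile_loss(-2.0))
```

The chosen label would then condition the separate expression policy, which the claim describes as a LoRA-based model trained with SFT and JPO.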
We introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents.
Methodological contribution described in the paper: design and presentation of the EmoDistill framework (decomposition, training pipeline). This is a description of a proposed method rather than an empirical result.
Hybrid Fusion significantly accelerated the recovery of smaller Slow AI teams (+6.9% at N=4).
Reported intervention result: Hybrid Fusion produced a +6.9% acceleration in recovery for smaller Slow AI teams, reported at N=4.
Integrating these isolated veridical signals via Hybrid Fusion successfully rescued the Fast AI team (+7.6% at N=8).
Reported intervention result: application of Hybrid Fusion integration produced a +7.6% improvement in Fast AI team performance, reported at N=8.
The Riemannian Oracle adapted to task states by heavily restricting temporal windows (< 0.8s) to intercept fast reflexive compliance and widening windows (> 1.2s) to capture delayed cognitive conflict.
Reported algorithmic behavior of the 2D Adaptive Riemannian Oracle in response to measured spatial covariance: window sizes described as <0.8s for fast states and >1.2s for slow states.
In the Slow AI condition, behavioural teams (N=8) eventually recovered to 100.0% team performance.
Reported team performance metric for behavioural teams in Slow AI condition with N=8; team performance reported to reach 100.0%.
We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
Authors' stated aim/goal for the benchmark (normative/aspirational statement in the paper).
Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work.
Design description of task packaging in JobBench (benchmark construction/methodological detail).