Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

A Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio.

Quantitative model reported in the paper: Wright's Law fit on 82 artifacts with reported p-value < 0.01.

high positive Augment Engineering: A Methodology for Multi-Tool AI Orchest... production acceleration (learning curve effects) across produced artifacts
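
For context on the model named in this claim: Wright's Law is usually written as y = a * x^(-b), where x is cumulative output and y is the cost or time of the x-th unit. The sketch below shows one common way to fit it, ordinary least squares on log-transformed data. The function name `fit_wrights_law` and the synthetic numbers are illustrative assumptions; the paper's 82 artifacts and its actual fitting procedure are not reproduced here.

```python
import math

def fit_wrights_law(cumulative_units, unit_costs):
    """Fit y = a * x**(-b) by least squares on (log x, log y).

    Returns (a, b, progress_ratio), where progress_ratio = 2**(-b) is the
    factor by which unit cost changes each time cumulative output doubles.
    """
    xs = [math.log(x) for x in cumulative_units]
    ys = [math.log(y) for y in unit_costs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line is -b
    intercept = mean_y - slope * mean_x    # intercept is log(a)
    b = -slope
    return math.exp(intercept), b, 2 ** (-b)

# Synthetic data (NOT from the paper): hours per artifact falling as a power law.
units = list(range(1, 11))
hours = [10.0 * u ** -0.3 for u in units]
a, b, progress_ratio = fit_wrights_law(units, hours)
print(f"a={a:.3f}, b={b:.3f}, progress ratio={progress_ratio:.3f}")
```

A positive b (progress ratio below 1) corresponds to the learning-curve effect the claim describes; a significance test on the slope would be a separate step.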

A Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level.

Quantitative test reported in the paper: Cochran-Armitage trend test on 200 interactions across two chat LLMs, reported p-value < 0.01.

high positive Augment Engineering: A Methodology for Multi-Tool AI Orchest... first-pass acceptance rate of generated outputs as a function of prompt sophisti...
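
As a reference for the test named in this claim, here is a minimal sketch of the Cochran-Armitage trend test for a binary outcome across ordered groups, using the standard normal approximation. The counts, the number of levels, and the equally spaced scores are hypothetical; they are not the paper's 200 interactions.

```python
import math

def cochran_armitage(successes, totals, scores=None):
    """Cochran-Armitage test for a linear trend in proportions.

    successes[i] / totals[i] is the observed rate in ordered group i.
    Returns (z, two_sided_p) under the normal approximation.
    """
    k = len(totals)
    if scores is None:
        scores = list(range(k))            # assumption: equally spaced levels
    n = sum(totals)
    p = sum(successes) / n                 # pooled success rate under H0
    # Score-weighted sum of (observed - expected) successes per group.
    t = sum(s * (r - m * p) for s, r, m in zip(scores, successes, totals))
    sum_ms2 = sum(m * s * s for s, m in zip(scores, totals))
    sum_ms = sum(m * s for s, m in zip(scores, totals))
    var = p * (1 - p) * (sum_ms2 - sum_ms ** 2 / n)
    z = t / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return z, p_value

# Hypothetical counts (NOT from the paper): accepted outputs per prompt level.
accepted = [16, 22, 27, 33]
totals = [40, 40, 40, 40]
z, p_value = cochran_armitage(accepted, totals)
print(f"z={z:.3f}, p={p_value:.2e}")
```

A positive z with a small p-value indicates that the acceptance rate rises with the level score, which is the pattern the claim reports.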

A 5-month formative case study (Nov 2025 to Mar 2026) documents a single practitioner applying Augment Engineering skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists.

Case study reported in the paper describing one practitioner's activities over five months across a 10-component stack in seven domains; sample size = 1 practitioner.

high positive Augment Engineering: A Methodology for Multi-Tool AI Orchest... ability of one practitioner to produce cross-domain work products that tradition...

The paper presents a six-phase orchestration methodology and four portability metrics for Augment Engineering.

Stated methodological contribution within the paper (description of methodology and metrics).

high positive Augment Engineering: A Methodology for Multi-Tool AI Orchest... methodology and metrics for orchestration and portability

Augment Engineering is a discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries.

Definition and conceptual development presented in the paper (methodological contribution).

high positive Augment Engineering: A Methodology for Multi-Tool AI Orchest... existence/definition of a new discipline (Augment Engineering)

Prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design) are domain-portable meta-skills: a practitioner who masters them can apply them to any purpose-built AI tool in any domain.

Conceptual claim supported by the paper's argumentation and exemplified by a single-practitioner case study.

high positive Augment Engineering: A Methodology for Multi-Tool AI Orchest... portability of prompt and context engineering skills across tools and domains

The framework has implications for digital health, education, AI personalisation, and personal agency.

Authors' discussion in paper of potential implications across these application domains; presented qualitatively.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... implications for listed application domains

The authors list six operational requirements for state-aware systems.

Explicit statement in paper that six operational requirements are listed; descriptive rather than empirically tested in abstract.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... number of operational requirements

The authors derive seven testable predictions from the state-aware framework.

Explicit statement in paper that seven testable predictions are derived from the framework; no individual prediction effects quantified in abstract.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... number of derived testable predictions

The paper is supported by a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026).

Empirical dataset described in the paper: observational deployment over 24 months, >200,000 consented users, four occupational personas, timeframe given (2023–2026).

high positive You Are in Control of Your State: Why Human Outcomes Are Con... existence and scale of observational dataset

The framework is motivated by six strands of established evidence: causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, and computational psychiatry.

Explicit statement in paper describing the literature strands used to motivate the framework.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... theoretical grounding of framework

Taken together, these claims imply that the outcome of a given event is controllable, conditional on the state trajectory at the time of intervention.

Synthesis/implication drawn by authors from the conceptual framework and the six literature strands; argued but not quantified in abstract.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... conditional controllability of event outcomes

The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent.

Theoretical claim supported by attentional bottleneck literature cited in the paper; presented as part of the conceptual framework.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... attentional bottleneck content dependency on state

The weighting vector (state) is dynamic at sub-daily timescales.

Claim motivated by chronobiology and related literature cited in the paper; authors state the sub-daily dynamism as part of their framework.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... temporal dynamics of latent state

The relationship between state, decision, and outcome is causal rather than correlational.

Argument grounded in causal inference literature cited by the authors; presented as a core theoretical claim in the paper rather than demonstrated by a specific randomized experiment in the abstract.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... causal influence of state on decisions/outcomes

A state can be defined as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome.

Explicit definitional claim / framework component introduced by the authors; justified conceptually via multidisciplinary literature cited in the paper.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... conceptual definition of latent state

Human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed.

Theoretical argument in the paper, motivated by the six literature strands; supported in part by the authors' deployed behavioural platform (see separate claim about dataset) but no randomized effect sizes reported in abstract.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... controllability of outcomes via state-targeted interventions

This persistent variability is attributable to a dynamic latent state of the person (i.e., it is best modelled as a time-varying latent state).

Conceptual claim supported by integration of six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) cited in the paper.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... attribution of outcome variance to latent state

Within-person variability persists: the same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts.

Statement motivated by literature review across behavioural sciences; argued in paper as empirical puzzle rather than proven with new statistics in this manuscript.

high positive You Are in Control of Your State: Why Human Outcomes Are Con... variation in individual outcomes / decisions

Agents share successes and failures to reduce redundant exploration during long-running experiments.

Design of AutoScientists includes mechanisms for recording and sharing experimental outcomes; asserted benefit in paper that this reduces redundant exploration (qualitative and supported by experimental comparisons).

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... redundant exploration (qualitative/system-level reduction)

Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

Empirical evaluation across all 217 assays in the ProteinGym benchmark; reported aggregate improvement in Spearman correlation versus prior state-of-the-art.

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... Spearman correlation averaged across 217 ProteinGym assays

On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation.

Empirical evaluation on the ACE2-Spike assay within the ProteinGym benchmark; reported relative improvement in Spearman correlation versus prior state-of-the-art.

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... Spearman correlation on ACE2-Spike binding fitness prediction
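
Both ProteinGym claims are stated in terms of Spearman correlation, the Pearson correlation of rank-transformed values. A minimal pure-Python sketch follows, with tied values given their average rank. The scores below are made up for illustration and are not benchmark data.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    return pearson(average_ranks(xs), average_ranks(ys))

# Illustrative values only: model scores vs. measured fitness for 5 variants.
predicted = [0.10, 0.40, 0.35, 0.80, 0.70]
measured = [1.0, 2.5, 3.0, 4.2, 5.1]
rho = spearman(predicted, measured)
print(f"rho={rho:.2f}")
```

Because only the ordering matters, a model can score well on this metric without predicting the measured values themselves.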

On GPT training optimization, AutoScientists continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements).

Empirical comparison of discovered/accepted improvements during GPT training optimization; counts of accepted improvements for AutoScientists (7) versus single-agent approach (0).

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... count of accepted improvements discovered

On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch.

Empirical training-time comparison between AutoScientists and Autoresearch on GPT training optimization tasks; reported speedup multiplier to reach a validation bits-per-byte target.

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... time-to-target (validation bits-per-byte)
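
To make the metric in this claim concrete, the sketch below converts a mean per-token loss in nats to bits per byte and computes a time-to-target speedup from two validation curves. The helper names and the curves are hypothetical, and the paper's exact measurement protocol is not specified in this listing.

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    """Total cross-entropy in bits divided by the byte length of the text."""
    return mean_loss_nats * n_tokens / (math.log(2) * n_bytes)

def time_to_target(curve, target):
    """curve: list of (elapsed_seconds, val_bpb) in time order.

    Returns the first elapsed time at which val_bpb <= target, or None.
    """
    for elapsed, bpb in curve:
        if bpb <= target:
            return elapsed
    return None

# Synthetic curves (NOT from the paper).
baseline = [(100, 1.20), (200, 1.10), (300, 1.03), (400, 0.99)]
candidate = [(100, 1.12), (200, 0.98), (300, 0.95)]
target = 1.00

speedup = time_to_target(baseline, target) / time_to_target(candidate, target)
print(f"speedup={speedup:.1f}x")
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers, which is one reason it is used for training-optimization comparisons.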

On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%.

Empirical evaluation on the BioML-Bench benchmark (24 tasks); reported mean leaderboard percentile and comparative improvement versus the strongest baseline agent.

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... leaderboard percentile across benchmark tasks

Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction.

Empirical comparisons reported in paper across multiple benchmark suites and tasks (BioML-Bench, GPT training optimization experiments, ProteinGym).

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... overall performance across multiple benchmarks

AutoScientists is a decentralized team of AI agents that interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration.

System design and implementation described in the paper (architecture and agent protocols); qualitative description of agent behaviors and coordination mechanisms; demonstrated in experiments.

high positive AutoScientists: Self-Organizing Agent Teams for Long-Running... agent coordination and information sharing (qualitative description)

We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Statement of the paper's contributions and contents (methodological description of what the paper includes).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... capability to study reliability, failure modes, and readiness of LLM agents

By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation.

Stated objective/claim in the paper about the benchmark's purpose and what it measures (conceptual/goal-oriented statement).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... reliability of LLM agents in performing optimization work (beyond text generatio...

OR-Space defines an Explain task mode, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts.

Definition of the Explain task mode provided in the paper (design/specification).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... ability to generate grounded explanations using workspace evidence

OR-Space defines a Revise task mode, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic.

Definition of the Revise task mode in the benchmark design (descriptive claim in the paper).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... ability to revise models while preserving prior logic

OR-Space defines three task modes, the first being Build, where agents construct solver-ready optimization models from heterogeneous artifacts.

Definition of one of the benchmark's task modes as described in the paper (method/design description).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... ability to construct solver-ready models

Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files.

Design specification of OR-Space provided in the paper (descriptive claim about benchmark instance structure).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... complexity and composition of benchmark instances

We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation.

Paper presents and names a new benchmark (methodological contribution described directly in the text).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... capability of benchmarks to evaluate OR agents across lifecycle tasks

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling.

Statement in the paper asserting an observed trend; likely based on literature/context motivating the work (no empirical sample or quantitative citation provided in the excerpt).

high positive OR-Space: A Full-Lifecycle Workspace Benchmark for Industria... LLM agent adoption in OR workflows

A recommended organizational design for the AI era is the 'resonance protocol enterprise' in which structures are temporary crystallizations, AI governance protects adaptive openness, and legitimacy derives from sustaining recursive renewal.

Normative/proposal in the paper outlining a new organizational design paradigm; presented as conceptual design without empirical pilot or evaluation.

high positive The Lantern in the Vault: AI, Crisis, and the Ontology of Or... organizational design aimed at sustaining adaptive renewal and legitimacy under ...

Digital transformation initially enhanced some organizations' adaptability by making information flows more fluid and expanding relational connectivity.

Theoretical claim supported by qualitative interpretation of digital transformation phenomena; no systematic measurement or reported sample.

high positive The Lantern in the Vault: AI, Crisis, and the Ontology of Or... organizational adaptability associated with digital transformation practices

Organizations capable of rapid relational reconfiguration, customer reconnection, and generative experimentation often proved more resilient during the pandemic.

Illustrative/theoretical interpretation of pandemic cases offered in the paper; no quantified sample or formal empirical evidence reported.

high positive The Lantern in the Vault: AI, Crisis, and the Ontology of Or... organizational resilience as a function of relational reconfiguration and experi...

Although AI creates obstacles, it also has the potential to be an important tool for creating innovative opportunities and continued growth if managed with sound practices.

Concluding statement in the paper's abstract presenting a normative/conditional conclusion based on the paper's evaluation and synthesis of evidence (no primary quantified results provided in the supplied text).

high positive Impact of Artificial Intelligence on Employment and Society innovation opportunities and continued economic/organizational growth under soun...

AI leads to the creation of new jobs.

The paper explicitly states it examines the creation of new jobs as a ramification of AI (abstract); claim presented qualitatively without reported sample sizes or quantified effect in the provided text.

high positive Impact of Artificial Intelligence on Employment and Society creation of new jobs / net employment effects

GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.

Architectural description in the paper; claim about knowledge base acting as ground truth and enabling capability compounding (design-level claim). No quantitative evaluation given in the abstract.

high positive GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesi... accumulation/compounding of capabilities across runs (longitudinal improvement o...

GENESIS is an agentic AI framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base.

System design / implementation claim presented in the paper (description of proposed framework). The abstract does not report empirical evaluation metrics or sample size.

high positive GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesi... ability to produce solutions validated by over-the-air experiments (end-to-end R...

Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes.

Paper's stated comparison/claim (likely based on prior reports or authors' experience); no experimental details or sample size provided in the abstract.

high positive GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesi... time to complete R&D/software engineering tasks

Operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.

Author's forward-looking argument / conjecture about the potential future impact and adoption of operational reasoning paradigms; presented as an argument rather than demonstrated empirically in the excerpt.

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... future adoption / foundational role of operational reasoning paradigms

The paper presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems.

Author statement about the paper's content and demonstration (explicitly claims an architecture and an example walkthrough); evidence is the paper's own descriptive content.

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... presence of architecture and example demonstration in the paper

The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle.

Author claim about the architecture and components of ReasonOps; presented as a proposed integrated lifecycle in the paper (no empirical evaluation reported in excerpt).

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... integration of multiple reasoning and assurance components

ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task.

Author description of the ReasonOps paradigm and its operational stance (conceptual framework described in paper).

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... operationalization of reasoning processes (monitoring, verification, reliability...

This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems.

Declarative claim about the paper's contribution (introduction of a named paradigm); supported by the paper itself (architectural description and example claimed).

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... existence/introduction of an operational paradigm (ReasonOps)

Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning.

Author statement citing multiple research directions (theorem proving, autoformalization, symbolic reasoning, tool-augmented LMs); no specific empirical results or quantitative studies provided in excerpt.

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... progress toward machine-assisted formal reasoning

Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents.

Author assertion in paper's introduction; conceptual argument referencing recent developments in LLMs (no empirical study or sample size reported in text excerpt).

high positive ReasonOps: A Unified Operational Paradigm for Trustworthy Ve... capability of LLMs to perform reasoning
