Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
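
As a reading aid, the sketch below recomputes two figures from a few rows of the matrix: the share of positive claims, and the gap between Total and the sum of the four direction columns. The row values are copied from the table. The function name `summarize` and the key `outside_columns` are illustrative assumptions, and the table does not say why Total exceeds the four columns in some rows.

```python
# Rows copied from the matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Other": (758, 199, 100, 900, 2007),
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Job Displacement": (12, 80, 20, 1, 113),
}

def summarize(counts):
    """Return the positive share and the count not covered by the four columns."""
    positive, negative, mixed, null, total = counts
    directional = positive + negative + mixed + null
    return {
        "positive_share": positive / total,
        # Hypothetical label: the source does not explain this remainder.
        "outside_columns": total - directional,
    }

for name, counts in ROWS.items():
    s = summarize(counts)
    print(f"{name}: {s['positive_share']:.1%} positive, "
          f"{s['outside_columns']} claims outside the four columns")
```

For rows such as Job Displacement the four columns add up to Total exactly; for the larger categories a remainder is left over, so positive shares are best read against Total rather than against the column sum.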

Filter: Human-AI Collab

Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage.

Dataset creation procedure and reported coverage claim (200 software applications), taxonomy derived from U.S. GDP data as stated.

high positive Gym-Anything: Turn any Software into an Agent Environment number of software applications covered and occupational coverage

Environment creation is framed as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software while producing evidence of correct setup; an independent audit agent verifies evidence against a quality checklist.

Method description of multi-agent pipeline (coding agent + audit agent) in the paper.

high positive Gym-Anything: Turn any Software into an Agent Environment reliability/validity of environment setup via multi-agent workflow
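
The audit step described in this entry can be pictured as a check of submitted evidence against a fixed checklist. This is a minimal sketch under stated assumptions: the checklist items, the `audit` function, and the evidence format are all hypothetical, and the paper's actual checklist is not given in this listing.

```python
# Hypothetical checklist items; the paper's actual checklist is not shown here.
CHECKLIST = ("software_installed", "data_downloaded", "app_launches")

def audit(evidence):
    """Return the checklist items that lack supporting evidence.

    `evidence` maps a checklist item to whatever artifact the coding agent
    produced for it (a log excerpt, a screenshot path, and so on).
    Missing or empty entries count as unverified.
    """
    return [item for item in CHECKLIST if not evidence.get(item)]

evidence = {
    "software_installed": "install.log: exit code 0",
    "data_downloaded": "dataset present, checksum recorded",
    # "app_launches" is absent, so the audit should flag it.
}
missing = audit(evidence)
print("accepted" if not missing else f"rejected, missing: {missing}")
```

Keeping the auditor separate from the agent that wrote the setup scripts means the evidence is judged by a component that did not produce it.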

We introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment.

Methodological contribution described in paper (framework implementation claimed).

high positive Gym-Anything: Turn any Software into an Agent Environment availability of a general framework for environment creation

The study introduces 'career reconfiguration' as a framework explaining intra-role task transformation, extending existing career mobility and job transition theories.

Theoretical/conceptual contribution presented in the paper (framework proposition; not an empirical effect).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... theoretical framing of intra-role task transformation (career reconfiguration)

Mediation analysis confirms that training and organizational support significantly mediate the relationship between AI adoption and career shifts.

Mediation analysis reported in the study (method stated; no mediation coefficients or sample size provided in abstract).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... career shifts (mediated effect of training and organizational support on relatio...

Together, these variables explain 61% of the variance in adaptive outcomes (R² = 0.61).

Multiple regression model summary reported in the paper (R-squared value provided; sample size not stated).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... variance explained in adaptive outcomes (career adaptation)
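
For readers less familiar with the statistic, R² is one minus the ratio of residual to total sum of squares, so R² = 0.61 means the model's residuals carry 39% of the outcome's variance. The sketch below shows that computation on made-up numbers; the data and the function name `r_squared` are illustrative assumptions and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((y - mean) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only (not from the paper).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(round(r_squared(observed, predicted), 2))
```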

Readiness to change is a significant predictor of career adaptation (beta = 0.298, p = 0.011).

Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... career adaptation / adaptive outcomes

Openness to technology is a significant predictor of career adaptation (beta = 0.367, p = 0.003).

Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... career adaptation / adaptive outcomes

Organizational support is a significant predictor of career adaptation (beta = 0.389, p = 0.005).

Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... career adaptation / adaptive outcomes

Skills training is the strongest predictor of career adaptation (beta = 0.412, p = 0.002).

Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).

high positive Artificial Intelligence Adoption and Career Reconfiguration ... career adaptation / adaptive outcomes

SWE-bench alignment: Bench is aligned with SWE-bench-Verified and SWE-bench-Pro.

Paper statement that the constructed benchmark is aligned with SWE-bench-Verified and SWE-bench-Pro (methodological/design alignment described).

high positive Does Pass Rate Tell the Whole Story? Evaluating Design Const... benchmark alignment

Bench contains 495 issues and 1,787 validated design constraints across six repositories.

Reported dataset statistics in paper/abstract: explicit counts of issues (495), validated constraints (1,787), and number of repositories (6).

high positive Does Pass Rate Tell the Whole Story? Evaluating Design Const... other

We construct the DESIGN-AWARE benchmark (Bench) by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier.

Method description in paper: dataset created by mining real-world pull requests, validating constraints, linking constraints to issues, and using an LLM-based verifier to check compliance.

high positive Does Pass Rate Tell the Whole Story? Evaluating Design Const... other

Flowr is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.

Claim of generalizability made by the authors in the paper; presented as an assertion rather than demonstrated through multi-industry empirical tests in the excerpt.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... generalizability / applicability across domains

The framework was validated in collaboration with a large-scale supermarket chain.

Claim of field validation stated in the paper; indicates at least one real-world collaboration but provides no further details (e.g., number of stores, duration, metrics) in the excerpt.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... field validation / real-world deployment

Evaluation indicates Flowr enables proactive exception handling at a scale unachievable through manual processes.

Empirical/operational claim based on the paper's evaluation and deployment context; the excerpt asserts this capability but does not provide quantitative performance metrics or comparison details.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... proactive exception handling capability and scale

Evaluation shows Flowr improves demand–supply alignment.

Empirical claim in the paper's evaluation; reported improvement in demand-supply alignment from deployment or testing with a large supermarket chain, but no numerical metrics provided in the excerpt.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... demand–supply alignment

Evaluation demonstrates that Flowr significantly reduces manual coordination overhead.

Empirical claim reported in the paper's evaluation section; the excerpt notes an evaluation and collaboration with a large supermarket chain but provides no sample size figures or quantitative effect sizes.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... manual coordination overhead (effort/time/coordination burden)

Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control.

Design/organizational claim describing human-in-the-loop orchestration and MCP interface; asserted in the paper without empirical measures of accountability or control in the excerpt.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... preservation of accountability and organizational control during automation

To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM.

Technical/design claim in the paper describing model architecture and approach; no evaluation metrics or tests of accuracy/responsibility provided in the excerpt.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... task accuracy and adherence to responsible AI principles

Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination.

Architectural claim — asserted mechanism of the framework in the paper; presented as part of the framework design, no quantitative evaluation details in the excerpt.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... task decomposition and automation of previously human-coordinated processes

This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations.

Design and system-proposal claim in the paper; supported by framework description rather than empirical testing in the provided text.

high positive Flowr -- Scaling Up Retail Supply Chain Operations Through A... ability to automate end-to-end supply chain workflows (task allocation to AI)

Generative AI helps users solve problems more efficiently.

Motivating empirical observation stated in the paper (no sample or empirical analysis reported in the provided text); assumption used to motivate the theoretical model.

high positive When AI Improves Answers but Slows Knowledge Creation: Match... problem-solving efficiency (implicit)

By elucidating the mechanisms and trade-offs inherent in AI-human collaboration, this work lays a robust foundation for future research on adaptive decision systems.

Authors' forward-looking claim in the abstract that their synthesis clarifies mechanisms/trade-offs and thus supports subsequent research; based on their review and framework.

high positive Advancing Decision-Making through AI-Human Collaboration: A ... foundation for future research on adaptive decision systems

By synthesizing these paradigms, this research advances the theoretical understanding of hybrid decision-making systems and provides actionable insights for organizations navigating complex and AI-driven environments.

Authors' stated contribution based on the conceptual synthesis of the literature and the proposed framework (as reported in the abstract).

high positive Advancing Decision-Making through AI-Human Collaboration: A ... theoretical advancement and provision of actionable organizational insights

The framework introduces four distinct paradigms of AI-human collaborative decision-making: adaptive intuitive decision, programmed algorithmic decision, interpretive analytical decision and integrative hybrid decision.

Authors' conceptual taxonomy reported in the abstract, produced from synthesis of the reviewed literature (627 articles).

high positive Advancing Decision-Making through AI-Human Collaboration: A ... classification of AI-human collaborative decision-making into four paradigms

We developed a novel conceptual framework that identifies two critical dimensions, AI-human dynamics and decision typologies, that shape decision outcomes.

Authors' reported conceptual synthesis derived from the systematic review/bibliometric analysis of the 627 articles.

high positive Advancing Decision-Making through AI-Human Collaboration: A ... identification of critical dimensions affecting decision outcomes

Prompts can be treated as decision policies that allocate discretion between researcher and system, governing what is executed and when iteration stops.

Methodological framing advanced by the authors describing prompts as decision policies; conceptual claim based on the paper's analytic framework rather than empirical measurement.

high positive On the Carbon Footprint of Economic Research in the Age of G... conceptualization of prompts' role in workflow control and decision allocation

Operational constraints and decision-rule prompts deliver large and stable footprint reductions while preserving decision-equivalent topic outputs.

Experimental comparisons of prompt strategies in the benchmarked workflow showing reductions in runtime/CO2e and evaluated topic outputs' decision-equivalence (asserted in abstract; no numeric reductions or sample sizes provided).

high positive On the Carbon Footprint of Economic Research in the Age of G... carbon footprint / runtime reductions and preservation of topic output equivalen...

We benchmark a modern economic survey workflow, an LDA-based literature mapping implemented with GenAI assisted coding and executed in a fixed cloud notebook, measuring runtime and estimated CO2e with CodeCarbon.

Experimental benchmark described in the paper: single implemented workflow (LDA-based literature mapping) executed in a fixed cloud notebook with runtime and CO2e measured using CodeCarbon (methodological claim).

high positive On the Carbon Footprint of Economic Research in the Age of G... runtime and estimated CO2e (carbon footprint) of the benchmarked workflow

Training footprint is the largest cluster in the mapped Green AI literature.

Result from the paper's literature mapping / clustering (statement in abstract; no numeric cluster sizes given).

high positive On the Carbon Footprint of Economic Research in the Age of G... relative prevalence (cluster size) of 'training footprint' theme

We map the recent Green AI literature into seven themes: training footprint is the largest cluster, while inference efficiency and system level optimisation are growing rapidly, alongside measurement protocols, green algorithms, governance, and security and efficiency trade-offs.

Bibliometric / thematic mapping of recent Green AI literature described in the paper (method: literature mapping; exact number of papers or mapping procedure not specified in abstract).

high positive On the Carbon Footprint of Economic Research in the Age of G... distribution of themes within Green AI literature (theme prevalence and growth)

We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.

Paper states intention and contribution: releasing methodology and lessons to allow replication by other organizations.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... ability of other organizations to construct similar benchmarks

We detail data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks to address challenges in constructing reliable evaluation signals from monorepo environments.

Methodological description in paper listing specific practices (LLM-based classification, test relevance validation, multi-run stability checks) aimed at producing reliable evaluation signals in monorepos.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... reliability of evaluation signals derived from monorepo environments

Models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates.

Reported relationship from paper's analysis correlating models' use of verification tools (test execution, static analysis) with higher solve rates across evaluated models.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... solve rate (task success) as a function of verification tool usage

Systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%.

Empirical evaluation reported in paper: four foundation models were evaluated on the ProdCodeBench benchmark producing reported solve-rate range.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... solve rate (task success rate)

Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages.

Descriptive dataset claim in paper specifying components of each sample and that samples cover seven programming languages.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... dataset composition (prompt, code change, tests) and language coverage (7 langua...
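
The sample structure in this entry, together with the solve rates reported above, can be sketched as a small data type plus a scoring rule. This assumes the usual reading of "fail-to-pass": a test that fails before the change and passes after it. The names `Sample`, `is_solved`, and `solve_rate` are hypothetical, and the benchmark's own harness may score differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    prompt: str          # verbatim user prompt
    code_change: str     # committed change, e.g. a diff
    fail_to_pass: tuple  # names of tests expected to flip to passing

def is_solved(sample, results_before, results_after):
    """True if every fail-to-pass test failed before and passes after.

    Both result arguments map a test name to a bool (did it pass?).
    """
    return all(
        results_before.get(t) is False and results_after.get(t) is True
        for t in sample.fail_to_pass
    )

def solve_rate(outcomes):
    """Fraction of solved samples; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Illustrative values only.
sample = Sample("fix the date parser", "<diff>", ("test_parse_iso",))
solved = is_solved(sample, {"test_parse_iso": False}, {"test_parse_iso": True})
print(solved, solve_rate([solved, False]))
```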

We present ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant.

Paper describes methodology and introduces ProdCodeBench explicitly as constructed from real production assistant sessions.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... existence and provenance of benchmark (production-derived dataset)

Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings.

Argument presented in paper motivating creation of production-derived benchmark; no specific empirical comparison to alternative benchmarks reported in the abstract.

high positive ProdCodeBench: A Production-Derived Benchmark for Evaluating... quality of evaluation for AI coding agents (suitability of benchmark)

A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution.

Incident ISS-004 report in the paper giving specific timings for detection latency (10 minutes), user exposure (zero), and resolution (80 minutes).

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... incident detection latency, user exposure, and time-to-resolution

The multi-agent approach improved reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication.

Incident detection reported in the SF2Bench deployment where audited handoffs prevented publication of a coordinate transformation error that would have affected all 2,452 stations.

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... detection/blocking of a systemic coordinate transformation error (error preventi...

The multi-agent approach improved efficiency — the SF2Bench deployment was completed by a single operator in two days with repeated artifact reuse across deployments.

Operational report from the production deployment: single operator completion time of two days and reuse of artifacts across deployments as stated in the paper.

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... time to complete deployment (task completion time) and operator effort

SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow.

Reported dataset composition and use in the paper: SF2Bench with stated counts and temporal span used to validate the multi-agent workflow.

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... scale and temporal coverage of benchmark used to validate workflow (stations, fi...

EnviSmart treats reliability as an architectural property through two mechanisms: (1) a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and (2) a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps.

System architecture and design description in the paper; presented as the core reliability mechanisms implemented in EnviSmart.

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... architectural approach to reliability (design features implemented)
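
The second mechanism, a deterministic validator that stops a handoff before an irreversible step, can be sketched as follows. This is an illustration under stated assumptions, not EnviSmart's implementation: the `Station` type, the range check, and the `publish` function are all hypothetical, chosen to echo the coordinate error reported in the entries above.

```python
from dataclasses import dataclass

class HandoffBlocked(Exception):
    """Raised when a validator rejects an artifact at a trust boundary."""

@dataclass(frozen=True)
class Station:
    station_id: str
    lat: float
    lon: float

def validate_coordinates(stations):
    """Deterministic check: coordinates must be plausible degrees."""
    bad = [s.station_id for s in stations
           if not (-90.0 <= s.lat <= 90.0 and -180.0 <= s.lon <= 180.0)]
    if bad:
        raise HandoffBlocked(f"{len(bad)} station(s) out of range, e.g. {bad[0]}")

def publish(stations, audit_log):
    """Publish only after validation; a failure stops before any side effect."""
    validate_coordinates(stations)  # fail-stop: nothing below runs on failure
    audit_log.append(("published", len(stations)))
    return len(stations)

log = []
try:
    # Latitude 4512345.0 looks like an untransformed projected coordinate.
    publish([Station("S1", 4512345.0, 12.0)], log)
except HandoffBlocked as err:
    log.append(("blocked", str(err)))
print(log)
```

Because the check raises before the publish step records anything, a systematic error is caught at the boundary and never reaches published files.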

We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research.

System description and statement of deployment in the paper; presented as a production deployment (no randomized evaluation reported).

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... existence and production deployment of EnviSmart

Embedding LLM-driven agents into environmental FAIR data management can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions.

Conceptual / argumentative claim made in the paper as a motivation for the system; no quantitative experiment tied to this statement in the excerpt.

high positive Exploring Robust Multi-Agent Workflows for Environmental Dat... ability to externalize operational knowledge and scale curation

Overcoming the structural skill deficit through deliberate investment in tertiary education reform and strong private-public partnerships for continuous vocational learning is mandatory for Nigeria to successfully leverage the AI revolution for inclusive economic growth and ensure long-term workforce resilience.

Study conclusion synthesizing survey results (150 firms) and qualitative policy/workforce analysis to make policy recommendations.

high positive Human Capital and the AI-Powered Future of Work: (Training, ... inclusive economic growth and long-term workforce resilience

The rate of new job creation hinges critically on the immediate implementation of targeted, scalable reskilling programs.

Paper's projections and analysis drawing on the survey of 150 firms and qualitative interviews; presented as a conditional/projection based on current skills gap and training initiatives.

high positive Human Capital and the AI-Powered Future of Work: (Training, ... rate of new job creation

The agentic-specificity classification helps organizations distinguish challenges that require novel approaches from those that are addressable with established practices.

Authors' proposed classification (agentic-specific vs. carried-over/amplified) intended as a practical decision aid; derived from the coding and comparative analysis.

high positive BARRIERS TO AGENTIC AI ENTERPRISE TRANSFORMATION practical utility of agentic-specificity classification

The taxonomy provides a diagnostic framework for identifying priority barrier dimensions and understanding cross-dimensional amplification mechanisms.

Authors present a taxonomy derived from the review and claim it can be used diagnostically by organizations; supported by the coded barrier classification and STS mapping.

high positive BARRIERS TO AGENTIC AI ENTERPRISE TRANSFORMATION usefulness of taxonomy for diagnosis
