Evidence (6491 claims)
| Topic | Claims |
|---|---|
| Adoption | 8570 |
| Productivity | 7631 |
| Governance | 6869 |
| Human-AI Collaboration | 6491 |
| Org Design | 4175 |
| Innovation | 4114 |
| Labor Markets | 3566 |
| Skills & Training | 2966 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
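The four direction columns in the matrix do not always add up to the Total column, and the page does not say why. The short check below, a sketch using four rows copied from the table, computes that gap and the positive and negative shares relative to the stated Total. The helper name `summarize` is illustrative only.

```python
# Selected rows copied from the Evidence Matrix:
# outcome -> (positive, negative, mixed, null, total)
ROWS = {
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Error Rate": (69, 92, 10, 2, 173),
    "Job Displacement": (12, 80, 20, 1, 113),
}

def summarize(rows):
    """Return, per outcome, the gap (Total minus listed directions) and direction shares."""
    out = {}
    for outcome, (pos, neg, mixed, null, total) in rows.items():
        listed = pos + neg + mixed + null
        out[outcome] = {
            "gap": total - listed,           # claims not covered by the four columns
            "positive_share": pos / total,   # share of the stated Total
            "negative_share": neg / total,
        }
    return out

summary = summarize(ROWS)
for outcome, stats in summary.items():
    print(f"{outcome}: gap={stats['gap']}, "
          f"positive={stats['positive_share']:.1%}, "
          f"negative={stats['negative_share']:.1%}")
```

Shares are taken against the stated Total rather than the sum of the listed columns, so rows with a non-zero gap will not sum to 100% across the four directions.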
Human-AI Collaboration
Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage.
Dataset creation procedure and reported coverage claim (200 software applications), taxonomy derived from U.S. GDP data as stated.
Environment creation is framed as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software while producing evidence of correct setup; an independent audit agent verifies evidence against a quality checklist.
Method description of multi-agent pipeline (coding agent + audit agent) in the paper.
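The coding-agent and audit-agent split described above can be pictured as an evidence bundle checked against a checklist. The sketch below is a minimal illustration, not the paper's implementation: the checklist items, the `Evidence` fields, and the application names are all hypothetical, since the excerpt does not list them.

```python
from dataclasses import dataclass, field

# Hypothetical checklist items; the paper's actual checklist is not given in the excerpt.
CHECKLIST = ("setup_script_ran", "real_data_downloaded", "software_configured")

@dataclass
class Evidence:
    """Evidence bundle produced by the coding agent (field names are illustrative)."""
    app: str
    items: dict = field(default_factory=dict)  # checklist item -> proof string

def audit(evidence, checklist=CHECKLIST):
    """Independent audit: return the checklist items that lack a non-empty proof."""
    return [item for item in checklist if not evidence.items.get(item)]

def accept(evidence):
    """Accept the environment only when the audit finds nothing missing."""
    return not audit(evidence)

complete = Evidence("spreadsheet-app", {
    "setup_script_ran": "exit code 0",
    "real_data_downloaded": "sha256 recorded",
    "software_configured": "screenshot of loaded file",
})
partial = Evidence("cad-app", {"setup_script_ran": "exit code 0"})

print(accept(complete), audit(complete))
print(accept(partial), audit(partial))
```

Keeping `audit` separate from the code that produces the evidence mirrors the independence the claim emphasizes: the auditor only sees the evidence, not the agent that generated it.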
We introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment.
Methodological contribution described in paper (framework implementation claimed).
The study introduces 'career reconfiguration' as a framework explaining intra-role task transformation, extending existing career mobility and job transition theories.
Theoretical/conceptual contribution presented in the paper (framework proposition; not an empirical effect).
Mediation analysis confirms that training and organizational support significantly mediate the relationship between AI adoption and career shifts.
Mediation analysis reported in the study (method stated; no mediation coefficients or sample size provided in abstract).
Together, these variables explain 61% of the variance in adaptive outcomes (R² = 0.61).
Multiple regression model summary reported in the paper (R-squared value provided; sample size not stated).
Readiness to change is a significant predictor of career adaptation (beta = 0.298, p = 0.011).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
Openness to technology is a significant predictor of career adaptation (beta = 0.367, p = 0.003).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
Organizational support is a significant predictor of career adaptation (beta = 0.389, p = 0.005).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
Skills training is the strongest predictor of career adaptation (beta = 0.412, p = 0.002).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
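For readers unfamiliar with how a multiple regression yields coefficients and an R² like those reported above, here is a small standard-library sketch of ordinary least squares on synthetic data. The data, weights, and noise level are invented for illustration and are not the study's. The coefficients returned are raw OLS coefficients; the excerpt does not say whether the reported betas are standardized, and standardizing the variables first would be needed to obtain standardized betas.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Fit y = b0 + X b by least squares; return (coefficients, R^2)."""
    rows = [[1.0] + list(r) for r in X]                 # prepend intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)                              # normal equations
    pred = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1.0 - ss_res / ss_tot

# Synthetic data only: weights and noise level are arbitrary, not the study's.
random.seed(0)
n = 200
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(n)]
weights = [0.4, 0.4, 0.3, 0.3]
y = [sum(w * v for w, v in zip(weights, row)) + random.gauss(0, 0.5) for row in X]

coef, r2 = ols(X, y)
print("intercept and slopes:", [round(c, 3) for c in coef])
print("R^2:", round(r2, 3))
```

R² here is one minus the ratio of residual to total sum of squares, which is the sense in which a model "explains" a share of the variance in the outcome.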
SWE-bench alignment: Bench is aligned with SWE-bench-Verified and SWE-bench-Pro.
Paper statement that the constructed benchmark is aligned with SWE-bench-Verified and SWE-bench-Pro (methodological/design alignment described).
Bench contains 495 issues and 1,787 validated design constraints across six repositories.
Reported dataset statistics in paper/abstract: explicit counts of issues (495), validated constraints (1,787), and number of repositories (6).
We construct the DESIGN-AWARE benchmark (Bench) by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier.
Method description in paper: dataset created by mining real-world pull requests, validating constraints, linking constraints to issues, and using an LLM-based verifier to check compliance.
Flowr is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.
Claim of generalizability made by the authors in the paper; presented as an assertion rather than demonstrated through multi-industry empirical tests in the excerpt.
The framework was validated in collaboration with a large-scale supermarket chain.
Claim of field validation stated in the paper; indicates at least one real-world collaboration but provides no further details (e.g., number of stores, duration, metrics) in the excerpt.
Evaluation indicates Flowr enables proactive exception handling at a scale unachievable through manual processes.
Empirical/operational claim based on the paper's evaluation and deployment context; the excerpt asserts this capability but does not provide quantitative performance metrics or comparison details.
Evaluation shows Flowr improves demand–supply alignment.
Empirical claim in the paper's evaluation; reported improvement in demand-supply alignment from deployment or testing with a large supermarket chain, but no numerical metrics provided in the excerpt.
Evaluation demonstrates that Flowr significantly reduces manual coordination overhead.
Empirical claim reported in the paper's evaluation section; the excerpt notes an evaluation and collaboration with a large supermarket chain but provides no sample size figures or quantitative effect sizes.
Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control.
Design/organizational claim describing human-in-the-loop orchestration and MCP interface; asserted in the paper without empirical measures of accountability or control in the excerpt.
To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM.
Technical/design claim in the paper describing model architecture and approach; no evaluation metrics or tests of accuracy/responsibility provided in the excerpt.
Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination.
Architectural claim — asserted mechanism of the framework in the paper; presented as part of the framework design, no quantitative evaluation details in the excerpt.
This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations.
Design and system-proposal claim in the paper; supported by framework description rather than empirical testing in the provided text.
Generative AI helps users solve problems more efficiently.
Motivating empirical observation stated in the paper (no sample or empirical analysis reported in the provided text); assumption used to motivate the theoretical model.
By elucidating the mechanisms and trade-offs inherent in AI-human collaboration, this work lays a robust foundation for future research on adaptive decision systems.
Authors' forward-looking claim in the abstract that their synthesis clarifies mechanisms/trade-offs and thus supports subsequent research; based on their review and framework.
By synthesizing these paradigms, this research advances the theoretical understanding of hybrid decision-making systems and provides actionable insights for organizations navigating complex and AI-driven environments.
Authors' stated contribution based on the conceptual synthesis of the literature and the proposed framework (as reported in the abstract).
The framework introduces four distinct paradigms of AI-human collaborative decision-making: adaptive intuitive decision, programmed algorithmic decision, interpretive analytical decision and integrative hybrid decision.
Authors' conceptual taxonomy reported in the abstract, produced from synthesis of the reviewed literature (627 articles).
We developed a novel conceptual framework that identifies two critical dimensions, AI-human dynamics and decision typologies, that shape decision outcomes.
Authors' reported conceptual synthesis derived from the systematic review/bibliometric analysis of the 627 articles.
Prompts can be treated as decision policies that allocate discretion between researcher and system, governing what is executed and when iteration stops.
Methodological framing advanced by the authors describing prompts as decision policies; conceptual claim based on the paper's analytic framework rather than empirical measurement.
Operational constraints and decision-rule prompts deliver large and stable footprint reductions while preserving decision-equivalent topic outputs.
Experimental comparison of prompt strategies in the benchmarked workflow, reporting reductions in runtime and CO2e and assessing whether topic outputs remain decision-equivalent (asserted in abstract; no numeric reductions or sample sizes provided).
We benchmark a modern economic survey workflow, an LDA-based literature mapping implemented with GenAI-assisted coding and executed in a fixed cloud notebook, measuring runtime and estimated CO2e with CodeCarbon.
Experimental benchmark described in the paper: single implemented workflow (LDA-based literature mapping) executed in a fixed cloud notebook with runtime and CO2e measured using CodeCarbon (methodological claim).
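A measurement harness of the kind described above can be sketched as a wrapper that times a workload and, optionally, asks an emissions tracker for an estimate. This is an assumption-laden illustration, not the paper's code: `measure` and `toy_workload` are hypothetical names, and the workload is a stand-in rather than an LDA pipeline. The only interface assumed of the tracker is `start()` and `stop()`, with `stop()` returning the estimate, which is the pattern CodeCarbon's `EmissionsTracker` follows.

```python
import time

def measure(workload, tracker_factory=None):
    """Run `workload()` once; report wall-clock runtime and, if a tracker is given, its estimate.

    `tracker_factory` is any zero-argument callable returning an object with
    start() and stop() methods, where stop() returns estimated emissions
    (CodeCarbon's EmissionsTracker reports kg CO2eq this way).
    """
    tracker = tracker_factory() if tracker_factory else None
    if tracker:
        tracker.start()
    t0 = time.perf_counter()
    result = workload()
    runtime_s = time.perf_counter() - t0
    emissions_kg = tracker.stop() if tracker else None
    return {"result": result, "runtime_s": runtime_s, "emissions_kg": emissions_kg}

def toy_workload():
    """Stand-in for the literature-mapping step (hypothetical placeholder)."""
    return sum(i * i for i in range(100_000))

# Runtime only; no tracker supplied, so no emissions estimate is produced.
report = measure(toy_workload)
print(report["runtime_s"], report["emissions_kg"])
```

With CodeCarbon installed, passing the tracker class itself should work as the factory: `from codecarbon import EmissionsTracker` followed by `measure(toy_workload, EmissionsTracker)`. Running the same workload under each prompt strategy and comparing the reports is one way to reproduce the kind of comparison the claim describes.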
Training footprint is the largest cluster in the mapped Green AI literature.
Result from the paper's literature mapping / clustering (statement in abstract; no numeric cluster sizes given).
We map the recent Green AI literature into seven themes: training footprint is the largest cluster, while inference efficiency and system-level optimisation are growing rapidly, alongside measurement protocols, green algorithms, governance, and security and efficiency trade-offs.
Bibliometric / thematic mapping of recent Green AI literature described in the paper (method: literature mapping; exact number of papers or mapping procedure not specified in abstract).
We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
Paper states intention and contribution: releasing methodology and lessons to allow replication by other organizations.
We detail data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks to address challenges in constructing reliable evaluation signals from monorepo environments.
Methodological description in paper listing specific practices (LLM-based classification, test relevance validation, multi-run stability checks) aimed at producing reliable evaluation signals in monorepos.
Models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates.
Reported relationship from paper's analysis correlating models' use of verification tools (test execution, static analysis) with higher solve rates across evaluated models.
Systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%.
Empirical evaluation reported in paper: four foundation models were evaluated on the ProdCodeBench benchmark producing reported solve-rate range.
Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages.
Descriptive dataset claim in paper specifying components of each sample and that samples cover seven programming languages.
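One plausible way to combine fail-to-pass tests with the multi-run stability checks mentioned for the same benchmark is to accept a test only if it fails on every run before the committed change and passes on every run after it. The sketch below assumes that reading; the function name, the default of three runs, and the callables standing in for test executions are all hypothetical.

```python
from itertools import cycle

def is_stable_fail_to_pass(run_before, run_after, runs=3):
    """Return True only if the test fails on every run before the change
    and passes on every run after it.

    `run_before` and `run_after` are zero-argument callables returning True on pass.
    """
    before = [run_before() for _ in range(runs)]
    after = [run_after() for _ in range(runs)]
    return not any(before) and all(after)

flaky = cycle([True, False])  # simulated test that alternates pass and fail

print(is_stable_fail_to_pass(lambda: False, lambda: True))          # stable fail-to-pass
print(is_stable_fail_to_pass(lambda: False, lambda: next(flaky)))   # flaky after the change
print(is_stable_fail_to_pass(lambda: True, lambda: True))           # already passing before
```

Requiring agreement across all runs trades some recall for a cleaner signal: a flaky test is dropped even if it would have been a genuine fail-to-pass case.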
We present ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant.
Paper describes methodology and introduces ProdCodeBench explicitly as constructed from real production assistant sessions.
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings.
Argument presented in paper motivating creation of production-derived benchmark; no specific empirical comparison to alternative benchmarks reported in the abstract.
A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution.
Incident ISS-004 report in the paper giving specific timings for detection latency (10 minutes), user exposure (zero), and resolution (80 minutes).
The multi-agent approach improved reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication.
Incident detection reported in the SF2Bench deployment where audited handoffs prevented publication of a coordinate transformation error that would have affected all 2,452 stations.
The multi-agent approach improved efficiency — the SF2Bench deployment was completed by a single operator in two days with repeated artifact reuse across deployments.
Operational report from the production deployment: single operator completion time of two days and reuse of artifacts across deployments as stated in the paper.
SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow.
Reported dataset composition and use in the paper: SF2Bench with stated counts and temporal span used to validate the multi-agent workflow.
EnviSmart treats reliability as an architectural property through two mechanisms: (1) a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and (2) a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps.
System architecture and design description in the paper; presented as the core reliability mechanisms implemented in EnviSmart.
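The idea of deterministic validators and audited handoffs restoring fail-stop behaviour before an irreversible step can be illustrated with a small sketch. Everything here is hypothetical: the function names, the record layout, and the station coordinates are invented for illustration, and the range check is just one example of a deterministic validator, not EnviSmart's actual rule.

```python
class HandoffRejected(Exception):
    """Raised when a validator fails; nothing downstream runs (fail-stop)."""

def valid_coordinates(station):
    """Deterministic check: latitude and longitude lie in their legal ranges."""
    return -90.0 <= station["lat"] <= 90.0 and -180.0 <= station["lon"] <= 180.0

def audited_handoff(stations, validators, audit_log):
    """Pass `stations` across a trust boundary only if every validator accepts every record."""
    for validator in validators:
        bad = [s["id"] for s in stations if not validator(s)]
        audit_log.append({"validator": validator.__name__, "rejected": bad})
        if bad:
            raise HandoffRejected(f"{validator.__name__} rejected {len(bad)} record(s)")
    return stations

def publish(stations):
    """Stand-in for the irreversible publication step (hypothetical)."""
    return [s["id"] for s in stations]

log = []
good = [{"id": "S1", "lat": 29.7, "lon": -95.4}]            # made-up station
broken = [{"id": "S1", "lat": 3287000.0, "lon": 270000.0}]  # projected metres left in degree fields

published = publish(audited_handoff(good, [valid_coordinates], log))
try:
    publish(audited_handoff(broken, [valid_coordinates], log))
    blocked = False
except HandoffRejected:
    blocked = True

print(published, blocked)
print(log)
```

Because the validator raises before `publish` is called, a systematic error such as a bad coordinate transformation stops at the boundary, and the audit log records which check rejected which records.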
We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research.
System description and statement of deployment in the paper; presented as a production deployment (no randomized evaluation reported).
Embedding LLM-driven agents into environmental FAIR data management can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions.
Conceptual / argumentative claim made in the paper as a motivation for the system; no quantitative experiment tied to this statement in the excerpt.
Overcoming the structural skill deficit through deliberate investment in tertiary education reform and strong private-public partnerships for continuous vocational learning is mandatory for Nigeria to successfully leverage the AI revolution for inclusive economic growth and ensure long-term workforce resilience.
Study conclusion synthesizing survey results (150 firms) and qualitative policy/workforce analysis to make policy recommendations.
The rate of new job creation hinges critically on the immediate implementation of targeted, scalable reskilling programs.
Paper's projections and analysis drawing on the survey of 150 firms and qualitative interviews; presented as a conditional/projection based on current skills gap and training initiatives.
The agentic-specificity classification helps organizations distinguish challenges that require novel approaches from those that are addressable with established practices.
Authors' proposed classification (agentic-specific vs. carried-over/amplified) intended as a practical decision aid; derived from the coding and comparative analysis.
The taxonomy provides a diagnostic framework for identifying priority barrier dimensions and understanding cross-dimensional amplification mechanisms.
Authors present a taxonomy derived from the review and claim it can be used diagnostically by organizations; supported by the coded barrier classification and STS mapping.