The Commonplace

Evidence (6491 claims)

Adoption: 8570 claims
Productivity: 7631 claims
Governance: 6869 claims
Human-AI Collaboration: 6491 claims
Org Design: 4175 claims
Innovation: 4114 claims
Labor Markets: 3566 claims
Skills & Training: 2966 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 758 199 100 900 2007
Governance & Regulation 826 400 191 122 1563
Organizational Efficiency 777 193 124 84 1189
Technology Adoption Rate 635 233 124 97 1098
Research Productivity 422 128 57 336 954
Output Quality 476 179 59 47 761
Decision Quality 328 177 81 47 640
Firm Productivity 435 57 88 20 606
AI Safety & Ethics 218 277 65 33 599
Market Structure 180 170 123 24 502
Task Allocation 213 64 72 33 387
Skill Acquisition 170 61 61 17 309
Innovation Output 203 27 43 18 292
Employment Level 105 54 107 13 281
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 117 63 42 11 233
Firm Revenue 153 48 26 3 230
Task Completion Time 173 31 8 12 225
Inequality Measures 44 122 49 6 221
Worker Satisfaction 89 65 22 12 188
Error Rate 69 92 10 2 173
Regulatory Compliance 77 69 14 5 165
Automation Exposure 56 56 26 13 154
Training Effectiveness 94 21 13 19 149
Wages & Compensation 77 36 25 6 144
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 80 20 1 113
Hiring & Recruitment 52 7 8 3 70
Creative Output 31 18 8 3 61
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
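As a reading aid, the direction counts above can be turned into shares. The sketch below is illustrative only: it copies three rows from the matrix, and the helper name `direction_shares` is an assumption, not something the site provides. Shares are computed against the sum of the four listed directions rather than the Total column, because the two do not always agree (Governance & Regulation lists 1563, while its four counts add to 1539).

```python
from typing import Dict, Tuple

# Rows copied from the Evidence Matrix: (positive, negative, mixed, null, total).
ROWS: Dict[str, Tuple[int, int, int, int, int]] = {
    "Governance & Regulation": (826, 400, 191, 122, 1563),
    "Firm Productivity": (435, 57, 88, 20, 606),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(row: Tuple[int, int, int, int, int]) -> Dict[str, float]:
    """Return each direction's share of the four listed counts (not of Total)."""
    counts = row[:4]
    listed = sum(counts)
    names = ("positive", "negative", "mixed", "null")
    return {name: count / listed for name, count in zip(names, counts)}


for outcome, row in ROWS.items():
    shares = direction_shares(row)
    # Claims counted in Total but not in any of the four direction columns.
    unlisted = row[4] - sum(row[:4])
    print(
        f"{outcome}: positive {shares['positive']:.1%}, "
        f"negative {shares['negative']:.1%}, unlisted {unlisted}"
    )
```

Why some totals exceed the four listed counts is not stated on the page, so the `unlisted` figure is reported rather than explained.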
Filter: Human-AI Collaboration
Using a taxonomy of economically valuable occupations grounded in U.S. GDP data, we apply this pipeline to 200 software applications with broad occupational coverage.
Dataset creation procedure and reported coverage claim (200 software applications), taxonomy derived from U.S. GDP data as stated.
high | positive | Gym-Anything: Turn any Software into an Agent Environment | number of software applications covered and occupational coverage
Environment creation is framed as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software while producing evidence of correct setup; an independent audit agent verifies evidence against a quality checklist.
Method description of multi-agent pipeline (coding agent + audit agent) in the paper.
high | positive | Gym-Anything: Turn any Software into an Agent Environment | reliability/validity of environment setup via multi-agent workflow
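The coding-agent and audit-agent split described in this claim can be pictured with a small sketch. Everything here is hypothetical: the `SetupEvidence` shape, the checklist items, and the function names are assumptions for illustration, since the excerpt does not give the paper's actual checklist or interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SetupEvidence:
    """Evidence a coding agent reports after configuring an application (assumed shape)."""
    app_name: str
    # Maps a checklist item to its proof, e.g. a log line or a file path.
    artifacts: Dict[str, str] = field(default_factory=dict)


# Hypothetical checklist; the real quality checklist is not given in the excerpt.
CHECKLIST: List[str] = ["app_launches", "real_data_loaded", "setup_script_rerunnable"]


def audit(evidence: SetupEvidence, checklist: List[str]) -> List[str]:
    """Independent audit step: return checklist items with no non-empty evidence."""
    return [item for item in checklist if not evidence.artifacts.get(item, "").strip()]


def accept_environment(evidence: SetupEvidence, checklist: List[str]) -> bool:
    """Accept the environment only when the audit finds nothing missing."""
    return not audit(evidence, checklist)


evidence = SetupEvidence(
    app_name="example-app",
    artifacts={"app_launches": "window title captured", "real_data_loaded": ""},
)
print(audit(evidence, CHECKLIST))
print(accept_environment(evidence, CHECKLIST))
```

Keeping the audit as a separate function that sees only the evidence, not the setup process, mirrors the independence between the two agents that the claim emphasizes.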
We introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment.
Methodological contribution described in paper (framework implementation claimed).
high | positive | Gym-Anything: Turn any Software into an Agent Environment | availability of a general framework for environment creation
The study introduces 'career reconfiguration' as a framework explaining intra-role task transformation, extending existing career mobility and job transition theories.
Theoretical/conceptual contribution presented in the paper (framework proposition; not an empirical effect).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | theoretical framing of intra-role task transformation (career reconfiguration)
Mediation analysis confirms that training and organizational support significantly mediate the relationship between AI adoption and career shifts.
Mediation analysis reported in the study (method stated; no mediation coefficients or sample size provided in abstract).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | career shifts (mediated effect of training and organizational support on relatio...
Together, these variables explain 61% of the variance in adaptive outcomes (R² = 0.61).
Multiple regression model summary reported in the paper (R-squared value provided; sample size not stated).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | variance explained in adaptive outcomes (career adaptation)
Readiness to change is a significant predictor of career adaptation (beta = 0.298, p = 0.011).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | career adaptation / adaptive outcomes
Openness to technology is a significant predictor of career adaptation (beta = 0.367, p = 0.003).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | career adaptation / adaptive outcomes
Organizational support is a significant predictor of career adaptation (beta = 0.389, p = 0.005).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | career adaptation / adaptive outcomes
Skills training is the strongest predictor of career adaptation (beta = 0.412, p = 0.002).
Multiple regression analysis reported in the paper (predictors of career adaptation; sample size not stated).
high | positive | Artificial Intelligence Adoption and Career Reconfiguration ... | career adaptation / adaptive outcomes
SWE-bench alignment: Bench is aligned with SWE-bench-Verified and SWE-bench-Pro.
Paper statement that the constructed benchmark is aligned with SWE-bench-Verified and SWE-bench-Pro (methodological/design alignment described).
Bench contains 495 issues and 1,787 validated design constraints across six repositories.
Reported dataset statistics in paper/abstract: explicit counts of issues (495), validated constraints (1,787), and number of repositories (6).
We construct the DESIGN-AWARE benchmark (Bench) by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier.
Method description in paper: dataset created by mining real-world pull requests, validating constraints, linking constraints to issues, and using an LLM-based verifier to check compliance.
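The linking and compliance-checking steps in this description might be organized roughly as below. This is a sketch under stated assumptions: the `Constraint` fields, the `links` mapping, and `toy_verifier` are all hypothetical, and the toy verifier is a plain substring check standing in for the LLM-based verifier, whose prompt and interface the excerpt does not describe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Constraint:
    constraint_id: str
    description: str
    # Used only by the toy verifier below; not a field the paper describes.
    forbidden_snippet: str


# A verifier takes (patch text, constraint) and returns True when the patch complies.
Verifier = Callable[[str, Constraint], bool]


def toy_verifier(patch: str, constraint: Constraint) -> bool:
    """Stand-in for an LLM verifier: comply if the forbidden snippet is absent."""
    return constraint.forbidden_snippet not in patch


def compliance_report(
    issue_id: str,
    patch: str,
    links: Dict[str, List[Constraint]],
    verifier: Verifier,
) -> Dict[str, bool]:
    """Check a patch against every constraint linked to its issue."""
    return {c.constraint_id: verifier(patch, c) for c in links.get(issue_id, [])}


links = {
    "issue-1": [
        Constraint("C1", "Do not call print for logging", "print("),
        Constraint("C2", "Do not use bare except", "except:"),
    ]
}
patch = "logger.info('done')\ntry:\n    run()\nexcept:\n    pass\n"
print(compliance_report("issue-1", patch, links, toy_verifier))
```

Passing the verifier as a callable keeps the linking logic independent of how compliance is judged, so a model-backed verifier could replace the toy one without changing the report structure.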
Flowr is domain-independent, offering a generalizable blueprint for agentic AI-driven supply chain automation across large-scale enterprise settings.
Claim of generalizability made by the authors in the paper; presented as an assertion rather than demonstrated through multi-industry empirical tests in the excerpt.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | generalizability / applicability across domains
The framework was validated in collaboration with a large-scale supermarket chain.
Claim of field validation stated in the paper; indicates at least one real-world collaboration but provides no further details (e.g., number of stores, duration, metrics) in the excerpt.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | field validation / real-world deployment
Evaluation indicates Flowr enables proactive exception handling at a scale unachievable through manual processes.
Empirical/operational claim based on the paper's evaluation and deployment context; the excerpt asserts this capability but does not provide quantitative performance metrics or comparison details.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | proactive exception handling capability and scale
Evaluation shows Flowr improves demand–supply alignment.
Empirical claim in the paper's evaluation; reported improvement in demand-supply alignment from deployment or testing with a large supermarket chain, but no numerical metrics provided in the excerpt.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | demand–supply alignment
Evaluation demonstrates that Flowr significantly reduces manual coordination overhead.
Empirical claim reported in the paper's evaluation section; the excerpt notes an evaluation and collaboration with a large supermarket chain but provides no sample size figures or quantitative effect sizes.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | manual coordination overhead (effort/time/coordination burden)
Central to the framework is a human-in-the-loop orchestration model in which supply chain managers supervise and intervene across workflow stages via a Model Context Protocol (MCP)-enabled interface, preserving accountability and organizational control.
Design/organizational claim describing human-in-the-loop orchestration and MCP interface; asserted in the paper without empirical measures of accountability or control in the excerpt.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | preservation of accountability and organizational control during automation
To ensure task accuracy and adherence to responsible AI principles, the framework employs a consortium of fine-tuned, domain-specialized large language models coordinated by a central reasoning LLM.
Technical/design claim in the paper describing model architecture and approach; no evaluation metrics or tests of accuracy/responsibility provided in the excerpt.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | task accuracy and adherence to responsible AI principles
Flowr systematically decomposes manual supply chain operations into specialized AI agents, each responsible for a clearly defined cognitive role, enabling automation of processes previously dependent on continuous human coordination.
Architectural claim — asserted mechanism of the framework in the paper; presented as part of the framework design, no quantitative evaluation details in the excerpt.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | task decomposition and automation of previously human-coordinated processes
This paper introduces Flowr, a novel agentic AI framework for automating end-to-end retail supply chain workflows in large-scale supermarket operations.
Design and system-proposal claim in the paper; supported by framework description rather than empirical testing in the provided text.
high | positive | Flowr -- Scaling Up Retail Supply Chain Operations Through A... | ability to automate end-to-end supply chain workflows (task allocation to AI)
Generative AI helps users solve problems more efficiently.
Motivating empirical observation stated in the paper (no sample or empirical analysis reported in the provided text); assumption used to motivate the theoretical model.
high | positive | When AI Improves Answers but Slows Knowledge Creation: Match... | problem-solving efficiency (implicit)
By elucidating the mechanisms and trade-offs inherent in AI-human collaboration, this work lays a robust foundation for future research on adaptive decision systems.
Authors' forward-looking claim in the abstract that their synthesis clarifies mechanisms/trade-offs and thus supports subsequent research; based on their review and framework.
high | positive | Advancing Decision-Making through AI-Human Collaboration: A ... | foundation for future research on adaptive decision systems
By synthesizing these paradigms, this research advances the theoretical understanding of hybrid decision-making systems and provides actionable insights for organizations navigating complex and AI-driven environments.
Authors' stated contribution based on the conceptual synthesis of the literature and the proposed framework (as reported in the abstract).
high | positive | Advancing Decision-Making through AI-Human Collaboration: A ... | theoretical advancement and provision of actionable organizational insights
The framework introduces four distinct paradigms of AI-human collaborative decision-making: adaptive intuitive decision, programmed algorithmic decision, interpretive analytical decision and integrative hybrid decision.
Authors' conceptual taxonomy reported in the abstract, produced from synthesis of the reviewed literature (627 articles).
high | positive | Advancing Decision-Making through AI-Human Collaboration: A ... | classification of AI-human collaborative decision-making into four paradigms
We developed a novel conceptual framework that identifies two critical dimensions, AI-human dynamics and decision typologies, that shape decision outcomes.
Authors' reported conceptual synthesis derived from the systematic review/bibliometric analysis of the 627 articles.
high | positive | Advancing Decision-Making through AI-Human Collaboration: A ... | identification of critical dimensions affecting decision outcomes
Prompts can be treated as decision policies that allocate discretion between researcher and system, governing what is executed and when iteration stops.
Methodological framing advanced by the authors describing prompts as decision policies; conceptual claim based on the paper's analytic framework rather than empirical measurement.
high | positive | On the Carbon Footprint of Economic Research in the Age of G... | conceptualization of prompts' role in workflow control and decision allocation
Operational-constraint and decision-rule prompts deliver large and stable footprint reductions while preserving decision-equivalent topic outputs.
Experimental comparisons of prompt strategies in the benchmarked workflow showing reductions in runtime/CO2e and evaluated topic outputs' decision-equivalence (asserted in abstract; no numeric reductions or sample sizes provided).
high | positive | On the Carbon Footprint of Economic Research in the Age of G... | carbon footprint / runtime reductions and preservation of topic output equivalen...
We benchmark a modern economic survey workflow, an LDA-based literature mapping implemented with GenAI-assisted coding and executed in a fixed cloud notebook, measuring runtime and estimated CO2e with CodeCarbon.
Experimental benchmark described in the paper: single implemented workflow (LDA-based literature mapping) executed in a fixed cloud notebook with runtime and CO2e measured using CodeCarbon (methodological claim).
high | positive | On the Carbon Footprint of Economic Research in the Age of G... | runtime and estimated CO2e (carbon footprint) of the benchmarked workflow
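To make the two measured quantities concrete, the sketch below times a workload and converts the runtime into a rough CO2e figure. It deliberately does not use CodeCarbon: it is a simplified stand-in based on the textbook relation energy = power x time, multiplied by a grid carbon intensity. The function names and the power and intensity values are placeholder assumptions, not figures from the paper.

```python
import time
from typing import Callable, Tuple


def estimate_co2e_kg(runtime_s: float, avg_power_w: float, grid_kg_per_kwh: float) -> float:
    """Estimate CO2e as energy (kWh) times grid carbon intensity (kg CO2e per kWh)."""
    energy_kwh = avg_power_w * runtime_s / 3_600_000  # watt-seconds (joules) to kWh
    return energy_kwh * grid_kg_per_kwh


def measure(
    workload: Callable[[], object],
    avg_power_w: float,
    grid_kg_per_kwh: float,
) -> Tuple[float, float]:
    """Run the workload once; return (runtime in seconds, estimated kg CO2e)."""
    start = time.perf_counter()
    workload()
    runtime_s = time.perf_counter() - start
    return runtime_s, estimate_co2e_kg(runtime_s, avg_power_w, grid_kg_per_kwh)


def toy_workload() -> int:
    """Placeholder for the benchmarked pipeline."""
    return sum(i * i for i in range(200_000))


# Placeholder parameters: 100 W average draw, 0.4 kg CO2e per kWh.
runtime_s, co2e_kg = measure(toy_workload, avg_power_w=100.0, grid_kg_per_kwh=0.4)
print(f"runtime: {runtime_s:.3f} s, estimated CO2e: {co2e_kg:.3e} kg")
```

Because the estimate is linear in runtime for fixed power and grid intensity, any prompt strategy that shortens execution lowers the estimated footprint proportionally under these assumptions.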
Training footprint is the largest cluster in the mapped Green AI literature.
Result from the paper's literature mapping / clustering (statement in abstract; no numeric cluster sizes given).
high | positive | On the Carbon Footprint of Economic Research in the Age of G... | relative prevalence (cluster size) of 'training footprint' theme
We map the recent Green AI literature into seven themes: training footprint is the largest cluster, while inference efficiency and system-level optimisation are growing rapidly, alongside measurement protocols, green algorithms, governance, and security and efficiency trade-offs.
Bibliometric / thematic mapping of recent Green AI literature described in the paper (method: literature mapping; exact number of papers or mapping procedure not specified in abstract).
high | positive | On the Carbon Footprint of Economic Research in the Age of G... | distribution of themes within Green AI literature (theme prevalence and growth)
We share our methodology and lessons learned to enable other organizations to construct similar production-derived benchmarks.
Paper states intention and contribution: releasing methodology and lessons to allow replication by other organizations.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | ability of other organizations to construct similar benchmarks
We detail data collection and curation practices including LLM-based task classification, test relevance validation, and multi-run stability checks to address challenges in constructing reliable evaluation signals from monorepo environments.
Methodological description in paper listing specific practices (LLM-based classification, test relevance validation, multi-run stability checks) aimed at producing reliable evaluation signals in monorepos.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | reliability of evaluation signals derived from monorepo environments
Models making greater use of work validation tools, such as executing tests and invoking static analysis, achieve higher solve rates.
Reported relationship from paper's analysis correlating models' use of verification tools (test execution, static analysis) with higher solve rates across evaluated models.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | solve rate (task success) as a function of verification tool usage
Systematic analysis of four foundation models yields solve rates from 53.2% to 72.2%.
Empirical evaluation reported in paper: four foundation models were evaluated on the ProdCodeBench benchmark producing reported solve-rate range.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | solve rate (task success rate)
Each curated sample consists of a verbatim prompt, a committed code change and fail-to-pass tests spanning seven programming languages.
Descriptive dataset claim in paper specifying components of each sample and that samples cover seven programming languages.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | dataset composition (prompt, code change, tests) and language coverage (7 langua...
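The term "fail-to-pass tests" usually refers to tests that fail before the committed change and pass after it. The sketch below illustrates that convention and how a solve rate could be derived from it. The function names and the rule that every fail-to-pass test must pass are assumptions; the excerpt does not spell out the benchmark's exact scoring procedure.

```python
from typing import Dict, List


def fail_to_pass_tests(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tests that failed before the reference change and pass after it."""
    return sorted(
        name for name, passed in after.items()
        if passed and before.get(name) is False
    )


def is_solved(f2p: List[str], candidate_results: Dict[str, bool]) -> bool:
    """Assumed rule: a candidate patch solves the task if all fail-to-pass tests pass."""
    return bool(f2p) and all(candidate_results.get(name, False) for name in f2p)


def solve_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks solved; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


before = {"test_parse": False, "test_render": True, "test_save": False}
after = {"test_parse": True, "test_render": True, "test_save": True}
f2p = fail_to_pass_tests(before, after)
print(f2p)

candidate = {"test_parse": True, "test_render": True, "test_save": False}
print(is_solved(f2p, candidate))
print(solve_rate([True, False, True, True]))
```

Tests that pass both before and after (here `test_render`) are excluded from the fail-to-pass set, since they carry no signal about whether the change was made.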
We present ProdCodeBench, a benchmark built from real sessions with a production AI coding assistant.
Paper describes methodology and introduces ProdCodeBench explicitly as constructed from real production assistant sessions.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | existence and provenance of benchmark (production-derived dataset)
Benchmarks that reflect production workloads are better for evaluating AI coding agents in industrial settings.
Argument presented in paper motivating creation of production-derived benchmark; no specific empirical comparison to alternative benchmarks reported in the abstract.
high | positive | ProdCodeBench: A Production-Derived Benchmark for Evaluating... | quality of evaluation for AI coding agents (suitability of benchmark)
A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution.
Incident ISS-004 report in the paper giving specific timings for detection latency (10 minutes), user exposure (zero), and resolution (80 minutes).
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | incident detection latency, user exposure, and time-to-resolution
The multi-agent approach improved reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication.
Incident detection reported in the SF2Bench deployment where audited handoffs prevented publication of a coordinate transformation error that would have affected all 2,452 stations.
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | detection/blocking of a systemic coordinate transformation error (error preventi...
The multi-agent approach improved efficiency — the SF2Bench deployment was completed by a single operator in two days with repeated artifact reuse across deployments.
Operational report from the production deployment: single operator completion time of two days and reuse of artifacts across deployments as stated in the paper.
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | time to complete deployment (task completion time) and operator effort
SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow.
Reported dataset composition and use in the paper: SF2Bench with stated counts and temporal span used to validate the multi-agent workflow.
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | scale and temporal coverage of benchmark used to validate workflow (stations, fi...
EnviSmart treats reliability as an architectural property through two mechanisms: (1) a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and (2) a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps.
System architecture and design description in the paper; presented as the core reliability mechanisms implemented in EnviSmart.
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | architectural approach to reliability (design features implemented)
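The second mechanism, deterministic validators plus audited handoffs that stop the pipeline before an irreversible step, can be sketched as follows. This is not EnviSmart's code: `Station`, `check_coordinates`, `audited_handoff`, and `HandoffRejected` are hypothetical names, and the range check is only one simple example of a deterministic validator.

```python
from typing import Callable, List, Tuple

Station = Tuple[str, float, float]  # (station_id, latitude, longitude)
Validator = Callable[[List[Station]], List[str]]


class HandoffRejected(Exception):
    """Raised when a validator finds problems, so nothing downstream runs."""


def check_coordinates(stations: List[Station]) -> List[str]:
    """Deterministic validator: latitude in [-90, 90], longitude in [-180, 180]."""
    return [
        f"{sid}: out-of-range coordinates ({lat}, {lon})"
        for sid, lat, lon in stations
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0)
    ]


def audited_handoff(
    stations: List[Station],
    validators: List[Validator],
    audit_log: List[Tuple[str, int]],
) -> List[Station]:
    """Run every validator, record the outcome, and fail-stop on the first problem."""
    for validator in validators:
        problems = validator(stations)
        audit_log.append((validator.__name__, len(problems)))
        if problems:
            raise HandoffRejected("; ".join(problems))
    return list(stations)  # safe to pass to the irreversible step


log: List[Tuple[str, int]] = []
good = [("S1", 29.7, -95.4)]
swapped = [("S1", -95.4, 29.7)]  # latitude and longitude swapped

print(audited_handoff(good, [check_coordinates], log))
try:
    audited_handoff(swapped, [check_coordinates], log)
except HandoffRejected as err:
    print(f"blocked before publication: {err}")
print(log)
```

A range check like this catches a swap only when a value falls outside the valid bounds, so a real deployment would presumably combine several validators. The point of the sketch is the control flow: the exception prevents the publish step from ever running on unvalidated data.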
We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research.
System description and statement of deployment in the paper; presented as a production deployment (no randomized evaluation reported).
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | existence and production deployment of EnviSmart
Embedding LLM-driven agents into environmental FAIR data management can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions.
Conceptual / argumentative claim made in the paper as a motivation for the system; no quantitative experiment tied to this statement in the excerpt.
high | positive | Exploring Robust Multi-Agent Workflows for Environmental Dat... | ability to externalize operational knowledge and scale curation
Overcoming the structural skill deficit through deliberate investment in tertiary education reform and strong private-public partnerships for continuous vocational learning is mandatory for Nigeria to successfully leverage the AI revolution for inclusive economic growth and ensure long-term workforce resilience.
Study conclusion synthesizing survey results (150 firms) and qualitative policy/workforce analysis to make policy recommendations.
high | positive | Human Capital and the AI-Powered Future of Work: (Training, ... | inclusive economic growth and long-term workforce resilience
The rate of new job creation hinges critically on the immediate implementation of targeted, scalable reskilling programs.
Paper's projections and analysis drawing on the survey of 150 firms and qualitative interviews; presented as a conditional/projection based on current skills gap and training initiatives.
The agentic-specificity classification helps organizations distinguish challenges that require novel approaches from those that are addressable with established practices.
Authors' proposed classification (agentic-specific vs. carried-over/amplified) intended as a practical decision aid; derived from the coding and comparative analysis.
high | positive | BARRIERS TO AGENTIC AI ENTERPRISE TRANSFORMATION | practical_utility_of_agentic_specificity_classification
The taxonomy provides a diagnostic framework for identifying priority barrier dimensions and understanding cross-dimensional amplification mechanisms.
Authors present a taxonomy derived from the review and claim it can be used diagnostically by organizations; supported by the coded barrier classification and STS mapping.
high | positive | BARRIERS TO AGENTIC AI ENTERPRISE TRANSFORMATION | usefulness_of_taxonomy_for_diagnosis