The Commonplace

Evidence (6574 claims)

Adoption: 8625 claims
Productivity: 7686 claims
Governance: 6917 claims
Human-AI Collaboration: 6574 claims
Org Design: 4189 claims
Innovation: 4131 claims
Labor Markets: 3588 claims
Skills & Training: 2985 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 761 200 101 904 2020
Governance & Regulation 829 400 191 122 1566
Organizational Efficiency 784 193 125 84 1197
Technology Adoption Rate 637 236 124 97 1103
Research Productivity 431 131 58 340 972
Output Quality 481 183 59 47 770
Decision Quality 332 177 82 49 647
Firm Productivity 439 57 88 20 610
AI Safety & Ethics 218 279 66 33 602
Market Structure 181 170 123 24 503
Task Allocation 214 64 72 33 388
Skill Acquisition 174 62 62 17 315
Innovation Output 204 27 45 18 295
Employment Level 105 54 108 13 282
Fiscal & Macroeconomic 132 69 43 26 277
Consumer Welfare 117 63 42 11 233
Firm Revenue 154 48 26 3 231
Task Completion Time 173 31 8 12 225
Inequality Measures 44 123 50 6 223
Worker Satisfaction 89 65 22 12 188
Error Rate 71 92 10 2 175
Regulatory Compliance 77 69 14 5 165
Automation Exposure 58 56 26 13 156
Training Effectiveness 96 21 14 19 152
Wages & Compensation 77 37 25 6 145
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 81 21 1 115
Hiring & Recruitment 52 7 8 3 70
Creative Output 32 20 8 3 64
Skill Obsolescence 5 47 6 1 59
Social Protection 28 16 8 2 54
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Human-AI Collaboration
We developed a triadic collaboration system to support K-12 writing learning that coordinates LLMs, teachers, and students.
Methodological claim stated in the abstract that the authors designed and developed a triadic collaboration system for K-12 writing learning; presumably implemented and evaluated using the dataset.
high positive Double-Edged Sword or Sharp Tool? Designing and Evaluating T... presence and functionality of the triadic collaboration system
Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.
Statement of planned/ongoing work in the paper regarding future benchmark inclusion to address bias and human-centered use cases; no empirical results provided.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... planned incorporation of bias-aware benchmarks and human-centered use case consi...
Implemented tests include causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes.
Specific list of implemented test categories provided in the paper; descriptive/reporting evidence from the initiative's work.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... types/categories of tests implemented
Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion.
Paper reports that a set of tests has been implemented and applied to AI tools across qualitative and quantitative modeling and discussion; no sample sizes or numeric evaluation results provided in the excerpt.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and application of implemented evaluation tests across types of modeli...
A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests.
Organizational description in the paper specifying roles (steering group and technical group); no quantitative evaluation reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... organizational roles for benchmark prioritization and implementation
The open-source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly.
Descriptive statement about the open-source project hosted by the initiative; no empirical measures of transparency or contribution sharing provided.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... transparency and breadth of contributions enabled by the open source sd ai proje...
The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation.
Descriptive claim in the paper about organizational approach (open infrastructure and collaborative evaluation); no empirical testing or sample size reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... use of open infrastructure for collaborative evaluation
The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices.
Descriptive statement about the Initiative's stated aims and purpose in the paper; organizational description rather than empirical evidence.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and purpose of the BEAMS Initiative (benchmarking for responsible/ethi...
Tools that can automate aspects of modeling practice must complement human expertise, not replace it.
Normative claim made in the paper (argument about human-centered design); no empirical evidence or sample size reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... relationship between automated modeling tools and human expertise (complementari...
AI tools to support real-world decision making must be able to build simulation models that inform their recommendations and render them interpretable.
Normative assertion in the paper (position statement / requirement); no empirical study or sample size reported.
high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... ability of AI tools to build interpretable simulation models that inform recomme...
The agentic future is not predetermined; leaders must both skate to where the puck is going and actively steer it toward a good place, ensuring innovation delivers welfare gains felt by businesses and consumers around the world.
Normative recommendation offered by the authors; based on conceptual argument and interpretation of the framework rather than empirical testing in the excerpt.
high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... policy/leadership influence on welfare distribution of AI-driven innovation
These complementary investments produce the familiar 'productivity J-curve' of general-purpose technologies.
Stated as an economic analogy/claim drawing on general-purpose technology literature; presented as an asserted mechanism rather than shown with new empirical estimates in the excerpt.
high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... productivity trajectory (J-curve) following complementary investments
The most consequential disruption resides in the third stage (Reconstruction) where workflows and markets are rebuilt around delegation, machine-to-machine interaction, continuous monitoring, and auditable constraints.
Theoretical claim in the paper backed by conceptual reasoning and illustrative sector examples; no quantitative evidence provided in the excerpt.
high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... magnitude/importance of disruption arising from Reconstruction-stage changes
The system preserves human agency via override mechanisms.
Design description of the collaborative forecasting system that explicitly includes override controls for human users.
high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... preservation of human agency (ability to override algorithmic forecasts)
The paper provides a rigorous blueprint for designing synergistic, trustworthy, and diagnostic operational planning tools, contributing to the discourse on human-AI collaboration and sustainable information systems (IS).
Stated contribution in the paper's conclusions: presentation of a blueprint and implications for human-AI collaboration and sustainable IS.
high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... guidance/blueprint for operational planning tool design
Two think-aloud sessions show that human judgment remains critical for high-uncertainty events.
Qualitative evaluation consisting of two think-aloud sessions reported in the paper.
high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... importance/role of human judgment in handling high-uncertainty forecasting event...
Algorithmic benchmarking reduced forecast errors by 30% over naive baselines.
Quantitative algorithmic benchmarking reported in the evaluation section of the paper (comparison vs. naive baselines).
Because reputation-based, ex post sanctions cannot be relied upon for dissociative agents, governance should shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
Prescriptive recommendation derived from the theoretical critique of identity-based governance; paper proposes observability- and protocol-focused alternatives but does not present empirical tests or trials.
high positive Dissociative Identity: Language Model Agents Lack Grounding ... governance effectiveness of observability-based, ex ante protocol mechanisms
Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility.
Conceptual/theoretical argument presented in the paper drawing on reputation theory and social signaling; no empirical sample or quantitative data reported.
high positive Dissociative Identity: Language Model Agents Lack Grounding ... trustworthy behavior (sustaining equilibrium of trust)
Teams interacting with more embodied agents display conversational patterns that more closely resemble human–human dialogue.
Conversational analysis comparing dialogue patterns across teams interacting with different embodiment levels; the abstract reports greater similarity to human–human dialogue for teams with higher embodiment agents, but does not provide the similarity metric values or sample sizes.
high positive Teaming Up with Artificial Agents in Non-routine Analytical ... conversational pattern similarity to human–human dialogue
Human-only teams are more likely to complete all tasks successfully (higher task completion success) than mixed human–AI teams.
Comparison of task completion success between human-only teams and mixed teams in the escape room experiment as reported in the paper; no numerical completion rates provided in the abstract.
high positive Teaming Up with Artificial Agents in Non-routine Analytical ... task completion / success rate
Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
Synthesis conclusion based on RADAR deployment results, telemetry (535K+ reviewed diffs, 331K+ landed), and comparative analyses (before-after and difference-in-differences) reported in the paper.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... reduction in review bottlenecks and preservation of production safety
RADAR reduces median diff review wall time by 35%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median diff review wall time reduction reported as 35%. Sample likely drawn from RADAR telemetry (535K+ diffs) though not explicitly stated for this metric in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median diff review wall time
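For readers unfamiliar with the difference-in-differences analysis named in this entry, the following is a minimal sketch of the idea. It is not the paper's implementation: the function name, the use of group medians rather than a regression, and every number are assumptions invented for illustration.

```python
from statistics import median


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group's median minus change in the control group's median.

    Simplification (assumption): a real analysis would usually fit a regression
    with controls; this only shows the arithmetic of the estimator.
    """
    treated_change = median(treated_after) - median(treated_before)
    control_change = median(control_after) - median(control_before)
    return treated_change - control_change


# Hypothetical review wall times in hours; NOT data from the paper.
treated_before = [10.0, 12.0, 14.0]
treated_after = [6.0, 7.0, 9.0]
control_before = [10.0, 11.0, 13.0]
control_after = [9.0, 10.5, 12.0]

effect = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(effect)  # -4.5 (hours), for these made-up inputs
```

Subtracting the control group's change removes any trend that affected both groups, so the remainder is attributed to the treatment, provided the two groups would otherwise have moved in parallel.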
RADAR reduces median time to close by over 330%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median time-to-close reduction reported as 'over 330%'. Underlying sample for efficiency analysis likely from the RADAR telemetry (535K+ diffs), though the excerpt does not give the precise sample for this metric.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median time to close for diffs
The Production Incident rate for RADAR-reviewed diffs is 1/50 that of non-RADAR diffs.
Comparative observational analysis reported in the paper; production incident rate for RADAR-reviewed diffs compared to non-RADAR diffs, with the relative rate given as 1/50. Exact absolute counts not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... production incident rate (RADAR vs non-RADAR)
The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs.
Comparative observational analysis reported in the paper contrasting RADAR-reviewed diffs with non-RADAR diffs. Underlying counts and exact sample split not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... diff revert rate (RADAR vs non-RADAR)
Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%.
Policy threshold comparison reported in the paper using observational before-after comparisons and system telemetry; approval rate reported as 60.31% after threshold change.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... approve rate of diffs under RADAR as a function of Diff Risk Score threshold
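The effect of moving a percentile threshold, as in this entry, can be sketched as follows. This is an illustration under stated assumptions, not RADAR's logic: the nearest-rank percentile rule, the function names, and the scores are all hypothetical, and the reported 60.31% approve rate depends on factors this sketch does not model.

```python
import math


def percentile_threshold(scores, pct):
    """Nearest-rank percentile of a list of risk scores (assumed rule)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def eligible_for_automation(scores, threshold):
    """Diffs whose risk score is at or below the threshold."""
    return [s for s in scores if s <= threshold]


# Hypothetical risk scores for eight diffs; NOT data from the paper.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.70, 0.90]

strict = percentile_threshold(scores, 25)
relaxed = percentile_threshold(scores, 50)
print(len(eligible_for_automation(scores, strict)))   # 2
print(len(eligible_for_automation(scores, relaxed)))  # 4
```

Raising the percentile admits more diffs into the automated path, which is the lever the entry describes; how many of those are actually approved is a separate question.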
RADAR has reviewed 535K+ diffs and landed 331K+ changes.
System deployment telemetry reported in the paper: 'RADAR has reviewed 535K+ diffs and landed 331K+.'
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... number of diffs reviewed and diffs landed by RADAR
Agentic AI was responsible for over 80% of that growth in code volume.
Attribution analysis reported in the paper linking growth in code/diff volume to agentic AI sources; described as 'over 80% of that growth.' The underlying attribution method is not detailed in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... share of growth in code/diff volume attributable to agentic AI
Per-developer diff volume rose 51% (year over year) at Meta.
Internal telemetry/observational analysis reported in the paper; stated as a 51% increase in per-developer diff volume. No explicit sample size for this specific measure provided in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... per-developer diff volume (year-over-year change)
At Meta, significant lines of code per human-landed diff grew by 105.9% year over year.
Internal telemetry/observational analysis reported in the paper; stated as a year-over-year percentage growth for Meta. No sample size for this specific measure provided in the excerpt.
high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... lines of code per human-landed diff (year-over-year growth)
We discuss implications for Information Systems (IS) design and propose future field evaluations.
Paper includes a discussion section outlining IS design implications and suggestions for future empirical/field work.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... proposed implications and future research directions
The approach preserves statistical rigour, traceability, and nuanced Persevere/Iterate decisions when accelerating experimentation.
Reported outcomes of controlled simulations and description of system design that enforces statistical procedures and logging; stated in manuscript as findings.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... statistical rigour, traceability, and decision quality in experimentation (Perse...
Logs render capabilities observable at the feature level, turning 'agentic AI' into a disciplined experimentation infrastructure rather than a generic assistant.
Implementation logs and descriptions from the Node.js instantiation reported in the paper; qualitative claim about observability and traceability at the feature level.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... feature-level observability/traceability of experimentation activities
The Multi Agent System reduces time-to-validated-learning by roughly an order of magnitude while preserving statistical rigour, traceability, and nuanced Persevere/Iterate decisions.
Results from the controlled simulations reported in the paper (comparison between agentic multi-agent system and manual B-M-L cycles).
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... time-to-validated-learning (and preservation of statistical rigour, traceability...
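One way a Persevere/Iterate decision can be tied to a statistical procedure is sketched below. The entry does not say which test the system uses, so the two-proportion z-test, the 0.05 significance level, the function names, and the counts are all assumptions made for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def decide(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Hypothetical rule: persevere only on a significant improvement."""
    z, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    return "persevere" if (p_value < alpha and z > 0) else "iterate"


# Hypothetical experiment counts; NOT data from the paper.
print(decide(100, 1000, 130, 1000))  # persevere
print(decide(100, 1000, 105, 1000))  # iterate
```

Logging `z`, `p_value`, and the inputs alongside each decision is one way such a rule could stay traceable.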
Controlled simulations compare agentic and manual B-M-L cycles on feature ideas.
Reported controlled simulation experiments in the paper comparing agentic (multi-agent) and manual B-M-L cycles; methodological description present in manuscript.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... comparison of agentic vs manual B-M-L cycles (experimentation performance metric...
We instantiate them in a Node.js package instrumenting a production-grade SaaS codebase.
Implementation artifact reported in the paper (Node.js package) and description of instrumentation on a production-grade SaaS codebase.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... existence and instantiation of a Node.js package that instruments a SaaS codebas...
Drawing on the Dynamic Capabilities View, we derive fifteen meta-requirements and thirty-three design principles (consolidated into seven goal-directed groups) for sensing, seizing, reconfiguring, orchestration, and governance.
Design-theory derivation reported in the paper (counts of meta-requirements and design principles are stated in the manuscript).
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... number and organization of derived meta-requirements and design principles
We propose a multi-agent artefact that operationalises the Build–Measure–Learn (B-M-L) cycle as a closed-loop control system.
Design science study described in the paper; conceptual derivation and artifact instantiation (Node.js package) reported in the manuscript.
high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... operationalisation of the Build–Measure–Learn cycle as a closed-loop control sys...
This paper contributes a theoretically specified mediating mechanism in the algorithmic management and employee silence literature and advances a conceptual framework addressing this relationship in conventional non-platform manufacturing in an emerging economy context.
Author-stated contribution in the abstract summarising the conceptual/theoretical advancement made by the paper.
high positive Algorithmic Management and Acquiescent Silence: The Mediatin... theoretical contribution to literature
The paper advances three formal propositions linking algorithmic management, perceived voice futility, and acquiescent silence, and derives three HRM intervention pathways from the framework.
Explicit claims about the paper's contributions and outputs (theoretical propositions and HRM intervention pathways presented in the manuscript).
high positive Algorithmic Management and Acquiescent Silence: The Mediatin... theoretical propositions and recommended HRM interventions
Specific institutional conditions in Malaysian manufacturing SMEs — HRM informality, digital capability gaps, and technology–governance decoupling — structurally amplify the proposed mechanism linking algorithmic management to acquiescent silence.
Institutional argument developed in the paper (conceptual analysis of contextual factors in Malaysian SMEs; no reported empirical validation).
high positive Algorithmic Management and Acquiescent Silence: The Mediatin... amplification of mechanism (increased likelihood/intensity of perceived voice fu...
Algorithmic management frustrates employees' needs for autonomy, competence, and relatedness, generating a cognitive appraisal of futility that drives resignation-based acquiescent silence.
Theoretical argument in the paper using self-determination theory and organisational silence theory (conceptual reasoning; no empirical data reported).
high positive Algorithmic Management and Acquiescent Silence: The Mediatin... need frustration (autonomy/competence/relatedness) and acquiescent silence
Perceived voice futility is the mediating mechanism connecting algorithmic management to acquiescent silence in conventional manufacturing workplaces.
Conceptual framework developed in the paper drawing on self-determination theory and organisational silence theory (theoretical proposition, no primary empirical test reported).
high positive Algorithmic Management and Acquiescent Silence: The Mediatin... acquiescent silence (employee silence behaviour)
Algorithmic management systems are increasingly deployed in manufacturing small and medium-sized enterprises (SMEs) in Malaysia under the Industry 4.0 agenda.
Author statement in paper abstract; asserted based on observation/literature about Industry 4.0 adoption in Malaysia (conceptual/descriptive claim).
high positive Algorithmic Management and Acquiescent Silence: The Mediatin... deployment/adoption of algorithmic management systems
There is a need for privacy-preserving deployments and richer, structure-aware representations of human knowledge for practical use.
Authors' recommendation/conclusion drawn from observed accuracy/limitations and privacy considerations in using long-term Slack logs.
high positive Can AI Guess What You Know? Performance Comparison of Large ... requirement for privacy-preserving deployment practices and improved representat...
Gemini 2.5 Flash achieved the lowest error (MAE 21.13%).
Reported model evaluation results comparing MAE across models; Gemini 2.5 Flash reported as lowest with MAE 21.13%.
high positive Can AI Guess What You Know? Performance Comparison of Large ... mean absolute error (MAE) of skill estimates
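Mean absolute error, the metric in this entry, is the average of the absolute differences between estimated and reference values. The sketch below assumes skill scores on a 0 to 100 scale so that the result reads as percentage points; that scale, the function name, and the values are illustrative assumptions, not taken from the paper.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical skill scores (0-100) for four users; NOT data from the paper.
self_reported = [80, 50, 30, 60]
model_estimate = [60, 70, 40, 30]

mae = mean_absolute_error(self_reported, model_estimate)
print(mae)  # 20.0
```

Because errors are not squared, a single large miss does not dominate the score the way it would under root mean squared error.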
We analyze 27,188 messages from 43 users to investigate whether LLMs can infer individual domain knowledge from long-term Slack logs.
Dataset description reported in the paper: 27,188 Slack messages from 43 users.
high positive Can AI Guess What You Know? Performance Comparison of Large ... dataset size and coverage (messages and users analyzed)
Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
Statement in abstract and provided URL pointing to project artifacts.
high positive MUSE: Benchmarking Manufacturable, Functional, and Assemblab... availability of project website, leaderboard, dataset, and code
Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.
Paper's stated contribution and intended purpose (abstract) and provision of dataset/benchmark artifacts via project website.
high positive MUSE: Benchmarking Manufacturable, Functional, and Assemblab... utility of benchmark and evaluation framework for advancing Text-to-CAD toward e...