Evidence (6574 claims)
Claim counts by topic.
| Topic | Claims |
|---|---|
| Adoption | 8625 |
| Productivity | 7686 |
| Governance | 6917 |
| Human-AI Collaboration | 6574 |
| Org Design | 4189 |
| Innovation | 4131 |
| Labor Markets | 3588 |
| Skills & Training | 2985 |
| Inequality | 2066 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
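One way to read the matrix is to convert a row's counts into shares of its total. The sketch below is an illustration only and not part of the source page: the function name `direction_shares` is hypothetical, and the two rows are copied from the table. In some rows the four direction columns sum to less than the Total column (for example, Firm Productivity: 604 versus 610); the page does not explain the difference, so the sketch reports it as an "unclassified" remainder, which is an assumption about how to label it.

```python
def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the row total.

    Any gap between the Total column and the sum of the four
    direction columns is reported as "unclassified" (a label
    assumed for this sketch, not taken from the source).
    """
    counts = {
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
    }
    shares = {name: count / total for name, count in counts.items()}
    shares["unclassified"] = (total - sum(counts.values())) / total
    return shares


# Rows copied from the evidence matrix.
job_displacement = direction_shares(12, 81, 21, 1, 115)
firm_productivity = direction_shares(439, 57, 88, 20, 610)

print(round(job_displacement["negative"], 3))
print(round(firm_productivity["positive"], 3))
print(round(firm_productivity["unclassified"], 3))
```

For Job Displacement this gives a negative share of about 0.704. For Firm Productivity it gives a positive share of about 0.720, with roughly 1% of claims falling outside the four listed directions.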
Human-AI Collaboration
We developed a triadic collaboration system to support K-12 writing learning that coordinates LLMs, teachers, and students.
Methodological claim stated in the abstract: the authors designed and developed a triadic collaboration system for K-12 writing learning. The system was presumably implemented and evaluated on the paper's dataset, though the excerpt does not say so.
Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.
Statement of planned/ongoing work in the paper regarding future benchmark inclusion to address bias and human-centered use cases; no empirical results provided.
Implemented tests include causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes.
Specific list of implemented test categories provided in the paper; descriptive/reporting evidence from the initiative's work.
Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion.
Paper reports that a set of tests has been implemented and applied to AI tools across qualitative and quantitative modeling and discussion; no sample sizes or numeric evaluation results provided in the excerpt.
A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests.
Organizational description in the paper specifying roles (steering group and technical group); no quantitative evaluation reported.
The open-source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly.
Descriptive statement about the open-source project hosted by the initiative; no empirical measures of transparency or contribution sharing provided.
The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation.
Descriptive claim in the paper about organizational approach (open infrastructure and collaborative evaluation); no empirical testing or sample size reported.
The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices.
Descriptive statement about the Initiative's stated aims and purpose in the paper; organizational description rather than empirical evidence.
Tools that can automate aspects of modeling practice must complement human expertise, not replace it.
Normative claim made in the paper (argument about human-centered design); no empirical evidence or sample size reported.
AI tools to support real-world decision making must be able to build simulation models that inform their recommendations and render them interpretable.
Normative assertion in the paper (position statement / requirement); no empirical study or sample size reported.
The agentic future is not predetermined; leaders must both skate to where the puck is going and actively steer it toward a good place, ensuring innovation delivers welfare gains felt by businesses and consumers around the world.
Normative recommendation offered by the authors; based on conceptual argument and interpretation of the framework rather than empirical testing in the excerpt.
These complementary investments produce the familiar 'productivity J-curve' of general-purpose technologies.
Stated as an economic analogy/claim drawing on general-purpose technology literature; presented as an asserted mechanism rather than shown with new empirical estimates in the excerpt.
The most consequential disruption resides in the third stage (Reconstruction) where workflows and markets are rebuilt around delegation, machine-to-machine interaction, continuous monitoring, and auditable constraints.
Theoretical claim in the paper backed by conceptual reasoning and illustrative sector examples; no quantitative evidence provided in the excerpt.
The system preserves human agency via override mechanisms.
Design description of the collaborative forecasting system that explicitly includes override controls for human users.
The paper provides a rigorous blueprint for designing synergistic, trustworthy, and diagnostic operational planning tools, contributing to the discourse on human-AI collaboration and sustainable information systems (IS).
Stated contribution in the paper's conclusions: presentation of a blueprint and implications for human-AI collaboration and sustainable IS.
Two think-aloud sessions show that human judgment remains critical for high-uncertainty events.
Qualitative evaluation consisting of two think-aloud sessions reported in the paper.
Algorithmic benchmarking reduced forecast errors by 30% over naive baselines.
Quantitative algorithmic benchmarking reported in the evaluation section of the paper (comparison vs. naive baselines).
Because reputation-based, ex post sanctions cannot be relied upon for dissociative agents, governance should shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.
Prescriptive recommendation derived from the theoretical critique of identity-based governance; paper proposes observability- and protocol-focused alternatives but does not present empirical tests or trials.
Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility.
Conceptual/theoretical argument presented in the paper drawing on reputation theory and social signaling; no empirical sample or quantitative data reported.
Teams interacting with more embodied agents display conversational patterns that more closely resemble human–human dialogue.
Conversational analysis comparing dialogue patterns across teams interacting with different embodiment levels; the abstract reports greater similarity to human–human dialogue for teams with higher embodiment agents, but does not provide the similarity metric values or sample sizes.
Human-only teams are more likely to complete all tasks successfully (higher task completion success) than mixed human–AI teams.
Comparison of task completion success between human-only teams and mixed teams in the escape room experiment as reported in the paper; no numerical completion rates provided in the abstract.
Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
Synthesis conclusion based on RADAR deployment results, telemetry (535K+ reviewed diffs, 331K+ landed), and comparative analyses (before-after and difference-in-differences) reported in the paper.
RADAR reduces median diff review wall time by 35%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median diff review wall time reduction reported as 35%. Sample likely drawn from RADAR telemetry (535K+ diffs) though not explicitly stated for this metric in the excerpt.
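The excerpt does not give the paper's difference-in-differences specification. As a rough illustration of the general idea only, the sketch below computes the textbook two-group, two-period estimate: the change in the treated group minus the change in the comparison group. The function name and all numbers are hypothetical and are not RADAR data.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Textbook two-group, two-period difference-in-differences.

    Returns the change in the treated group net of the change in the
    control group. This is a generic estimator, not the paper's model.
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical median review wall times in hours (illustrative only).
effect = difference_in_differences(
    treated_before=10.0, treated_after=6.0,
    control_before=10.0, control_after=9.5,
)
print(effect)  # -3.5
```

In this made-up example the treated group improves by 4.0 hours and the comparison group by 0.5 hours, so 3.5 hours of the improvement is attributed to the treatment. That attribution rests on the usual assumption that both groups would otherwise have followed parallel trends.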
RADAR reduces median time to close by over 330%.
Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median time-to-close reduction reported as 'over 330%'. Underlying sample for efficiency analysis likely from the RADAR telemetry (535K+ diffs), though the excerpt does not give the precise sample for this metric.
The Production Incident rate for RADAR-reviewed diffs is 1/50 that of non-RADAR diffs.
Comparative observational analysis reported in the paper; production incident rate for RADAR-reviewed diffs compared to non-RADAR diffs, with the relative rate given as 1/50. Exact absolute counts not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs.
Comparative observational analysis reported in the paper contrasting RADAR-reviewed diffs with non-RADAR diffs. Underlying counts and exact sample split not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.
Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%.
Policy threshold comparison reported in the paper using observational before-after comparisons and system telemetry; approval rate reported as 60.31% after threshold change.
RADAR has reviewed 535K+ diffs and landed 331K+ changes.
System deployment telemetry reported in the paper: 'RADAR has reviewed 535K+ diffs and landed 331K+.'
Agentic AI was responsible for over 80% of that growth in code volume.
Attribution analysis reported in the paper linking growth in code/diff volume to agentic AI sources; described as 'over 80% of that growth.' The underlying attribution method is not detailed in the excerpt.
Per-developer diff volume rose 51% (year over year) at Meta.
Internal telemetry/observational analysis reported in the paper; stated as a 51% increase in per-developer diff volume. No explicit sample size for this specific measure provided in the excerpt.
At Meta, significant lines of code per human-landed diff grew by 105.9% year over year.
Internal telemetry/observational analysis reported in the paper; stated as a year-over-year percentage growth for Meta. No sample size for this specific measure provided in the excerpt.
We discuss implications for Information Systems (IS) design and propose future field evaluations.
Paper includes a discussion section outlining IS design implications and suggestions for future empirical/field work.
The approach preserves statistical rigour, traceability, and nuanced Persevere/Iterate decisions when accelerating experimentation.
Reported outcomes of controlled simulations and description of system design that enforces statistical procedures and logging; stated in manuscript as findings.
Logs render capabilities observable at the feature level, turning 'agentic AI' into a disciplined experimentation infrastructure rather than a generic assistant.
Implementation logs and descriptions from the Node.js instantiation reported in the paper; qualitative claim about observability and traceability at the feature level.
The Multi Agent System reduces time-to-validated-learning by roughly an order of magnitude while preserving statistical rigour, traceability, and nuanced Persevere/Iterate decisions.
Results from the controlled simulations reported in the paper (comparison between agentic multi-agent system and manual B-M-L cycles).
Controlled simulations compare agentic and manual B-M-L cycles on feature ideas.
Reported controlled simulation experiments in the paper comparing agentic (multi-agent) and manual B-M-L cycles; methodological description present in manuscript.
We instantiate them in a Node.js package instrumenting a production-grade SaaS codebase.
Implementation artifact reported in the paper (Node.js package) and description of instrumentation on a production-grade SaaS codebase.
Drawing on the Dynamic Capabilities View, we derive fifteen meta-requirements and thirty-three design principles (consolidated into seven goal-directed groups) for sensing, seizing, reconfiguring, orchestration, and governance.
Design-theory derivation reported in the paper (counts of meta-requirements and design principles are stated in the manuscript).
We propose a multi-agent artefact that operationalises the Build–Measure–Learn (B-M-L) cycle as a closed-loop control system.
Design science study described in the paper; conceptual derivation and artifact instantiation (Node.js package) reported in the manuscript.
This paper contributes a theoretically specified mediating mechanism in the algorithmic management and employee silence literature and advances a conceptual framework addressing this relationship in conventional non-platform manufacturing in an emerging economy context.
Author-stated contribution in the abstract summarising the conceptual/theoretical advancement made by the paper.
The paper advances three formal propositions linking algorithmic management, perceived voice futility, and acquiescent silence, and derives three HRM intervention pathways from the framework.
Explicit claims about the paper's contributions and outputs (theoretical propositions and HRM intervention pathways presented in the manuscript).
Specific institutional conditions in Malaysian manufacturing SMEs — HRM informality, digital capability gaps, and technology–governance decoupling — structurally amplify the proposed mechanism linking algorithmic management to acquiescent silence.
Institutional argument developed in the paper (conceptual analysis of contextual factors in Malaysian SMEs; no reported empirical validation).
Algorithmic management frustrates employees' needs for autonomy, competence, and relatedness, generating a cognitive appraisal of futility that drives resignation-based acquiescent silence.
Theoretical argument in the paper using self-determination theory and organisational silence theory (conceptual reasoning; no empirical data reported).
Perceived voice futility is the mediating mechanism connecting algorithmic management to acquiescent silence in conventional manufacturing workplaces.
Conceptual framework developed in the paper drawing on self-determination theory and organisational silence theory (theoretical proposition, no primary empirical test reported).
Algorithmic management systems are increasingly deployed in manufacturing small and medium-sized enterprises (SMEs) in Malaysia under the Industry 4.0 agenda.
Author statement in paper abstract; asserted based on observation/literature about Industry 4.0 adoption in Malaysia (conceptual/descriptive claim).
There is a need for privacy-preserving deployments and richer, structure-aware representations of human knowledge for practical use.
Authors' recommendation/conclusion drawn from observed accuracy/limitations and privacy considerations in using long-term Slack logs.
Gemini 2.5 Flash achieved the lowest error (MAE 21.13%).
Reported model evaluation results comparing MAE across models; Gemini 2.5 Flash reported as lowest with MAE 21.13%.
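For readers unfamiliar with the metric, mean absolute error (MAE) is the average absolute gap between predicted and reference values. The sketch below is a minimal illustration; it assumes scores on a 0 to 100 scale so that the result reads as percentage points, which the excerpt does not confirm, and every value in it is hypothetical rather than taken from the paper.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired predictions and references."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one pair of values is required")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical knowledge scores for four users (0-100 scale, illustrative only).
predicted = [60.0, 35.0, 80.0, 50.0]
reference = [70.0, 20.0, 90.0, 45.0]

mae = mean_absolute_error(predicted, reference)
print(mae)  # 10.0
```

Lower values mean the model's estimates sit closer to the reference scores on average. Because MAE averages absolute gaps, one large miss raises it less than it would raise a squared-error metric.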
We analyze 27,188 messages from 43 users to investigate whether LLMs can infer individual domain knowledge from long-term Slack logs.
Dataset description reported in the paper: 27,188 Slack messages from 43 users.
Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.
Statement in abstract and provided URL pointing to project artifacts.
Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.
Paper's stated contribution and intended purpose (abstract) and provision of dataset/benchmark artifacts via project website.