Evidence (6574 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	761	200	101	904	2020
Governance & Regulation	829	400	191	122	1566
Organizational Efficiency	784	193	125	84	1197
Technology Adoption Rate	637	236	124	97	1103
Research Productivity	431	131	58	340	972
Output Quality	481	183	59	47	770
Decision Quality	332	177	82	49	647
Firm Productivity	439	57	88	20	610
AI Safety & Ethics	218	279	66	33	602
Market Structure	181	170	123	24	503
Task Allocation	214	64	72	33	388
Skill Acquisition	174	62	62	17	315
Innovation Output	204	27	45	18	295
Employment Level	105	54	108	13	282
Fiscal & Macroeconomic	132	69	43	26	277
Consumer Welfare	117	63	42	11	233
Firm Revenue	154	48	26	3	231
Task Completion Time	173	31	8	12	225
Inequality Measures	44	123	50	6	223
Worker Satisfaction	89	65	22	12	188
Error Rate	71	92	10	2	175
Regulatory Compliance	77	69	14	5	165
Automation Exposure	58	56	26	13	156
Training Effectiveness	96	21	14	19	152
Wages & Compensation	77	37	25	6	145
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	81	21	1	115
Hiring & Recruitment	52	7	8	3	70
Creative Output	32	20	8	3	64
Skill Obsolescence	5	47	6	1	59
Social Protection	28	16	8	2	54
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Filter: Human-AI Collab

We developed a triadic collaboration system to support K-12 writing learning that coordinates LLMs, teachers, and students.

Methodological claim stated in the abstract that the authors designed and developed a triadic collaboration system for K-12 writing learning; presumably implemented and evaluated using the dataset.

high positive Double-Edged Sword or Sharp Tool? Designing and Evaluating T... presence and functionality of the triadic collaboration system

Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human centered use cases.

Statement of planned/ongoing work in the paper regarding future benchmark inclusion to address bias and human-centered use cases; no empirical results provided.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... planned incorporation of bias-aware benchmarks and human-centered use case consi...

Implemented tests include causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes.

Specific list of implemented test categories provided in the paper; descriptive/reporting evidence from the initiative's work.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... types/categories of tests implemented

Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion.

Paper reports that a set of tests has been implemented and applied to AI tools across qualitative and quantitative modeling and discussion; no sample sizes or numeric evaluation results provided in the excerpt.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and application of implemented evaluation tests across types of modeli...

A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests.

Organizational description in the paper specifying roles (steering group and technical group); no quantitative evaluation reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... organizational roles for benchmark prioritization and implementation

The open source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly.

Descriptive statement about the open-source project hosted by the initiative; no empirical measures of transparency or contribution sharing provided.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... transparency and breadth of contributions enabled by the open source sd ai proje...

The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation.

Descriptive claim in the paper about organizational approach (open infrastructure and collaborative evaluation); no empirical testing or sample size reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... use of open infrastructure for collaborative evaluation

The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human centered modeling and simulation practices.

Descriptive statement about the Initiative's stated aims and purpose in the paper; organizational description rather than empirical evidence.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... existence and purpose of the BEAMS Initiative (benchmarking for responsible/ethi...

Tools that can automate aspects of modeling practice must complement human expertise, not replace it.

Normative claim made in the paper (argument about human-centered design); no empirical evidence or sample size reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... relationship between automated modeling tools and human expertise (complementari...

AI tools to support real world decision making must be able to build simulation models that inform their recommendations and render them interpretable.

Normative assertion in the paper (position statement / requirement); no empirical study or sample size reported.

high positive BEAMS: Benchmarking and Evaluating AI for Modeling and Simul... ability of AI tools to build interpretable simulation models that inform recomme...

The agentic future is not predetermined; leaders must both skate to where the puck is going and actively steer it toward a good place, ensuring innovation delivers welfare gains felt by businesses and consumers around the world.

Normative recommendation offered by the authors; based on conceptual argument and interpretation of the framework rather than empirical testing in the excerpt.

high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... policy/leadership influence on welfare distribution of AI-driven innovation

These complementary investments produce the familiar 'productivity J-curve' of general-purpose technologies.

Stated as an economic analogy/claim drawing on general-purpose technology literature; presented as an asserted mechanism rather than shown with new empirical estimates in the excerpt.

high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... productivity trajectory (J-curve) following complementary investments

The most consequential disruption resides in the third stage (Reconstruction) where workflows and markets are rebuilt around delegation, machine-to-machine interaction, continuous monitoring, and auditable constraints.

Theoretical claim in the paper backed by conceptual reasoning and illustrative sector examples; no quantitative evidence provided in the excerpt.

high positive From Augmentation to Reconstruction: Guiding the AI Disrupti... magnitude/importance of disruption arising from Reconstruction-stage changes

The system preserves human agency via override mechanisms.

Design description of the collaborative forecasting system that explicitly includes override controls for human users.

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... preservation of human agency (ability to override algorithmic forecasts)

The paper provides a rigorous blueprint for designing synergistic, trustworthy, and diagnostic operational planning tools, contributing to the discourse on human-AI collaboration and sustainable information systems (IS).

Stated contribution in the paper's conclusions: presentation of a blueprint and implications for human-AI collaboration and sustainable IS.

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... guidance/blueprint for operational planning tool design

Two think-aloud sessions show that human judgment remains critical for high-uncertainty events.

Qualitative evaluation consisting of two think-aloud sessions reported in the paper.

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... importance/role of human judgment in handling high-uncertainty forecasting event...

Algorithmic benchmarking reduced forecast errors by 30% over naive baselines.

Quantitative algorithmic benchmarking reported in the evaluation section of the paper (comparison vs. naive baselines).

high positive Schnitzel-Prediction: Designing Human-Ai Collaboration For C... forecast error

Because reputation-based, ex post sanctions cannot be relied upon for dissociative agents, governance should shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.

Prescriptive recommendation derived from the theoretical critique of identity-based governance; paper proposes observability- and protocol-focused alternatives but does not present empirical tests or trials.

high positive Dissociative Identity: Language Model Agents Lack Grounding ... governance effectiveness of observability-based, ex ante protocol mechanisms

Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility.

Conceptual/theoretical argument presented in the paper drawing on reputation theory and social signaling; no empirical sample or quantitative data reported.

high positive Dissociative Identity: Language Model Agents Lack Grounding ... trustworthy behavior (sustaining equilibrium of trust)

Teams interacting with more embodied agents display conversational patterns that more closely resemble human–human dialogue.

Conversational analysis comparing dialogue patterns across teams interacting with different embodiment levels; the abstract reports greater similarity to human–human dialogue for teams with higher embodiment agents, but does not provide the similarity metric values or sample sizes.

high positive Teaming Up with Artificial Agents in Non-routine Analytical ... conversational pattern similarity to human–human dialogue

Human-only teams are more likely to complete all tasks successfully (higher task completion success) than mixed human–AI teams.

Comparison of task completion success between human-only teams and mixed teams in the escape room experiment as reported in the paper; no numerical completion rates provided in the abstract.

high positive Teaming Up with Artificial Agents in Non-routine Analytical ... task completion / success rate

Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.

Synthesis conclusion based on RADAR deployment results, telemetry (535K+ reviewed diffs, 331K+ landed), and comparative analyses (before-after and difference-in-differences) reported in the paper.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... reduction in review bottlenecks and preservation of production safety

RADAR reduces median diff review wall time by 35%.

Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median diff review wall time reduction reported as 35%. Sample likely drawn from RADAR telemetry (535K+ diffs) though not explicitly stated for this metric in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median diff review wall time
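For readers unfamiliar with the difference-in-differences analysis named in the entry above, here is a minimal sketch of the basic two-group, two-period estimator. The function name and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's actual specification is not described in the excerpt.

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Return the two-group, two-period difference-in-differences estimate.

    The change in the control group is subtracted from the change in the
    treated group, so a time trend shared by both groups cancels out.
    """
    treated_change = treated_after - treated_before
    control_change = control_after - control_before
    return treated_change - control_change


# Hypothetical median review wall times in hours (illustrative only).
effect = diff_in_diff(
    treated_before=10.0, treated_after=6.0,
    control_before=10.0, control_after=9.5,
)
print(effect)  # -3.5
```

In this made-up example the treated group improves by 4.0 hours and the control group by 0.5 hours, so 3.5 hours of the improvement is attributed to the treatment under the usual parallel-trends assumption.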

RADAR reduces median time to close by over 330%.

Efficiency outcomes reported via telemetry and difference-in-differences analysis stated in the paper; median time-to-close reduction reported as 'over 330%'. Underlying sample for efficiency analysis likely from the RADAR telemetry (535K+ diffs), though the excerpt does not give the precise sample for this metric.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... median time to close for diffs

The Production Incident rate for RADAR-reviewed diffs is 1/50 that of non-RADAR diffs.

Comparative observational analysis reported in the paper; production incident rate for RADAR-reviewed diffs compared to non-RADAR diffs, with the relative rate given as 1/50. Exact absolute counts not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... production incident rate (RADAR vs non-RADAR)

The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs.

Comparative observational analysis reported in the paper contrasting RADAR-reviewed diffs with non-RADAR diffs. Underlying counts and exact sample split not provided in the excerpt; overall RADAR telemetry covers 535K+ diffs.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... diff revert rate (RADAR vs non-RADAR)

Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%.

Policy threshold comparison reported in the paper using observational before-after comparisons and system telemetry; approval rate reported as 60.31% after threshold change.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... approve rate of diffs under RADAR as a function of Diff Risk Score threshold
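The effect of moving a percentile threshold can be sketched as follows. This is an illustration only: it assumes that a lower Diff Risk Score means lower risk and that diffs at or below the cutoff are eligible for automated review, neither of which is confirmed by the excerpt. The function names and scores are hypothetical, and the share of eligible diffs shown here is not the same quantity as the reported approve rate.

```python
import math


def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: the smallest score such that at least
    pct percent of all scores are at or below it."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def eligible(scores, pct):
    """Return the scores at or below the pct-th percentile cutoff."""
    cutoff = percentile_cutoff(scores, pct)
    return [s for s in scores if s <= cutoff]


# Hypothetical risk scores for eight diffs (illustrative only).
scores = [0.02, 0.05, 0.10, 0.20, 0.35, 0.50, 0.70, 0.90]
print(len(eligible(scores, 25)))  # 2
print(len(eligible(scores, 50)))  # 4
```

Relaxing the cutoff from the 25th to the 50th percentile doubles the eligible pool in this toy data, which is the kind of mechanism by which a looser threshold could raise the number of diffs handled automatically.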

RADAR has reviewed 535K+ diffs and landed 331K+ changes.

System deployment telemetry reported in the paper: 'RADAR has reviewed 535K+ diffs and landed 331K+.'

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... number of diffs reviewed and diffs landed by RADAR

Agentic AI was responsible for over 80% of that growth in code volume.

Attribution analysis reported in the paper linking growth in code/diff volume to agentic AI sources; described as 'over 80% of that growth.' The underlying attribution method is not detailed in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... share of growth in code/diff volume attributable to agentic AI

Per-developer diff volume rose 51% (year over year) at Meta.

Internal telemetry/observational analysis reported in the paper; stated as a 51% increase in per-developer diff volume. No explicit sample size for this specific measure provided in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... per-developer diff volume (year-over-year change)

At Meta, significant lines of code per human-landed diff grew by 105.9% year over year.

Internal telemetry/observational analysis reported in the paper; stated as a year-over-year percentage growth for Meta. No sample size for this specific measure provided in the excerpt.

high positive Automating Low-Risk Code Review at Meta: RADAR, Risk Calibra... lines of code per human-landed diff (year-over-year growth)

We discuss implications for Information Systems (IS) design and propose future field evaluations.

Paper includes a discussion section outlining IS design implications and suggestions for future empirical/field work.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... proposed implications and future research directions

The approach preserves statistical rigour, traceability, and nuanced Persevere/Iterate decisions when accelerating experimentation.

Reported outcomes of controlled simulations and description of system design that enforces statistical procedures and logging; stated in manuscript as findings.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... statistical rigour, traceability, and decision quality in experimentation (Perse...

Logs render capabilities observable at the feature level, turning 'agentic AI' into a disciplined experimentation infrastructure rather than a generic assistant.

Implementation logs and descriptions from the Node.js instantiation reported in the paper; qualitative claim about observability and traceability at the feature level.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... feature-level observability/traceability of experimentation activities

The Multi Agent System reduces time-to-validated-learning by roughly an order of magnitude while preserving statistical rigour, traceability, and nuanced Persevere/Iterate decisions.

Results from the controlled simulations reported in the paper (comparison between agentic multi-agent system and manual B-M-L cycles).

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... time-to-validated-learning (and preservation of statistical rigour, traceability...

Controlled simulations compare agentic and manual B-M-L cycles on feature ideas.

Reported controlled simulation experiments in the paper comparing agentic (multi-agent) and manual B-M-L cycles; methodological description present in manuscript.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... comparison of agentic vs manual B-M-L cycles (experimentation performance metric...

We instantiate them in a Node.js package instrumenting a production-grade SaaS codebase.

Implementation artifact reported in the paper (Node.js package) and description of instrumentation on a production-grade SaaS codebase.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... existence and instantiation of a Node.js package that instruments a SaaS codebas...

Drawing on the Dynamic Capabilities View, we derive fifteen meta-requirements and thirty-three design principles (consolidated into seven goal-directed groups) for sensing, seizing, reconfiguring, orchestration, and governance.

Design-theory derivation reported in the paper (counts of meta-requirements and design principles are stated in the manuscript).

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... number and organization of derived meta-requirements and design principles

We propose a multi-agent artefact that operationalises the Build–Measure–Learn (B-M-L) cycle as a closed-loop control system.

Design science study described in the paper; conceptual derivation and artifact instantiation (Node.js package) reported in the manuscript.

high positive Multi Agent Systems In The Lean Startup Cycle: Operationalis... operationalisation of the Build–Measure–Learn cycle as a closed-loop control sys...

This paper contributes a theoretically specified mediating mechanism in the algorithmic management and employee silence literature and advances a conceptual framework addressing this relationship in conventional non-platform manufacturing in an emerging economy context.

Author-stated contribution in the abstract summarising the conceptual/theoretical advancement made by the paper.

high positive Algorithmic Management and Acquiescent Silence: The Mediatin... theoretical contribution to literature

The paper advances three formal propositions linking algorithmic management, perceived voice futility, and acquiescent silence, and derives three HRM intervention pathways from the framework.

Explicit claims about the paper's contributions and outputs (theoretical propositions and HRM intervention pathways presented in the manuscript).

high positive Algorithmic Management and Acquiescent Silence: The Mediatin... theoretical propositions and recommended HRM interventions

Specific institutional conditions in Malaysian manufacturing SMEs — HRM informality, digital capability gaps, and technology–governance decoupling — structurally amplify the proposed mechanism linking algorithmic management to acquiescent silence.

Institutional argument developed in the paper (conceptual analysis of contextual factors in Malaysian SMEs; no reported empirical validation).

high positive Algorithmic Management and Acquiescent Silence: The Mediatin... amplification of mechanism (increased likelihood/intensity of perceived voice fu...

Algorithmic management frustrates employees' needs for autonomy, competence, and relatedness, generating a cognitive appraisal of futility that drives resignation-based acquiescent silence.

Theoretical argument in the paper using self-determination theory and organisational silence theory (conceptual reasoning; no empirical data reported).

high positive Algorithmic Management and Acquiescent Silence: The Mediatin... need frustration (autonomy/competence/relatedness) and acquiescent silence

Perceived voice futility is the mediating mechanism connecting algorithmic management to acquiescent silence in conventional manufacturing workplaces.

Conceptual framework developed in the paper drawing on self-determination theory and organisational silence theory (theoretical proposition, no primary empirical test reported).

high positive Algorithmic Management and Acquiescent Silence: The Mediatin... acquiescent silence (employee silence behaviour)

Algorithmic management systems are increasingly deployed in manufacturing small and medium-sized enterprises (SMEs) in Malaysia under the Industry 4.0 agenda.

Author statement in paper abstract; asserted based on observation/literature about Industry 4.0 adoption in Malaysia (conceptual/descriptive claim).

high positive Algorithmic Management and Acquiescent Silence: The Mediatin... deployment/adoption of algorithmic management systems

There is a need for privacy-preserving deployments and richer, structure-aware representations of human knowledge for practical use.

Authors' recommendation/conclusion drawn from observed accuracy/limitations and privacy considerations in using long-term Slack logs.

high positive Can AI Guess What You Know? Performance Comparison of Large ... requirement for privacy-preserving deployment practices and improved representat...

Gemini 2.5 Flash achieved the lowest error (MAE 21.13%).

Reported model evaluation results comparing MAE across models; Gemini 2.5 Flash reported as lowest with MAE 21.13%.

high positive Can AI Guess What You Know? Performance Comparison of Large ... mean absolute error (MAE) of skill estimates
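Mean absolute error, the metric reported in the entry above, can be computed as in the sketch below. The scores are hypothetical and assume a 0 to 100 scale, so the result reads as percentage points; the paper's exact scoring scale is not stated in the excerpt.

```python
def mean_absolute_error(predicted, actual):
    """Average of the absolute differences between paired values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical skill scores on a 0-100 scale (illustrative only).
predicted = [60, 80, 40, 90]   # model estimates
actual = [50, 70, 70, 100]     # reference ratings
print(mean_absolute_error(predicted, actual))  # 15.0
```

A lower value means the estimates sit closer to the reference ratings on average; the metric does not show whether errors are over- or underestimates.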

We analyze 27,188 messages from 43 users to investigate whether LLMs can infer individual domain knowledge from long-term Slack logs.

Dataset description reported in the paper: 27,188 Slack messages from 43 users.

high positive Can AI Guess What You Know? Performance Comparison of Large ... dataset size and coverage (messages and users analyzed)

Our project website, including the leaderboard, dataset, and code, is available at https://dong7313.github.io/muse-benchmark/.

Statement in abstract and provided URL pointing to project artifacts.

high positive MUSE: Benchmarking Manufacturable, Functional, and Assemblab... availability of project website, leaderboard, dataset, and code

Together, MUSE provides a realistic benchmark and evaluation framework for advancing Text-to-CAD from geometric generation toward true engineering design.

Paper's stated contribution and intended purpose (abstract) and provision of dataset/benchmark artifacts via project website.

high positive MUSE: Benchmarking Manufacturable, Functional, and Assemblab... utility of benchmark and evaluation framework for advancing Text-to-CAD toward e...
