Evidence (6491 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	758	199	100	900	2007
Governance & Regulation	826	400	191	122	1563
Organizational Efficiency	777	193	124	84	1189
Technology Adoption Rate	635	233	124	97	1098
Research Productivity	422	128	57	336	954
Output Quality	476	179	59	47	761
Decision Quality	328	177	81	47	640
Firm Productivity	435	57	88	20	606
AI Safety & Ethics	218	277	65	33	599
Market Structure	180	170	123	24	502
Task Allocation	213	64	72	33	387
Skill Acquisition	170	61	61	17	309
Innovation Output	203	27	43	18	292
Employment Level	105	54	107	13	281
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	117	63	42	11	233
Firm Revenue	153	48	26	3	230
Task Completion Time	173	31	8	12	225
Inequality Measures	44	122	49	6	221
Worker Satisfaction	89	65	22	12	188
Error Rate	69	92	10	2	173
Regulatory Compliance	77	69	14	5	165
Automation Exposure	56	56	26	13	154
Training Effectiveness	94	21	13	19	149
Wages & Compensation	77	36	25	6	144
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	80	20	1	113
Hiring & Recruitment	52	7	8	3	70
Creative Output	31	18	8	3	61
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
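Read row-wise, the counts are easier to compare as shares. The sketch below is a minimal illustration and not part of the database: the three rows are copied from the table above, and the helper name `direction_shares` is hypothetical. The four direction counts do not always add up to the Total column (for Firm Productivity they sum to 600 against a Total of 606), so shares are taken over the counted claims and the gap is reported separately. The page does not say why the gap exists.

```python
from typing import Dict, Tuple

# (positive, negative, mixed, null, total) copied from the Evidence Matrix.
ROWS: Dict[str, Tuple[int, int, int, int, int]] = {
    "Firm Productivity": (435, 57, 88, 20, 606),
    "AI Safety & Ethics": (218, 277, 65, 33, 599),
    "Job Displacement": (12, 80, 20, 1, 113),
}


def direction_shares(row: Tuple[int, int, int, int, int]) -> Dict[str, float]:
    """Return each direction's share of the counted claims in one row.

    Hypothetical helper: shares are computed over the sum of the four
    direction counts, not over the Total column, because the two can differ.
    """
    pos, neg, mixed, null, total = row
    counted = pos + neg + mixed + null
    if counted == 0:
        raise ValueError("row has no counted claims")
    return {
        "positive": pos / counted,
        "negative": neg / counted,
        "mixed": mixed / counted,
        "null": null / counted,
        # Claims in Total that carry no direction count on the page.
        "unclassified": float(total - counted),
    }


for outcome, row in ROWS.items():
    shares = direction_shares(row)
    print(f"{outcome}: {shares['positive']:.1%} positive, "
          f"{shares['negative']:.1%} negative, "
          f"{int(shares['unclassified'])} without a direction")
```

Dividing by the counted claims keeps the four shares summing to one even where the Total column is larger.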

Filter: Human-AI Collaboration

The dataset, contexts, annotations, and evaluation harness are released publicly.

Paper states that dataset, contexts, annotations, and evaluation harness are released publicly (release / open-source claim).

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... public release / availability

A structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt (enriched with execution context, behaviour mapping, and test signatures) across all 8 models.

Direct prompt/context-size comparison across the 8 models on SWE-PRBench; the paper reports that the 2,000-token diff-with-summary prompt performs better than the 2,500-token full-context prompt with extra enrichments.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... model detection/performance under specific prompt/context designs

The LLM-as-judge framework used for evaluation is validated at kappa = 0.75.

Inter-judge validation reported in paper (agreement metric kappa reported as 0.75). Specific validation sample size not stated in the excerpt.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... judge reliability / inter-annotator (or LLM-judge) agreement
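The excerpt does not say which kappa statistic the paper uses or how many items were judged. As a rough illustration of what a value of 0.75 means, the sketch below assumes Cohen's kappa for two raters. The function name and the toy labels are invented for this example and are not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Assumption: two raters labelling the same items. The paper's actual
    validation protocol is not described in the excerpt.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label]
                   for label in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical): 1 = "review comment is valid", 0 = "not valid".
human = [1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 1, 0, 0, 0, 1]
print(cohens_kappa(human, judge))  # 0.75
```

In the toy data the raters agree on 7 of 8 items (0.875) while chance agreement is 0.5, which gives (0.875 - 0.5) / (1 - 0.5) = 0.75.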

Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score.

Dataset curation procedure reported: initial pool of 700 candidate repositories/PRs filtered by a Repository Quality Score to produce the final benchmark.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... data provenance / filtering (number of candidates filtered to final set)

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality.

Dataset construction described in paper: benchmark contains 350 pull requests with human annotations. Pull requests drawn from active open-source repositories and filtered from 700 candidates using a Repository Quality Score.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... benchmark size and availability (350 human-annotated PRs)

The paper concludes by articulating expected outcomes for management practice and proposes a research agenda calling for future mixed-methods validation of the framework.

Stated conclusion and explicit call for mixed-methods validation; no validation results provided in this paper.

high positive Behavioral Factors as Determinants of Successful Scaling of ... guidance for management practice and roadmap for empirical validation

The review derives constructs, hypothesized links among them, and governance implications for managing and institutionalizing workplace AI.

Paper reports that reviewed sources were used to derive constructs and governance implications; this is a conceptual derivation rather than empirical testing.

high positive Behavioral Factors as Determinants of Successful Scaling of ... set of constructs, hypothesized relationships, and governance recommendations

The framework and synthesis can be used to diagnose patterns of disengagement and pilot-to-production failure in corporate AI initiatives.

Proposed analytical structure derived from literature synthesis and conceptual mapping; intended as a diagnostic tool but not empirically validated within this paper.

high positive Behavioral Factors as Determinants of Successful Scaling of ... ability to diagnose disengagement and failure modes

The paper integrates adoption frameworks (TAM and TOE) with evidence on human-AI interaction to produce a scaling-oriented conceptual framework for diagnosing disengagement and pilot-to-production failures.

Comparative conceptual analysis and framework building based on reviewed literature; no new empirical validation reported.

high positive Behavioral Factors as Determinants of Successful Scaling of ... diagnostic capacity for identifying causes of disengagement and pilot-to-product...

Integrating technological, human, and organizational capabilities is important to maximize the benefits of AI in smart manufacturing.

Conclusion based on thematic patterns in interviews, observations, and document analysis from purposively sampled supply chain and production professionals; identified as an implementation implication.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... realization of AI benefits / implementation success

Firms adopting AI-driven forecasting and inventory strategies can achieve higher operational agility, better strategic resource alignment, and maintain a competitive advantage in dynamic manufacturing contexts.

Synthesis and implications drawn from thematic analysis of interviews, site visits, and documents from purposively sampled industry practitioners; presented as study conclusions rather than quantitatively tested outcomes.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... operational agility / strategic alignment / competitive advantage

AI supports sustainability initiatives within manufacturing operations.

Thematic analysis of practitioner interviews and organizational documentation where respondents linked AI-based forecasting/inventory optimization to sustainability outcomes (e.g., waste reduction).

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... sustainability outcomes (e.g., waste reduction)

AI improves supply chain coordination among partners and internal functions.

Interview and document-based thematic findings from purposively sampled supply chain managers and industry experts reporting enhanced coordination following AI adoption.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... supply chain coordination

AI contributes to operational resilience in manufacturing supply chains.

Qualitative evidence from interviews and organizational documents indicating that AI-enabled forecasting and inventory controls improve firms' ability to adapt to disruptions; thematic analysis produced resilience as a reported benefit.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... operational resilience

Organizational readiness, skilled personnel, data quality, and robust technological infrastructure are critical factors influencing AI effectiveness.

Recurring themes identified via thematic analysis of semi-structured interviews with supply chain and production professionals, corroborated by observational site visits and organizational documents from purposive sample.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... AI effectiveness (implementation success/performance)

AI reduces excess inventory levels in manufacturing firms.

Thematic findings from interviews, site visits, and documents from industry experts and practitioners who reported decreased excess inventory following AI-driven forecasting and inventory optimization.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... excess inventory levels

AI reduces stockouts in manufacturing supply chains.

Practitioner accounts and organizational document evidence from purposive qualitative sampling and thematic analysis indicating fewer stockouts associated with AI-driven forecasting and inventory controls.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... incidence of stockouts

AI adoption reduces operational inefficiencies in manufacturing processes.

Thematic analysis of qualitative data (semi-structured interviews, site observations, organizational documents) from purposively sampled industry practitioners reporting reductions in inefficiencies after AI implementation.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... operational inefficiencies

AI supports proactive decision-making among supply chain and production stakeholders.

Qualitative reports from interviews and document review with supply chain managers, production planners, and industry experts; thematic analysis identified proactive decision-making as a theme associated with AI use.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... proactivity of decision-making

AI enables adaptive inventory management in manufacturing operations.

Findings from thematic analysis of semi-structured interviews with supply chain managers, production planners, and industry experts, plus observational site visits and organizational documents (purposive sampling).

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... adaptive inventory management capability

AI technologies enhance forecasting accuracy in smart manufacturing.

Qualitative evidence from purposive sample of supply chain managers, production planners, and industry experts gathered via semi-structured interviews, observational site visits, and organizational documents; analyzed using thematic analysis.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... forecasting accuracy

Our dataset is available at https://guide-bench.github.io.

Paper's statement providing a URL for dataset access.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... dataset availability / accessibility

Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop).

Motivating claim in the paper's introduction/abstract, based on prior work and the authors' argument about potential application domains.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... potential for GUI agents to assist users

Providing user context significantly improved the performance, raising help prediction by up to 50.2pp.

Experimental comparison reported in the paper showing differences in Help Prediction performance with and without provided user context; reported improvement magnitude of up to 50.2 percentage points.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... improvement in help prediction accuracy when user context is provided
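The reported gain is in percentage points (pp), which is an absolute difference between two rates rather than a relative change. The sketch below shows the distinction. The baseline of 30% is a made-up figure chosen only for illustration, since the excerpt gives the gain but not the underlying accuracies, and both function names are hypothetical.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute change between two rates, in percentage points."""
    return (after - before) * 100.0


def relative_gain(before: float, after: float) -> float:
    """Relative change between two rates, as a fraction of the baseline."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (after - before) / before


# Hypothetical rates: 30.0% without user context, 80.2% with it.
baseline, with_context = 0.300, 0.802
print(f"{pp_gain(baseline, with_context):.1f} pp")
print(f"{relative_gain(baseline, with_context):.0%} relative")
```

The same 50.2 pp gain would correspond to a very different relative improvement from a different baseline, so the two figures should not be read interchangeably.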

GUIDE defines three tasks - (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction that test a model's ability to recognize behavior state, reason about goals, and decide when and how to help.

Paper's benchmark/task definitions describing three evaluation tasks and their goals.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... task definitions evaluating model capabilities (behavior detection, intent predi...

GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software.

Paper's dataset description, covering the screen recordings, number of demonstrations, total duration, participant expertise (novice), and think-aloud narrations across 10 software applications.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... dataset size and composition (hours, number of demonstrations, software covered)

Automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data.

Background claim in the paper, referring to advances in ASR and prior work suggesting utility for endangered-language transcription; stated as motivation rather than a novel empirical finding in this paper.

high positive Automatic Speech Recognition for Documenting Endangered Lang... utility/potential of ASR for endangered-language transcription

We train an ASR model that achieves a character error rate as low as 15%.

Reported quantitative evaluation of the trained ASR model on the constructed Ikema dataset (character error rate = 15%). Exact evaluation protocol, test set size, and train/test split not provided in the abstract.

high positive Automatic Speech Recognition for Documenting Endangered Lang... character error rate
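Character error rate is commonly defined as the character-level edit distance between the reference transcript and the ASR hypothesis, divided by the reference length. The sketch below follows that common definition. The abstract does not give the paper's exact evaluation protocol (for example, any text normalization), so this is an assumption, and the function names and sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy strings, not Ikema data: 3 substitutions in a 20-character reference.
reference = "abcdefghijklmnopqrst"
hypothesis = "abXdefghiYklmnopqrZt"
print(f"{character_error_rate(reference, hypothesis):.0%}")  # 15%
```

Under this definition, a 15% rate means roughly 3 character edits are needed per 20 reference characters. The rate can exceed 100% when the hypothesis is much longer than the reference.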

We construct a {\totaldatasethours}-hour speech corpus from field recordings.

Stated in paper as an outcome of the authors' data-collection and corpus-construction effort from field recordings; no numeric value resolved in the provided text (placeholder present).

high positive Automatic Speech Recognition for Documenting Endangered Lang... size of speech corpus (hours)

With calibrated oversight that aligns accountability to real-world risks, AI can secure the profession’s future.

Normative/prognostic claim in the Article (argument that appropriate governance will preserve or strengthen the legal profession).

high positive Rewired: Reconceptualizing Legal Services for the AI Age long-term resilience/stability of the legal profession

With calibrated oversight that aligns accountability to real-world risks, AI can improve service quality in legal services.

Normative/prognostic claim in the Article (argument that governance plus AI yields quality improvements). No empirical effect sizes reported in the excerpt.

high positive Rewired: Reconceptualizing Legal Services for the AI Age service quality of legal services

While the risks of AI are real, they must not eclipse the opportunity: with calibrated oversight that aligns accountability to real-world risks, AI can expand access to legal services.

Normative claim and projected benefit argued by the authors (theoretical/argumentative; no empirical evidence in excerpt).

high positive Rewired: Reconceptualizing Legal Services for the AI Age expansion of access to legal services

The framework provides a roadmap for coordinated response across educational institutions, government agencies, and industry to ensure workforce resilience and domestic leadership in the emerging agentic finance era.

Authors' proposed integrated roadmap (prescriptive recommendation; no empirical testing or outcome measurement reported in the provided text).

high positive STRENGTHENING FINANCIAL WORKFORCE COMPETITIVENESS: A CURRICU... workforce resilience and domestic leadership in agentic finance

We develop a comprehensive government policy framework including: 1) Federal AI literacy mandates for post-secondary business education; 2) Department of Labor workforce retraining programs with income support for displaced financial professionals; 3) SEC and Treasury regulatory innovations creating market incentives for workforce development; 4) State-level workforce partnerships implementing regional transition support; and 5) Enhanced social safety nets for workers navigating career transitions during the estimated 5-15 year transformation period.

Author-presented policy framework and recommendations (policy design proposals and an asserted 5–15 year transformation timeframe; no empirical evaluation reported).

high positive STRENGTHENING FINANCIAL WORKFORCE COMPETITIVENESS: A CURRICU... policy adoption and worker support measures during technological transition

We propose a multi-layered integration strategy for higher education encompassing: 1) Foundational AI literacy modules for all business students; 2) A specialized "Agentic Financial Planning" course with hands-on labs; 3) AI-augmented redesign of core courses (Investments, Portfolio Management, Ethics); 4) Interdisciplinary project-based learning with Computer Science; and 5) A governance and policy module addressing regulatory compliance (NIST AI RMF, SEC regulations).

Proposed curricular framework presented by the authors (recommendation/proposal, not empirically tested within the paper).

high positive STRENGTHENING FINANCIAL WORKFORCE COMPETITIVENESS: A CURRICU... student AI-related skills and preparedness for agentic finance roles

The ultimate competitive edge lies in an organization's ability to treat AI not as a standalone tool, but as a core component of sustainable, long-term corporate strategy.

Concluding normative claim in the paper; presented as an interpretation/synthesis rather than supported by cited empirical evidence in the abstract.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... competitive advantage derived from integrating AI into corporate strategy

Successful global expansion is no longer predicated solely on physical presence but on the deployment of scalable, localized AI models that navigate diverse regulatory, linguistic, and cultural landscapes.

Argumentative claim in the paper describing a strategic determinant for global expansion; no empirical sample or quantified outcomes presented in the abstract.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... drivers of successful global expansion (physical presence vs. localized AI deplo...

AI hyper-personalizes customer engagement.

Declarative claim in the paper about AI's effect on customer engagement personalization; no experimental or observational data reported in the abstract.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... degree of personalization in customer engagement

AI acts as an internal engine for operational agility by compressing R&D cycles.

Claim made in the paper asserting R&D cycle compression due to AI; no empirical data, sample size or quantitative measures provided in the abstract.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... length/duration of R&D cycles (time-to-iteration)

The strategic focus has transitioned from mere process automation to autonomous orchestration, where multi-agent systems independently manage complex, cross-border operations and real-time decision-making.

Analytic statement from the paper describing an observed/argued shift in strategic focus; no empirical methodology or sample reported.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... shift in strategic focus from automation to autonomous orchestration via multi-a...

Organizations leverage agentic workflows and domain-specific intelligence to catalyse strategic innovation and facilitate global expansion in the digital era.

Conceptual claim in the paper describing how organizations use specific AI capabilities; no empirical design or sample described in the abstract.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... use of agentic workflows and domain-specific models to drive innovation and glob...

The rapid evolution of Artificial Intelligence (AI) has shifted from a disruptive trend to the fundamental operating layer of the modern enterprise.

Statement/assertion in the paper (conceptual/positioning claim); no empirical method, sample size, or statistical analysis reported in the abstract.

high positive The AI Advantage: Strategic Innovation and Global Expansion ... role of AI in enterprise operations (from peripheral/disruptive to core/operatin...

Transparency’s effectiveness in promoting data-sharing is amplified by, and dependent upon, user trust; fostering trust in AI may be a more vital prerequisite for data-sharing than implementing transparent designs.

Synthesis of experimental findings (N=240): transparency increased willingness only among users with pre-existing trust; null effect of transparency alone on actual sharing; authors conclude that trust moderates transparency effects and recommend focusing on trust-building.

high positive Understanding Data-Sharing with AI Systems: The Roles of Tra... recommendation/policy implication regarding trust vs transparency for promoting ...

Immediate sharing decisions were largely driven by intuitive System 1 processing rather than deliberative evaluation (System 2).

Interpretation of the pattern in experimental data (N=240): high, similar sharing rates across conditions despite differing stated willingness-to-share and measured privacy concerns; authors attribute this to dual-process dynamics (System 1 driving immediate behavior).

high positive Understanding Data-Sharing with AI Systems: The Roles of Tra... dominance of intuitive (System 1) processing in immediate sharing behavior

The positive effect of transparency on willingness to share was contingent on pre-existing user trust in AI, particularly for white-box systems.

Moderation analyses reported from the experiment (N=240): interaction between transparency (white-box vs black-box) and measured pre-existing trust in AI showed increased willingness-to-share only among users with higher trust, with the effect most pronounced for white-box systems.

high positive Understanding Data-Sharing with AI Systems: The Roles of Tra... willingness to share (stated/deliberative sharing intention)

We conducted a pre-registered online experiment (N=240) where participants interacted with a fictional sleep-optimization app and were randomly assigned to scenarios where data was processed by either a human expert, a transparent white-box AI, or an opaque black-box AI.

Pre-registered online experimental design described in paper; random assignment to three processing-entity conditions (human, white-box AI, black-box AI); sample size reported as N=240; measured outcomes included actual data-sharing and willingness to share, plus trust and privacy concerns.

high positive Understanding Data-Sharing with AI Systems: The Roles of Tra... experimental manipulation / treatment assignment and measurement of sharing outc...

A Metacognitive Co-Regulation Agent (in CRDAL) assists the Design Agent in metacognition to mitigate design fixation, thereby improving system performance for engineering design tasks.

Mechanistic claim supported by the paper's experimental results on the battery pack design problem showing CRDAL outperforming SRL and RWL; detailed measures of fixation reduction not provided in the excerpt.

high positive Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regul... reduction in design fixation / improvement in performance due to co-regulation

The CRDAL system navigated through the latent design space more effectively than both SRL and RWL.

Empirical analysis on the battery pack design task comparing latent-space trajectories/exploration between CRDAL, SRL, and RWL; details on how 'more effectively' was quantified and sample size are not provided in the excerpt.

high positive Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regul... quality/coverage of exploration in latent design space

The CRDAL system achieves better design performance without significantly increasing the computational cost compared to SRL and RWL.

Empirical claim based on experiments on the battery pack design problem comparing computational cost across CRDAL, SRL, and RWL; exact computational metrics and sample size not provided in the excerpt.

high positive Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regul... computational cost (efficiency/resource usage) of design-generation process

In the battery pack design problem examined here, the CRDAL system generates designs with better performance compared to a plain Ralph Wiggum Loop (RWL) and the metacognitively self-assessing Self-Regulation Loop (SRL).

Empirical comparison on a battery pack design task between CRDAL, SRL, and RWL reported in the paper; exact number of test instances or runs not stated in the excerpt.

high positive Supervising Ralph Wiggum: Exploring a Metacognitive Co-Regul... design performance (battery pack designs)
