Evidence (3103 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	378	106	59	455	1007
Governance & Regulation	379	176	116	58	739
Research Productivity	240	96	34	294	668
Organizational Efficiency	370	82	63	35	553
Technology Adoption Rate	296	118	66	29	513
Firm Productivity	277	34	68	10	394
AI Safety & Ethics	117	177	44	24	364
Output Quality	244	61	23	26	354
Market Structure	107	123	85	14	334
Decision Quality	168	74	37	19	301
Fiscal & Macroeconomic	75	52	32	21	187
Employment Level	70	32	74	8	186
Skill Acquisition	89	32	39	9	169
Firm Revenue	96	34	22	—	152
Innovation Output	106	12	21	11	151
Consumer Welfare	70	30	37	7	144
Regulatory Compliance	52	61	13	3	129
Inequality Measures	24	68	31	4	127
Task Allocation	75	11	29	6	121
Training Effectiveness	55	12	12	16	96
Error Rate	42	48	6	—	96
Worker Satisfaction	45	32	11	6	94
Task Completion Time	78	5	4	2	89
Wages & Compensation	46	13	19	5	83
Team Performance	44	9	15	7	76
Hiring & Recruitment	39	4	6	3	52
Automation Exposure	18	17	9	5	50
Job Displacement	5	31	12	—	48
Social Protection	21	10	6	2	39
Developer Productivity	29	3	3	1	36
Worker Turnover	10	12	—	3	25
Skill Obsolescence	3	19	2	—	24
Creative Output	15	5	3	1	24
Labor Share of Income	10	4	9	—	23

Filter: Human-AI Collaboration

Historical precedents from past technological revolutions suggest that innovation tends to expand, rather than shrink, the scope of economic activity and employment in the long run.

Paper draws on analysis of economic history (qualitative historical analysis implied; no specific historical datasets or sample sizes provided in the abstract).

high positive AI Civilization and the Transformation of Work scope of economic activity and long-run employment levels

The paper studies principal-agent alignment using revealed preference techniques.

Stated methodological approach in the abstract; implies analytical use of revealed-preference methods for identification.

high positive A Revealed Preference Framework for AI Alignment methodological approach (use of revealed preference techniques to study alignmen...

The AI's alignment (similarity of human and AI preferences) can be generically identified in the field setting, where only AI choices are observed.

Analytical/theoretical identification result presented in the paper using revealed preference techniques (as stated in abstract); no empirical sample reported in the abstract.

high positive A Revealed Preference Framework for AI Alignment identifiability of AI alignment parameter from observed AI-only choices (field s...

The AI's alignment (similarity of human and AI preferences) can be generically identified in the laboratory setting, where both human and AI choices are observed.

Analytical/theoretical identification result presented in the paper using revealed preference techniques (as stated in abstract); no empirical sample reported in the abstract.

high positive A Revealed Preference Framework for AI Alignment identifiability of AI alignment parameter from observed human and AI choices (la...

The paper introduces the Luce Alignment Model, where the AI's choices are a mixture of two Luce rules, one reflecting the human's preferences and the other the AI's.

Paper proposes and defines a new theoretical model (model specification described in abstract).

high positive A Revealed Preference Framework for AI Alignment model specification of AI choice behavior (mixture of Luce rules)
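A minimal sketch of what a mixture of two Luce rules could look like, to make the model description above concrete. The entry only says that the AI's choices mix a Luce rule for the human's preferences with one for the AI's own. The mixing weight `alpha`, the function names, and the utility values below are illustrative assumptions, not the paper's notation or parameterization.

```python
def luce_probs(weights, menu):
    """Luce rule: each option is chosen with probability proportional to its positive weight."""
    total = sum(weights[x] for x in menu)
    return {x: weights[x] / total for x in menu}


def luce_alignment_probs(human_w, ai_w, alpha, menu):
    """Mix two Luce rules over the same menu.

    alpha is a hypothetical mixing weight in [0, 1]: alpha = 1 reproduces the
    human's Luce rule, alpha = 0 reproduces the AI's own Luce rule.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    human = luce_probs(human_w, menu)
    ai = luce_probs(ai_w, menu)
    return {x: alpha * human[x] + (1 - alpha) * ai[x] for x in menu}


# Made-up weights: the human favors "a", the AI favors "b".
human_w = {"a": 3.0, "b": 1.0}
ai_w = {"a": 1.0, "b": 3.0}
probs = luce_alignment_probs(human_w, ai_w, alpha=0.5, menu=["a", "b"])
print(probs)
```

With these mirrored weights and `alpha = 0.5`, the two options come out equally likely. Moving `alpha` toward 1 pulls the AI's choice probabilities toward the human's Luce rule.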

Human decision makers increasingly delegate choices to AI agents.

Stated as motivation in the abstract; no empirical data or sample described in the provided text.

high positive A Revealed Preference Framework for AI Alignment frequency of delegation of choices to AI agents

By formalizing the end-to-end transaction model together with its asset and incentive layers, EpochX reframes agentic AI as an organizational design problem focused on infrastructures where verifiable work leaves persistent, reusable artifacts and value flows support durable human-agent collaboration.

Theoretical framing and normative claim in the paper; no empirical evaluation demonstrating that this reframing yields measurable benefits.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... organizational framing and potential for durable human-agent collaboration

Credits lock task bounties, allow budget delegation, settle rewards upon acceptance, and compensate creators when verified assets are reused.

Functional description of the credit mechanics and settlement rules within the proposed EpochX marketplace; presented as part of system design without empirical settlement or user-behavior data.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... incentive flows, reward settlement, and compensation for asset reuse

EpochX introduces a native credit mechanism to make participation economically viable under real compute costs.

Proposed economic/incentive mechanism described in the paper; no empirical cost analysis, pricing model validation, or participant economic outcomes reported.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... economic viability of participation under compute costs

These assets are stored with explicit dependency structure, enabling retrieval, composition, and cumulative improvement over time.

Design-level assertion about data model/asset graph in the EpochX proposal; no empirical results demonstrating retrieval/composition or measured cumulative improvement.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... asset retrieval/composition and cumulative improvement

Each completed transaction can produce reusable ecosystem assets, including skills, workflows, execution traces, and distilled experience.

Architectural claim about artifacts produced per transaction in EpochX; described as a design goal rather than backed by empirical evidence or deployment data.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... creation of reusable assets (skills, workflows, traces, distilled experience)

Claimed tasks can be decomposed into subtasks and executed through an explicit delivery workflow with verification and acceptance.

Design description of the workflow and verification/acceptance mechanisms in the proposed EpochX architecture; no empirical testing or metrics reported.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... task execution workflow, verification and acceptance outcomes

EpochX treats humans and agents as peer participants who can post tasks or claim them.

Architectural/design specification in the paper describing participant roles and interactions; no empirical validation provided.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... task posting and claiming behavior / task allocation model

We introduce EpochX, a credits-native marketplace infrastructure for human-agent production networks.

System/design description in the paper (architectural proposal); no deployment, user study, or evaluation results reported.

high positive EpochX: Building the Infrastructure for an Emergent Agent Ci... marketplace infrastructure availability / adoption potential

Google has been pioneering machine learning usage across dozens of products.

Contextual statement in the abstract about the organization's activity; asserted without empirical detail in abstract.

high positive A Multi-agent AI System for Deep Learning Model Migration fr... extent of ML usage across Google products

The techniques and approaches described can be generalized for other framework migrations and general code transformation tasks.

Authors' stated expectation/generalization claim in the abstract; no empirical evidence or cross-framework experiments reported in the abstract.

high positive A Multi-agent AI System for Deep Learning Model Migration fr... generalizability to other framework migrations / code transformation tasks

The system creates a virtuous circle in which AI effectively supports its own development workflow.

Conceptual claim supported by the system's design and reported improvements that enable iterative AI-assisted development; described qualitatively in the paper.

high positive A Multi-agent AI System for Deep Learning Model Migration fr... self-supporting/iterative improvement of AI-assisted development workflow

Our approach dramatically reduces the time (6.4x-8x speedup) for deep learning model migrations.

Quantitative speedup figure reported in the paper's abstract (6.4x-8x); likely based on measured migration times on demonstrated cases, though the abstract does not state sample size or exact experimental setup.

high positive A Multi-agent AI System for Deep Learning Model Migration fr... time required to perform deep learning model migrations

The system accelerates code migrations in a large hyperscaler environment on commercial real-world use-cases.

Reported demonstration and evaluation in a hyperscaler (commercial) environment using real-world cases as described in the paper; no detailed sample size given in abstract.

high positive A Multi-agent AI System for Deep Learning Model Migration fr... speed of code migrations in commercial/hyperscaler environment

We define quality metrics and AI-based judges that accelerate development when the code to evaluate has no tests and has to adhere to strict style and dependency requirements.

Design and implementation of quality metrics and AI-based judges described in the paper; claimed acceleration of development workflow (no numeric quantification in abstract).

high positive A Multi-agent AI System for Deep Learning Model Migration fr... development speed / time to develop when evaluating untested code under strict s...

We built an AI-based multi-agent system to support automatic migration of TensorFlow-based deep learning models into JAX-based ones.

System implementation and description in the paper; demonstration on real-world code migration tasks in a hyperscaler environment (qualitative description in abstract).

high positive A Multi-agent AI System for Deep Learning Model Migration fr... existence and functioning of an AI-based migration system

The dataset, contexts, annotations, and evaluation harness are released publicly.

Paper states that dataset, contexts, annotations, and evaluation harness are released publicly (release / open-source claim).

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... public release / availability

A structured 2,000-token diff-with-summary prompt outperforms a 2,500-token full-context prompt (enriched with execution context, behaviour mapping, and test signatures) across all 8 models.

Direct prompt/context-size comparison across the 8 models on SWE-PRBench; reported that the 2,000-token diff-with-summary prompt yields better performance than the 2,500-token full-context prompt with extra enrichments.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... model detection/performance under specific prompt/context designs

The LLM-as-judge framework used for evaluation is validated at kappa = 0.75.

Inter-judge validation reported in paper (agreement metric kappa reported as 0.75). Specific validation sample size not stated in the excerpt.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... judge reliability / inter-annotator (or LLM-judge) agreement
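For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa between two raters. The excerpt does not say which kappa variant the paper uses or how many items were validated, so the choice of Cohen's kappa, the function name `cohens_kappa`, and the labels are assumptions for illustration only. The labels are made up and do not come from SWE-PRBench.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1 - expected)


# Made-up verdicts: 1 = "issue detected", 0 = "no issue".
llm_judge = [1, 1, 1, 1, 0, 0, 0, 0]
human = [1, 1, 1, 0, 0, 0, 0, 0]
kappa = cohens_kappa(llm_judge, human)
print(kappa)
```

In this toy example the raters agree on 7 of 8 items (0.875) while chance agreement is 0.5, giving a kappa of 0.75. The example was constructed to land on that value and says nothing about the paper's underlying data.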

Pull requests are drawn from active open-source repositories, filtered from 700 candidates using a Repository Quality Score.

Dataset curation procedure reported: initial pool of 700 candidate repositories/PRs filtered by a Repository Quality Score to produce the final benchmark.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... data provenance / filtering (number of candidates filtered to final set)

We introduce SWE-PRBench, a benchmark of 350 pull requests with human-annotated ground truth for evaluating AI code review quality.

Dataset construction described in paper: benchmark contains 350 pull requests with human annotations. Pull requests drawn from active open-source repositories and filtered from 700 candidates using a Repository Quality Score.

high positive SWE-PRBench: Benchmarking AI Code Review Quality Against Pul... benchmark size and availability (350 human-annotated PRs)

The paper concludes by articulating expected outcomes for management practice and proposes a research agenda calling for future mixed-methods validation of the framework.

Stated conclusion and explicit call for mixed-methods validation; no validation results provided in this paper.

high positive Behavioral Factors as Determinants of Successful Scaling of ... guidance for management practice and roadmap for empirical validation

The review derives constructs, hypothesized links among them, and governance implications for managing and institutionalizing workplace AI.

Paper reports that reviewed sources were used to derive constructs and governance implications; this is a conceptual derivation rather than empirical testing.

high positive Behavioral Factors as Determinants of Successful Scaling of ... set of constructs, hypothesized relationships, and governance recommendations

The framework and synthesis can be used to diagnose patterns of disengagement and pilot-to-production failure in corporate AI initiatives.

Proposed analytical structure derived from literature synthesis and conceptual mapping; intended as a diagnostic tool but not empirically validated within this paper.

high positive Behavioral Factors as Determinants of Successful Scaling of ... ability to diagnose disengagement and failure modes

The paper integrates adoption frameworks (TAM and TOE) with evidence on human-AI interaction to produce a scaling-oriented conceptual framework for diagnosing disengagement and pilot-to-production failures.

Comparative conceptual analysis and framework building based on reviewed literature; no new empirical validation reported.

high positive Behavioral Factors as Determinants of Successful Scaling of ... diagnostic capacity for identifying causes of disengagement and pilot-to-product...

Integrating technological, human, and organizational capabilities is important to maximize the benefits of AI in smart manufacturing.

Conclusion based on thematic patterns in interviews, observations, and document analysis from purposively sampled supply chain and production professionals; identified as an implementation implication.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... realization of AI benefits / implementation success

Firms adopting AI-driven forecasting and inventory strategies can achieve higher operational agility, better strategic resource alignment, and maintain a competitive advantage in dynamic manufacturing contexts.

Synthesis and implications drawn from thematic analysis of interviews, site visits, and documents from purposively sampled industry practitioners; presented as study conclusions rather than quantitatively tested outcomes.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... operational agility / strategic alignment / competitive advantage

AI supports sustainability initiatives within manufacturing operations.

Thematic analysis of practitioner interviews and organizational documentation where respondents linked AI-based forecasting/inventory optimization to sustainability outcomes (e.g., waste reduction).

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... sustainability outcomes (e.g., waste reduction)

AI improves supply chain coordination among partners and internal functions.

Interview and document-based thematic findings from purposively sampled supply chain managers and industry experts reporting enhanced coordination following AI adoption.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... supply chain coordination

AI contributes to operational resilience in manufacturing supply chains.

Qualitative evidence from interviews and organizational documents indicating that AI-enabled forecasting and inventory controls improve firms' ability to adapt to disruptions; thematic analysis produced resilience as a reported benefit.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... operational resilience

Organizational readiness, skilled personnel, data quality, and robust technological infrastructure are critical factors influencing AI effectiveness.

Recurring themes identified via thematic analysis of semi-structured interviews with supply chain and production professionals, corroborated by observational site visits and organizational documents from purposive sample.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... AI effectiveness (implementation success/performance)

AI reduces excess inventory levels in manufacturing firms.

Thematic findings from interviews, site visits, and documents from industry experts and practitioners who reported decreased excess inventory following AI-driven forecasting and inventory optimization.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... excess inventory levels

AI reduces stockouts in manufacturing supply chains.

Practitioner accounts and organizational document evidence from purposive qualitative sampling and thematic analysis indicating fewer stockouts associated with AI-driven forecasting and inventory controls.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... incidence of stockouts

AI adoption reduces operational inefficiencies in manufacturing processes.

Thematic analysis of qualitative data (semi-structured interviews, site observations, organizational documents) from purposively sampled industry practitioners reporting reductions in inefficiencies after AI implementation.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... operational inefficiencies

AI supports proactive decision-making among supply chain and production stakeholders.

Qualitative reports from interviews and document review with supply chain managers, production planners, and industry experts; thematic analysis identified proactive decision-making as a theme associated with AI use.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... proactivity of decision-making

AI enables adaptive inventory management in manufacturing operations.

Findings from thematic analysis of semi-structured interviews with supply chain managers, production planners, and industry experts, plus observational site visits and organizational documents (purposive sampling).

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... adaptive inventory management capability

AI technologies enhance forecasting accuracy in smart manufacturing.

Qualitative evidence from purposive sample of supply chain managers, production planners, and industry experts gathered via semi-structured interviews, observational site visits, and organizational documents; analyzed using thematic analysis.

high positive Assessing the Effectiveness of AI-Driven Techniques for Dema... forecasting accuracy

Our dataset is available at https://guide-bench.github.io.

Paper's statement providing a URL for dataset access.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... dataset availability / accessibility

Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop).

Motivating claim in the paper's introduction/abstract, based on prior work and the authors' argument about potential application domains.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... potential for GUI agents to assist users

Providing user context significantly improved performance, raising Help Prediction by up to 50.2 percentage points.

Experimental comparison reported in the paper showing differences in Help Prediction performance with and without provided user context; reported improvement magnitude of up to 50.2 percentage points.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... improvement in help prediction accuracy when user context is provided

GUIDE defines three tasks: (i) Behavior State Detection, (ii) Intent Prediction, and (iii) Help Prediction. Together they test a model's ability to recognize behavior state, reason about goals, and decide when and how to help.

Paper's benchmark/task definitions describing three evaluation tasks and their goals.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... task definitions evaluating model capabilities (behavior detection, intent predi...

GUIDE consists of 67.5 hours of screen recordings from 120 novice user demonstrations with think-aloud narrations, across 10 software applications.

Paper's dataset description: dataset construction of screen recordings, number of demonstrations, duration, participant expertise (novice), and inclusion of think-aloud narrations across 10 software applications.

high positive GUIDE: A Benchmark for Understanding and Assisting Users in ... dataset size and composition (hours, number of demonstrations, software covered)

Automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data.

Background claim in the paper, referring to advances in ASR and prior work suggesting utility for endangered-language transcription; stated as motivation rather than a novel empirical finding in this paper.

high positive Automatic Speech Recognition for Documenting Endangered Lang... utility/potential of ASR for endangered-language transcription

We train an ASR model that achieves a character error rate as low as 15%.

Reported quantitative evaluation of the trained ASR model on the constructed Ikema dataset (character error rate = 15%). Exact evaluation protocol, test set size, and train/test split not provided in the abstract.

high positive Automatic Speech Recognition for Documenting Endangered Lang... character error rate
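As a reference for the metric, character error rate is commonly computed as the character-level edit distance between the reference transcript and the ASR output, divided by the reference length. The sketch below follows that common definition. The abstract does not give the evaluation protocol, so any normalization the authors apply (whitespace, punctuation, script handling) is not reflected here, and the function names and strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Made-up strings: one substituted character in a 10-character reference.
cer = character_error_rate("abcdefghij", "abcdefghix")
print(cer)
```

Under this definition, a CER of 15% corresponds to roughly 15 character edits per 100 reference characters. Because insertions count as errors, CER can exceed 100% for very poor output.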

We construct a {\totaldatasethours}-hour speech corpus from field recordings.

Stated in paper as an outcome of the authors' data-collection and corpus-construction effort from field recordings; no numeric value resolved in the provided text (placeholder present).

high positive Automatic Speech Recognition for Documenting Endangered Lang... size of speech corpus (hours)
