Evidence (13661 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	740	192	95	871	1945
Governance & Regulation	796	388	185	119	1512
Organizational Efficiency	765	186	123	82	1166
Technology Adoption Rate	610	227	121	95	1061
Research Productivity	409	121	56	331	928
Output Quality	464	174	58	47	743
Decision Quality	318	173	75	42	615
Firm Productivity	432	55	88	20	601
AI Safety & Ethics	214	273	65	33	589
Market Structure	175	165	120	24	489
Task Allocation	206	64	70	31	376
Skill Acquisition	161	57	57	16	291
Innovation Output	201	27	41	18	288
Fiscal & Macroeconomic	130	69	43	26	275
Employment Level	104	50	105	13	274
Consumer Welfare	116	62	42	11	231
Firm Revenue	149	45	26	3	223
Inequality Measures	43	120	49	6	218
Task Completion Time	164	29	8	12	214
Worker Satisfaction	89	60	20	12	181
Error Rate	69	89	9	2	169
Regulatory Compliance	74	67	14	4	159
Training Effectiveness	91	19	13	19	144
Wages & Compensation	77	33	25	6	141
Team Performance	86	17	27	9	140
Automation Exposure	49	50	22	12	136
Developer Productivity	91	17	14	5	128
Job Displacement	12	80	19	1	112
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	16	7	2	57
Skill Obsolescence	5	43	6	1	55
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

Commonly reported gains include the automation of trivial and repetitive tasks.

Multiple studies in the review report that LLM-assistants automate mundane programming tasks.

high positive The Impact of LLM-Assistants on Software Developer Productiv... automation of low-complexity tasks / developer time freed

Commonly reported gains include reduced time spent searching for code, attributed to LLM assistance.

Synthesis of study findings noting reductions in developer time spent searching for code or answers.

high positive The Impact of LLM-Assistants on Software Developer Productiv... time/effort spent searching for code or information

Commonly reported gains from LLM-assistants include accelerated development (faster task completion).

Multiple included studies report faster development workflows and reduced time-to-complete tasks, as synthesized in the review.

high positive The Impact of LLM-Assistants on Software Developer Productiv... task completion time / development speed

The majority of reviewed studies report considerable benefits from LLM-assistants.

Synthesis of findings across the 39 included peer-reviewed studies as reported in the review.

high positive The Impact of LLM-Assistants on Software Developer Productiv... overall reported impact on developer productivity

Comparing the verbal-profile setting to a numeric-budget condition with confidentiality instructions cleanly isolates role coherence as distinct from instruction-following failure.

Experimental comparison between verbal-profile condition and numeric-budget condition with confidentiality instructions; result claimed to isolate mechanism (role coherence) from mere instruction-following failure.

high positive When Agents Shop for You: Role Coherence in AI-Mediated Mark... mechanism attribution (role coherence vs. instruction-following failure) for obs...

In an experiment where a language-model buyer agent shops on behalf of a verbal consumer profile, seller-side inference from dialogue alone recovers willingness to pay nearly one-for-one.

Reported experimental result using a language-model buyer agent interacting on behalf of a verbal consumer profile; experimental comparison described in paper excerpt (specific sample size and statistical details not provided in the excerpt).

high positive When Agents Shop for You: Role Coherence in AI-Mediated Mark... accuracy of seller-side inference of willingness to pay (recovery of WTP)

Consumers are increasingly delegating purchase decisions to AI agents, providing natural-language descriptions of their preferences and identity.

Asserted in paper's introduction/abstract as a background trend; no empirical sample or citation provided in the excerpt.

high positive When Agents Shop for You: Role Coherence in AI-Mediated Mark... use of AI agents for purchase delegation / prevalence of natural-language prefer...

Capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.

Recommendation based on the authors' empirical deployment and analysis of failure modes and mitigation effectiveness across the end-to-end pipeline.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... evaluation scope for capital-managing agents

Targeted harness changes increased capital deployment from 42.9% to 78.0% in an affected test population.

A/B or pre/post testing in an affected test population measuring percentage of capital deployed before and after harness changes.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... capital deployment rate

Targeted harness changes reduced fee-led observations from 32.5% to below 10% in an affected test population.

A/B or pre/post testing in an affected test population measuring incidence of fee-led observations before and after harness changes.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... incidence of fee-led observations

Targeted harness changes reduced fabricated sell rules from 57% to 3% in an affected test population.

A/B or pre/post testing in an affected test population measuring incidence of fabricated sell-rule observations before and after harness changes (percentage rates reported).

high positive Operating-Layer Controls for Onchain Language-Model Agents U... incidence of fabricated sell rules

Policy-valid submitted transactions settled with 99.9% success.

Settlement logs comparing policy-valid submitted transactions to successful onchain settlements.

high positive Operating-Layer Controls for Onchain Language-Model Agents U... settlement success rate for policy-valid submissions

Expert validation established strong relevance and practical utility for the framework, with a mean score of 4.6/5.

Structured validation exercise with five domain experts in AI ethics, corporate governance, and fintech regulation; paper reports the mean validation score as 4.6/5.

high positive Corporate-Governance-Driven Algorithmic Fairness in SME Fint... perceived relevance and practical utility of the framework (expert validation sc...

Analysis revealed four foundational governance pillars: Accountability, Transparency, Fairness, and Compliance.

Theme extraction from the SLR of 45 peer-reviewed publications (2022-2025) reported in the paper; these four pillars are presented as the core components of the proposed framework.

high positive Corporate-Governance-Driven Algorithmic Fairness in SME Fint... identification of governance pillars for algorithmic fairness

The study develops and validates an integrated conceptual framework that incorporates corporate governance principles with mechanisms for algorithmic fairness to foster ethical outcomes in SME fintech lending.

Two-phase research approach described in paper: (1) systematic literature review (45 peer-reviewed publications, 2022-2025) and (2) structured validation with five domain experts in AI ethics, corporate governance, and fintech regulation.

high positive Corporate-Governance-Driven Algorithmic Fairness in SME Fint... existence and validated relevance of an integrated governance-fairness framework

AI-driven credit assessment platforms promise greater efficiency in fintech lending.

Statement in paper (conceptual claim); supported by related literature cited in the SLR of 45 papers but no empirical efficiency metric reported in this paper.

high positive Corporate-Governance-Driven Algorithmic Fairness in SME Fint... efficiency of credit assessment processes

The rapid growth of fintech lending has reshaped financial access for SMEs through AI-driven credit assessment platforms.

Assertion in paper's background; positioned as established context for study (no specific empirical estimate given). The paper's SLR (45 peer-reviewed publications, 2022-2025) is presented as the literature basis for context.

high positive Corporate-Governance-Driven Algorithmic Fairness in SME Fint... financial access for SMEs

To a lesser extent, fears of AI automation drive demand for schemes that guarantee income regardless of employment status.

Findings from the 2024 OECD 'Risks that Matter' survey reported in the paper (survey-based measure of support for income-guarantee schemes conditional on fear of automation).

high positive AI, the Future of Work, and the Politics of the Welfare Stat... public support for income-guarantee schemes (e.g., universal basic income)

Rather than increasing support for traditional interventions such as unemployment benefits and training programs, these fears primarily drive demand for measures that preserve the social role of work and protect it from automation, such as robot taxes.

Results from the 2024 OECD 'Risks that Matter' public opinion survey analyzed in the paper (survey-based association between fear and policy preferences).

high positive AI, the Future of Work, and the Politics of the Welfare Stat... public support for policies that protect the social role of work (e.g., robot ta...

Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.

Author statement of intent to release code and assets upon paper acceptance.

high positive Benchmarking Complex Multimodal Document Processing Pipeline... open-source release of materials

The paper describes three reference architectures (ColPali, ColQwen2, and agentic complexity-based routing), which are not yet integrated end-to-end.

Author statement listing three proposed reference architectures and noting they are not yet integrated end-to-end.

high positive Benchmarking Complex Multimodal Document Processing Pipeline... proposed system architectures (descriptive)

Factual accuracy on stated claims is 85.5%.

Reported accuracy measurement on 'stated claims' in generated outputs from systems evaluated on EnterpriseDocBench. Details on annotator process and sample size not included in excerpt.

high positive Benchmarking Complex Multimodal Document Processing Pipeline... factual accuracy (fraction of stated claims judged factually correct)

Both hybrid and BM25 beat dense embedding (dense embedding nDCG@5 = 0.83).

Reported nDCG@5 values for three retrieval approaches on the benchmark (values quoted in paper).

high positive Benchmarking Complex Multimodal Document Processing Pipeline... retrieval relevance (nDCG@5)

Hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91).

Empirical evaluation on EnterpriseDocBench using nDCG@5 as reported metric in paper. Exact query count or folds not provided in the excerpt.

high positive Benchmarking Complex Multimodal Document Processing Pipeline... retrieval relevance (nDCG@5)
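
For readers comparing the nDCG@5 figures above, here is a minimal sketch of how the metric is commonly computed. It assumes graded relevance labels and the linear-gain variant with a log2 rank discount; the excerpt does not say which variant or which relevance labels EnterpriseDocBench uses. The function names and the sample labels are illustrative only, not the benchmark's code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ordering.

    Assumption: `relevances` holds the labels of all judged documents for the
    query, in ranked order, so sorting it gives the ideal ordering.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for one query (3 = highly relevant, 0 = irrelevant).
perfect_ranking = [3, 2, 1, 0, 0]
swapped_ranking = [2, 3, 1, 0, 0]
print(ndcg_at_k(perfect_ranking), ndcg_at_k(swapped_ranking))
```

A benchmark-level score such as 0.92 would then be the mean of this per-query value over all evaluation queries.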

We ran three pipelines through it: BM25, dense embedding, and a hybrid, all using the same GPT-5 generator.

Method statement describing experimental pipelines evaluated on EnterpriseDocBench (three retrieval variants combined with a shared generator).

high positive Benchmarking Complex Multimodal Document Processing Pipeline... evaluation of retrieval pipelines with a shared generator

The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot).

Author description of corpus composition (number of domains and pilot coverage). No document-count supplied in provided text.

high positive Benchmarking Complex Multimodal Document Processing Pipeline... dataset domain coverage

We built EnterpriseDocBench to evaluate parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on the same corpus.

Description of dataset/benchmark creation and stated design goals in paper (author-developed benchmark covering four stages).

high positive Benchmarking Complex Multimodal Document Processing Pipeline... system-level evaluation across parse/index/retrieve/generate stages

Most enterprise document AI today is a pipeline: parse, index, retrieve, generate.

Author assertion about prevailing architecture of enterprise document-AI systems (introductory observation in paper). No empirical sample size or systematic survey reported in text provided.

high positive Benchmarking Complex Multimodal Document Processing Pipeline... prevalence of pipeline architecture

We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.

Paper statement announcing release of code and dataset.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... availability of codebase and annotated dataset

The recommender achieved high relevance (MRR@1=0.75).

Reported offline/online recommender evaluation in the paper using Mean Reciprocal Rank at 1 (MRR@1) metric; presumably computed over recommendations in the study (711 conversations).

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... recommendation relevance (MRR@1)
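
Because MRR@1 only credits a relevant item at rank 1, it reduces to the share of queries whose top recommendation is relevant. The sketch below illustrates this under the assumption of binary relevance flags per ranked recommendation. The function name and the sample data are hypothetical and are not taken from the SecMate evaluation.

```python
def mrr_at_k(ranked_hits, k=1):
    """Mean reciprocal rank truncated at k.

    ranked_hits: one list per query, with True where the recommendation at
    that rank was judged relevant (hypothetical input format).
    """
    if not ranked_hits:
        return 0.0
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits[:k], start=1):
            if hit:
                total += 1.0 / rank  # only the first relevant item counts
                break
    return total / len(ranked_hits)


# Four hypothetical conversations; the top recommendation is relevant in three.
print(mrr_at_k([[True], [True], [True], [False]], k=1))  # 0.75
```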

Step-by-step guidance improved pleasantness and reduced user burden.

User-reported measures collected in the controlled study (likely subjective ratings across participants/conversations).

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... pleasantness (user satisfaction) and user burden

Adding device-level evidence raised correct resolutions from about 50% with an LLM-only baseline to over 90%.

Controlled study comparing SecMate with device-level diagnostic evidence to an LLM-only baseline; reported results across 144 participants / 711 conversations.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... correct resolutions (successful troubleshooting)

Service specificity is achieved through a proactive, context-aware recommender.

System description and recommender component evaluation in the paper.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... use of a proactive, context-aware recommender for service specificity

User specificity relies on implicit proficiency inference and profile-aware troubleshooting.

System design and algorithmic description in the paper explaining user-proficiency inference and profile-aware components.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... ability to infer user proficiency and use profiles for troubleshooting

Device specificity is provided by a lightweight local diagnostic utility.

System design and implementation details reported in the paper describing the diagnostic utility component.

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... presence and role of a local diagnostic utility for device specificity

We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals.

System description and architecture presented in the paper (design and implementation of SecMate).

high positive SecMate: Multi-Agent Adaptive Cybersecurity Troubleshooting ... system capability to integrate device, user, and service specificity

Make is most compelling for commodity utilities and for differentiating custom applications in the AI era.

Paper's typology and normative recommendation derived from conceptual analysis (no empirical validation reported).

high positive The Buy-or-Build Decision, Revisited: How Agentic AI Changes... relative attractiveness of in-house development (Make) across application catego...

AI fundamentally transforms the governance properties of the Make option, shifting it from Williamson's pure hierarchy to a hybrid governance form that combines code ownership with external AI infrastructure dependency.

Conceptual argument combining transaction cost economics, resource-based view, and assessment of AI infrastructure characteristics (no empirical testing reported).

high positive The Buy-or-Build Decision, Revisited: How Agentic AI Changes... governance form of in-house software development (Make)

The 'SaaSocalypse' narrative predicts that AI will render large segments of the Software-as-a-Service market obsolete by enabling firms to build software in-house at a fraction of historical cost.

Statement summarizing an extant narrative in industry and literature (paper cites/describes this narrative; no empirical test in the paper).

high positive The Buy-or-Build Decision, Revisited: How Agentic AI Changes... obsolescence of SaaS offerings / shift from buy to make

Advances in generative artificial intelligence, particularly agentic coding systems capable of autonomous software development, are disrupting the economics of the make-or-buy decision for enterprise applications.

Paper's conceptual analysis combining transaction cost economics, resource-based view, and assessment of current AI capabilities (no empirical sample reported).

high positive The Buy-or-Build Decision, Revisited: How Agentic AI Changes... economics of the make-or-buy decision for enterprise applications

Empirically, the decomposition confirms a speculative peak confined to December 1999–March 2000 in the dot-com episode.

Empirical application of the proposed decomposition and bubble test to historical asset price data covering the dot-com episode (data analysis reported in the paper).

high positive General-Purpose Technology and Speculative Bubble Detection timing and presence of a speculative peak during the dot-com episode

A fundamental-versus-speculative decomposition that projects prices onto observable technology proxies and applies the bubble test to the residual corrects for the contamination.

Methodological proposal described in the paper; presented as a corrective procedure (projection of prices on technology proxies and testing residuals).

high positive General-Purpose Technology and Speculative Bubble Detection ability to separate fundamental-driven price movements from speculative componen...
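
The projection step described above can be sketched as an ordinary least-squares regression of the price series on the technology proxies, with the residual kept as the candidate speculative component. This is a minimal illustration only: the excerpt does not name the estimator, the proxies, any price transform, or the specific bubble test, so OLS with an intercept, the function name, and the argument names are all assumptions.

```python
import numpy as np


def speculative_residual(price, tech_proxies):
    """Split a price series into a fitted 'fundamental' part and a residual.

    price:        array of shape (T,)
    tech_proxies: array of shape (T,) or (T, k) of observable proxies
    Assumption: plain OLS with an intercept stands in for the projection.
    """
    price = np.asarray(price, dtype=float)
    X = np.column_stack([np.ones(len(price)), np.asarray(tech_proxies, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, price, rcond=None)
    fundamental = X @ beta
    # The bubble test (unspecified in the excerpt) would be applied to this residual.
    return fundamental, price - fundamental
```

If the proxies explain the price path completely, the residual is zero and any explosiveness test on it has nothing to detect. That is the sense in which the projection is meant to remove technology-driven contamination.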

When a hump-shaped technology shock is embedded in the Campbell-Shiller present-value model, the fundamental price becomes locally explosive during adoption.

Analytical/theoretical proof derived from the modified Campbell-Shiller present-value model with a hump-shaped technology shock (model-based derivation).

high positive General-Purpose Technology and Speculative Bubble Detection explosiveness of the fundamental price (local explosiveness during adoption)

The framework produces a list of testable empirical questions that we leave as open problems.

Statement in the paper that it derives testable empirical questions from the theoretical framework; no empirical tests are executed in the paper itself.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... set of testable empirical research questions derived from the framework

The framework operationalizes aspects of earlier qualitative work on supervisory control (Sheridan, 1992), common ground (Clark & Brennan, 1991), and mixed-initiative interaction (Horvitz, 1999) within a single normative ratio.

Conceptual synthesis and mapping of prior qualitative literature into the new per-task leverage formalism presented in the paper; this is a theoretical linkage rather than empirical validation.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... conceptual operationalization of supervisory control/common ground/mixed-initiat...

The per-task ceiling does not bind the windowed measure, though both remain bounded: L_task by per-task novelty, L_window by the stock of accumulated planning investment that pays out within the window.

Theoretical derivation/argument in the paper distinguishing bounds on per-task leverage (L_task) and windowed leverage (L_window) and identifying their respective limiting factors; no empirical evidence provided.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... bounds on L_task and L_window (per-task novelty and accumulated planning investm...

We extend this per-task analysis to a windowed leverage measure that accommodates recurring tasks, spawned subtasks, and amortized system-design investment.

Conceptual/theoretical extension in the paper defining a windowed leverage metric and describing how it accounts for recurring tasks, subtasks, and amortized design investments; no empirical tests reported.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... windowed leverage (aggregated leverage over a time window accounting for amortiz...

The asymptotic behavior of leverage decomposes into two scaling axes (capability and memory) with a non-zero floor on the planning term set by irreducible task novelty bounded by human throughput.

Mathematical/theoretical asymptotic analysis within the paper; conceptual derivation linking capability and memory as scaling axes and asserting a lower bound on planning cost due to task novelty and human throughput.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... leverage scaling behavior and lower bound on planning term

Information density itself is directional and bounded by separate ceilings on human-to-agent and agent-to-human flow.

Theoretical argument/derivation in the paper establishing directional information-density and distinct upper bounds for each flow direction; no empirical validation reported.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... directional information flow bounds between human and agent

The denominator decomposes into three channels through which a conserved per-task information requirement must flow, each with its own time-cost scalar (specify the task, resolve mid-run interrupts, and review the result).

Analytic decomposition within the paper's theoretical framework; conceptual argument rather than empirical measurement.

high positive Leverage Laws: A Per-Task Framework for Human-Agent Collabor... components of human time cost (specification, interrupt resolution, review)
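
One possible reading of the three-channel denominator, written out as a worked formula. The notation is hypothetical and is not taken from the paper: I is the conserved per-task information requirement, I_k is the portion routed through channel k, and c_k is that channel's time-cost scalar (time per unit of information). The numerator of the leverage ratio is not described in this excerpt and is left out.

```latex
% Hypothetical notation for illustration only.
% spec = specify the task, int = resolve mid-run interrupts, rev = review the result
I = I_{\mathrm{spec}} + I_{\mathrm{int}} + I_{\mathrm{rev}},
\qquad
T_{\mathrm{human}}
  = c_{\mathrm{spec}}\, I_{\mathrm{spec}}
  + c_{\mathrm{int}}\, I_{\mathrm{int}}
  + c_{\mathrm{rev}}\, I_{\mathrm{rev}}
```

Under this reading, moving information between channels leaves I unchanged but changes the human time cost whenever the scalars differ.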
