Evidence (13661 claims)
Adoption: 8339 claims
Productivity: 7479 claims
Governance: 6715 claims
Human-AI Collaboration: 6267 claims
Org Design: 4098 claims
Innovation: 3987 claims
Labor Markets: 3488 claims
Skills & Training: 2888 claims
Inequality: 2016 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 740 | 192 | 95 | 871 | 1945 |
| Governance & Regulation | 796 | 388 | 185 | 119 | 1512 |
| Organizational Efficiency | 765 | 186 | 123 | 82 | 1166 |
| Technology Adoption Rate | 610 | 227 | 121 | 95 | 1061 |
| Research Productivity | 409 | 121 | 56 | 331 | 928 |
| Output Quality | 464 | 174 | 58 | 47 | 743 |
| Decision Quality | 318 | 173 | 75 | 42 | 615 |
| Firm Productivity | 432 | 55 | 88 | 20 | 601 |
| AI Safety & Ethics | 214 | 273 | 65 | 33 | 589 |
| Market Structure | 175 | 165 | 120 | 24 | 489 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 161 | 57 | 57 | 16 | 291 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Fiscal & Macroeconomic | 130 | 69 | 43 | 26 | 275 |
| Employment Level | 104 | 50 | 105 | 13 | 274 |
| Consumer Welfare | 116 | 62 | 42 | 11 | 231 |
| Firm Revenue | 149 | 45 | 26 | 3 | 223 |
| Inequality Measures | 43 | 120 | 49 | 6 | 218 |
| Task Completion Time | 164 | 29 | 8 | 12 | 214 |
| Worker Satisfaction | 89 | 60 | 20 | 12 | 181 |
| Error Rate | 69 | 89 | 9 | 2 | 169 |
| Regulatory Compliance | 74 | 67 | 14 | 4 | 159 |
| Training Effectiveness | 91 | 19 | 13 | 19 | 144 |
| Wages & Compensation | 77 | 33 | 25 | 6 | 141 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Automation Exposure | 49 | 50 | 22 | 12 | 136 |
| Developer Productivity | 91 | 17 | 14 | 5 | 128 |
| Job Displacement | 12 | 80 | 19 | 1 | 112 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Skill Obsolescence | 5 | 43 | 6 | 1 | 55 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
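The matrix can also be read as direction shares per outcome. The following is a minimal sketch using three rows copied from the table; the function names are illustrative, not part of the dashboard. Shares are taken over the four listed directions, because the Total column sometimes exceeds their sum (Firm Productivity lists 595 claims across the four directions against a total of 601).

```python
# Rows copied from the matrix: (positive, negative, mixed, null, total).
ROWS = {
    "Firm Productivity": (432, 55, 88, 20, 601),
    "Job Displacement": (12, 80, 19, 1, 112),
    "Inequality Measures": (43, 120, 49, 6, 218),
}


def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the four listed direction columns."""
    listed = positive + negative + mixed + null
    counts = {
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
    }
    return {name: count / listed for name, count in counts.items()}


def unlisted_claims(positive, negative, mixed, null, total):
    """Claims counted in Total but not in any of the four direction columns."""
    return total - (positive + negative + mixed + null)


for outcome, (pos, neg, mix, nul, total) in ROWS.items():
    shares = direction_shares(pos, neg, mix, nul)
    print(
        f"{outcome}: {shares['positive']:.1%} positive, "
        f"{shares['negative']:.1%} negative, "
        f"{unlisted_claims(pos, neg, mix, nul, total)} unlisted"
    )
```

Read this way, Job Displacement and Inequality Measures are dominated by negative findings, while Firm Productivity is dominated by positive ones.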
Commonly reported gains include the automation of trivial and repetitive tasks.
Multiple studies in the review report that LLM-assistants automate mundane programming tasks.
Commonly reported gains include less time spent searching for code, thanks to LLM assistance.
Synthesis of study findings noting reductions in developer time spent searching for code or answers.
Commonly reported gains from LLM-assistants include accelerated development (faster task completion).
Multiple included studies report faster development workflows and reduced time-to-complete tasks, as synthesized in the review.
The majority of reviewed studies report considerable benefits from LLM-assistants.
Synthesis of findings across the 39 included peer-reviewed studies as reported in the review.
Comparing the verbal-profile setting to a numeric-budget condition with confidentiality instructions cleanly isolates role coherence as distinct from instruction-following failure.
Experimental comparison between verbal-profile condition and numeric-budget condition with confidentiality instructions; result claimed to isolate mechanism (role coherence) from mere instruction-following failure.
In an experiment where a language-model buyer agent shops on behalf of a verbal consumer profile, seller-side inference from dialogue alone recovers willingness to pay nearly one-for-one.
Reported experimental result using a language-model buyer agent interacting on behalf of a verbal consumer profile; experimental comparison described in paper excerpt (specific sample size and statistical details not provided in the excerpt).
Consumers are increasingly delegating purchase decisions to AI agents, providing natural-language descriptions of their preferences and identity.
Asserted in paper's introduction/abstract as a background trend; no empirical sample or citation provided in the excerpt.
Capital-managing agents should be evaluated across the full path from user mandate to prompt, validated action, and settlement.
Recommendation based on the authors' empirical deployment and analysis of failure modes and mitigation effectiveness across the end-to-end pipeline.
Targeted harness changes increased capital deployment from 42.9% to 78.0% in an affected test population.
A/B or pre/post testing in an affected test population measuring percentage of capital deployed before and after harness changes.
Targeted harness changes reduced fee-led observations from 32.5% to below 10% in an affected test population.
A/B or pre/post testing in an affected test population measuring incidence of fee-led observations before and after harness changes.
Targeted harness changes reduced fabricated sell rules from 57% to 3% in an affected test population.
A/B or pre/post testing in an affected test population measuring incidence of fabricated sell-rule observations before and after harness changes (percentage rates reported).
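The before/after rates reported for the harness changes can be expressed both as percentage-point changes and as relative changes. The sketch below does only that arithmetic; the helper name is illustrative. Since the fee-led rate is reported only as "below 10%", the value 10.0 is used as an upper bound, so the computed reduction for that metric is a lower bound.

```python
def rate_change(before, after):
    """Return (percentage-point change, relative change) for two rates in percent."""
    points = after - before
    relative = points / before
    return points, relative


reported = {
    "capital deployment": (42.9, 78.0),
    "fabricated sell rules": (57.0, 3.0),
    # Reported as "below 10%": 10.0 is an upper bound, not the measured value.
    "fee-led observations (bound)": (32.5, 10.0),
}

for metric, (before, after) in reported.items():
    points, relative = rate_change(before, after)
    print(f"{metric}: {points:+.1f} points, {relative:+.1%} relative")
```

On these figures, capital deployment rose by 35.1 percentage points (about 82% in relative terms), and fabricated sell rules fell by 54 points (about 95% in relative terms).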
Policy-valid submitted transactions settled with 99.9% success.
Settlement logs comparing policy-valid submitted transactions to successful onchain settlements.
Expert validation established strong relevance and practical utility for the framework, with a mean score of 4.6/5.
Structured validation exercise with five domain experts in AI ethics, corporate governance, and fintech regulation; paper reports the mean validation score as 4.6/5.
Analysis revealed four foundational governance pillars: Accountability, Transparency, Fairness, and Compliance.
Theme extraction from the SLR of 45 peer-reviewed publications (2022-2025) reported in the paper; these four pillars are presented as the core components of the proposed framework.
The study develops and validates an integrated conceptual framework that incorporates corporate governance principles with mechanisms for algorithmic fairness to foster ethical outcomes in SME fintech lending.
Two-phase research approach described in paper: (1) systematic literature review (45 peer-reviewed publications, 2022-2025) and (2) structured validation with five domain experts in AI ethics, corporate governance, and fintech regulation.
AI-driven credit assessment platforms promise greater efficiency in fintech lending.
Statement in paper (conceptual claim); supported by related literature cited in the SLR of 45 papers but no empirical efficiency metric reported in this paper.
The rapid growth of fintech lending has reshaped financial access for SMEs through AI-driven credit assessment platforms.
Assertion in paper's background; positioned as established context for study (no specific empirical estimate given). The paper's SLR (45 peer-reviewed publications, 2022-2025) is presented as the literature basis for context.
To a lesser extent, fears of AI automation drive demand for schemes that guarantee income regardless of employment status.
Findings from the 2024 OECD 'Risks that Matter' survey reported in the paper (survey-based measure of support for income-guarantee schemes conditional on fear of automation).
Rather than increasing support for traditional interventions such as unemployment benefits and training programs, these fears primarily drive demand for measures that preserve the social role of work and protect it from automation, such as robot taxes.
Results from the 2024 OECD 'Risks that Matter' public opinion survey analyzed in the paper (survey-based association between fear and policy preferences).
Framework, metrics, baselines, and collection scripts will be released open-source on acceptance.
Author statement of intent to release code and assets upon paper acceptance.
The paper describes three reference architectures (ColPali, ColQwen2, and agentic complexity-based routing), which are not yet integrated end-to-end.
Author statement listing three proposed reference architectures and noting they are not yet integrated end-to-end.
Factual accuracy on stated claims is 85.5%.
Reported accuracy measurement on 'stated claims' in generated outputs from systems evaluated on EnterpriseDocBench. Details on annotator process and sample size not included in excerpt.
Both hybrid retrieval (nDCG@5 = 0.92) and BM25 (nDCG@5 = 0.91) beat dense embedding (nDCG@5 = 0.83).
Reported nDCG@5 values for three retrieval approaches on the benchmark (values quoted in paper).
Hybrid retrieval narrowly beats BM25 (nDCG@5 of 0.92 vs. 0.91).
Empirical evaluation on EnterpriseDocBench using nDCG@5 as reported metric in paper. Exact query count or folds not provided in the excerpt.
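For readers unfamiliar with the metric, nDCG@5 compares the discounted gain of the top five retrieved documents against the best possible ordering. The sketch below is a generic implementation, not the benchmark's own scoring code. It assumes linear gain, a log2 rank discount, and an ideal ordering built from the same judged list; how the paper grades relevance is not stated in the excerpt.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=5):
    """nDCG@k for one query; `relevances` lists grades in retrieved order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document was judged for this query
    return dcg_at_k(relevances, k) / ideal


def mean_ndcg(per_query_relevances, k=5):
    """Average nDCG@k across queries, as a benchmark-level score would be."""
    scores = [ndcg_at_k(rels, k) for rels in per_query_relevances]
    return sum(scores) / len(scores)
```

A call such as `mean_ndcg([[3, 2, 0, 1, 0], [0, 1, 0, 0, 0]])` averages per-query scores. A gap as small as 0.92 versus 0.91 may fall within query-level noise, which cannot be checked because the query count is not given in the excerpt.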
We ran three pipelines through it: BM25, dense embedding, and a hybrid, all using the same GPT-5 generator.
Method statement describing experimental pipelines evaluated on EnterpriseDocBench (three retrieval variants combined with a shared generator).
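The excerpt does not say how the hybrid pipeline combines BM25 and dense results. Reciprocal rank fusion (RRF) is one common way to merge ranked lists, and it is assumed here purely for illustration; the function name, the document IDs, and the constant `k=60` are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs by summing 1 / (k + rank) per list.

    Assumption: RRF stands in for whatever fusion the hybrid pipeline uses.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from the two retrievers for one query.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF uses only ranks, so it avoids calibrating BM25 scores against embedding similarities. A weighted score combination would be an equally plausible reading of "hybrid".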
The corpus is built from public, permissively licensed documents across six enterprise domains (five represented in the current pilot).
Author description of corpus composition (number of domains and pilot coverage). No document count is given in the provided text.
We built EnterpriseDocBench to evaluate parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness on the same corpus.
Description of dataset/benchmark creation and stated design goals in paper (author-developed benchmark covering four stages).
Most enterprise document AI today is a pipeline: parse, index, retrieve, generate.
Author assertion about prevailing architecture of enterprise document-AI systems (introductory observation in paper). No empirical sample size or systematic survey reported in text provided.
We release the full code base and a richly annotated dataset to support reproducible research on adaptive VCAs.
Paper statement announcing release of code and dataset.
The recommender achieved high relevance (MRR@1=0.75).
Reported offline/online recommender evaluation in the paper using the Mean Reciprocal Rank at 1 (MRR@1) metric; presumably computed over the recommendations made in the study (711 conversations).
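With a cutoff of 1, the reciprocal rank of a conversation is 1 when the top recommendation is relevant and 0 otherwise, so MRR@1 equals the share of conversations whose first recommendation was relevant. The sketch below is a generic implementation with made-up item names; it is not SecMate's evaluation code.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=1):
    """Mean reciprocal rank with cutoff k over paired rankings and relevant sets."""
    if not ranked_lists:
        raise ValueError("at least one ranked list is required")
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked[:k], start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant item counts
    return total / len(ranked_lists)


# Hypothetical recommendations for four conversations; three have a relevant top item.
recommended = [["vpn"], ["mfa"], ["backup"], ["firewall"]]
relevant = [{"vpn"}, {"mfa"}, {"backup"}, {"antivirus"}]
print(mrr_at_k(recommended, relevant, k=1))
```

Under that reading, the reported 0.75 means roughly three in four top recommendations were judged relevant.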
Step-by-step guidance improved pleasantness and reduced user burden.
User-reported measures collected in the controlled study (likely subjective ratings across participants/conversations).
Adding device-level evidence raised the correct-resolution rate from about 50% with an LLM-only baseline to over 90%.
Controlled study comparing SecMate with device-level diagnostic evidence to an LLM-only baseline; reported results across 144 participants / 711 conversations.
Service specificity is achieved through a proactive, context-aware recommender.
System description and recommender component evaluation in the paper.
User specificity relies on implicit proficiency inference and profile-aware troubleshooting.
System design and algorithmic description in the paper explaining user-proficiency inference and profile-aware components.
Device specificity is provided by a lightweight local diagnostic utility.
System design and implementation details reported in the paper describing the diagnostic utility component.
We present SecMate, a multi-agent VCA for cybersecurity troubleshooting that integrates device, user, and service specificity from conversational and device-level signals.
System description and architecture presented in the paper (design and implementation of SecMate).
The Make option (building software in-house) is most compelling for commodity utilities and for differentiating custom applications in the AI era.
Paper's typology and normative recommendation derived from conceptual analysis (no empirical validation reported).
AI fundamentally transforms the governance properties of the Make option, shifting it from Williamson's pure hierarchy to a hybrid governance form that combines code ownership with external AI infrastructure dependency.
Conceptual argument combining transaction cost economics, resource-based view, and assessment of AI infrastructure characteristics (no empirical testing reported).
The 'SaaSocalypse' narrative predicts that AI will render large segments of the Software-as-a-Service market obsolete by enabling firms to build software in-house at a fraction of historical cost.
Statement summarizing an extant narrative in industry and literature (paper cites/describes this narrative; no empirical test in the paper).
Advances in generative artificial intelligence, particularly agentic coding systems capable of autonomous software development, are disrupting the economics of the make-or-buy decision for enterprise applications.
Paper's conceptual analysis combining transaction cost economics, resource-based view, and assessment of current AI capabilities (no empirical sample reported).
Empirically, the decomposition confirms a speculative peak confined to December 1999–March 2000 in the dot-com episode.
Empirical application of the proposed decomposition and bubble test to historical asset price data covering the dot-com episode (data analysis reported in the paper).
A fundamental-versus-speculative decomposition that projects prices onto observable technology proxies and applies the bubble test to the residual corrects for contamination from locally explosive fundamentals.
Methodological proposal described in the paper; presented as a corrective procedure (projection of prices on technology proxies and testing residuals).
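The two-step procedure described above can be sketched in a few lines: regress log prices on a technology proxy, then test the residual for explosiveness. This is only an illustration under stated assumptions. It uses a single proxy instead of several, and a plain Dickey-Fuller regression statistic stands in for the paper's bubble test, which the excerpt does not name. Bubble tests of this family are right-tailed, so large positive values point toward explosive behavior; critical values are omitted here.

```python
def ols_fit(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def speculative_residual(log_price, tech_proxy):
    """Part of the log price not explained by the (single, assumed) proxy."""
    intercept, slope = ols_fit(tech_proxy, log_price)
    return [p - (intercept + slope * z) for p, z in zip(log_price, tech_proxy)]


def df_statistic(series):
    """t-statistic on rho in: delta u_t = rho * u_{t-1} + e_t (no constant).

    Stand-in for the unspecified bubble test; positive values suggest
    explosive dynamics, negative values suggest mean reversion.
    """
    lagged = series[:-1]
    delta = [nxt - cur for cur, nxt in zip(series[:-1], series[1:])]
    sxx = sum(u * u for u in lagged)
    rho = sum(u * d for u, d in zip(lagged, delta)) / sxx
    resid = [d - rho * u for u, d in zip(lagged, delta)]
    sigma2 = sum(e * e for e in resid) / (len(delta) - 1)
    return rho / (sigma2 / sxx) ** 0.5


# Toy residual that grows about 10% per step, with a small alternating disturbance.
explosive = [1.0]
for t in range(1, 30):
    explosive.append(1.1 * explosive[-1] + 0.01 * (-1) ** t)
print(df_statistic(explosive) > 0)
```

In practice the statistic would be computed on `speculative_residual(log_price, tech_proxy)` over rolling or expanding windows, so that an explosive episode can be dated.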
When a hump-shaped technology shock is embedded in the Campbell-Shiller present-value model, the fundamental price becomes locally explosive during adoption.
Analytical/theoretical proof derived from the modified Campbell-Shiller present-value model with a hump-shaped technology shock (model-based derivation).
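For reference, the standard Campbell-Shiller log-linear identity is written out below, followed by a special case. The first line is the textbook form. The second is an illustrative assumption, not the paper's specification: a constant expected return \bar{r} and expected dividend growth equal to a constant g plus a transitory technology term h_t.

```latex
% p_t: log price, d_t: log dividend, r_t: log return,
% \rho \in (0,1): linearization constant, \kappa: linearization intercept.
\begin{align*}
p_t - d_t &= \frac{\kappa}{1-\rho}
  + \mathbb{E}_t \sum_{j=0}^{\infty} \rho^{j}
    \left( \Delta d_{t+1+j} - r_{t+1+j} \right) \\
% Assumed special case: \mathbb{E}_t r_{t+1+j} = \bar{r} and
% \mathbb{E}_t \Delta d_{t+1+j} = g + \mathbb{E}_t h_{t+1+j}.
p_t &= d_t + \frac{\kappa + g - \bar{r}}{1-\rho}
  + \sum_{j=0}^{\infty} \rho^{j}\, \mathbb{E}_t h_{t+1+j}
\end{align*}
```

In this special case the fundamental price carries the discounted sum of expected future technology terms. While the hump in h_t still lies ahead, that sum and d_t can rise together, which is one way to see how fundamentals alone might produce a steep price path.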
The framework produces a list of testable empirical questions that we leave as open problems.
Statement in the paper that it derives testable empirical questions from the theoretical framework; no empirical tests are executed in the paper itself.
The framework operationalizes aspects of earlier qualitative work on supervisory control (Sheridan, 1992), common ground (Clark & Brennan, 1991), and mixed-initiative interaction (Horvitz, 1999) within a single normative ratio.
Conceptual synthesis and mapping of prior qualitative literature into the new per-task leverage formalism presented in the paper; this is a theoretical linkage rather than empirical validation.
The per-task ceiling does not bind the windowed measure, though both remain bounded: L_task by per-task novelty, L_window by the stock of accumulated planning investment that pays out within the window.
Theoretical derivation/argument in the paper distinguishing bounds on per-task leverage (L_task) and windowed leverage (L_window) and identifying their respective limiting factors; no empirical evidence provided.
We extend this per-task analysis to a windowed leverage measure that accommodates recurring tasks, spawned subtasks, and amortized system-design investment.
Conceptual/theoretical extension in the paper defining a windowed leverage metric and describing how it accounts for recurring tasks, subtasks, and amortized design investments; no empirical tests reported.
The asymptotic behavior of leverage decomposes into two scaling axes (capability and memory) with a non-zero floor on the planning term set by irreducible task novelty bounded by human throughput.
Mathematical/theoretical asymptotic analysis within the paper; conceptual derivation linking capability and memory as scaling axes and asserting a lower bound on planning cost due to task novelty and human throughput.
Information density itself is directional and bounded by separate ceilings on human-to-agent and agent-to-human flow.
Theoretical argument/derivation in the paper establishing directional information-density and distinct upper bounds for each flow direction; no empirical validation reported.
The denominator decomposes into three channels through which a conserved per-task information requirement must flow, each with its own time-cost scalar (specify the task, resolve mid-run interrupts, and review the result).
Analytic decomposition within the paper's theoretical framework; conceptual argument rather than empirical measurement.
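The three-channel denominator can be illustrated with a toy calculation. Everything concrete below is an assumption made for illustration: the excerpt does not define the numerator of the leverage ratio, so unaided human time is used as a stand-in, and the units (bits of information, seconds per bit) and all names are hypothetical.

```python
CHANNELS = ("specify", "interrupt", "review")


def human_seconds(bits_by_channel, seconds_per_bit):
    """Human time spent moving the task's information through the three channels.

    Each channel has its own time-cost scalar; the per-channel bits sum to the
    conserved per-task information requirement.
    """
    return sum(bits_by_channel[c] * seconds_per_bit[c] for c in CHANNELS)


def task_leverage(unaided_seconds, bits_by_channel, seconds_per_bit):
    """Hypothetical per-task leverage: unaided human time over human time with the agent."""
    return unaided_seconds / human_seconds(bits_by_channel, seconds_per_bit)


# Hypothetical task: 200 bits of required information split across the channels.
bits = {"specify": 100, "interrupt": 20, "review": 80}
cost = {"specify": 1.0, "interrupt": 2.0, "review": 0.5}  # seconds per bit

print(human_seconds(bits, cost))
print(task_leverage(3600, bits, cost))
```

Shifting information from an expensive channel (here, mid-run interrupts) to a cheaper one raises leverage without changing the total information requirement, which is the kind of trade-off the decomposition is meant to expose.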