Evidence (7953 claims)
| Category | Claims |
|---|---|
| Adoption | 5539 |
| Productivity | 4793 |
| Governance | 4333 |
| Human-AI Collaboration | 3326 |
| Labor Markets | 2657 |
| Innovation | 2510 |
| Org Design | 2469 |
| Skills & Training | 2017 |
| Inequality | 1378 |
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
Strict oversight requirements for GLAI could raise fixed compliance costs (audit, certification, human-in-the-loop processes), benefiting incumbent firms, raising barriers to entry, and potentially reducing competition.
Regulatory economics argument drawing on compliance-cost logic and market structure effects; no empirical entry-cost analysis or case studies.
Perception of increased legal risk and regulatory uncertainty may slow adoption of GLAI and redirect investment toward safer subfields (verification tools, retrieval-augmented systems, formal-reasoning hybrids).
Economic reasoning and market-design argumentation based on risk/uncertainty dynamics; no econometric or survey data presented.
Divergent regulatory regimes (e.g., strict EU rules vs. looser regimes elsewhere) may produce regulatory arbitrage, influencing where GLAI companies locate, invest, and trade internationally.
Cross-jurisdictional regulatory analysis and economic inference about firm behavior under differential regulation; no firm-level relocation data provided.
The positive macroeconomic effects of AI are severely limited by structural issues, notably large petroleum import volumes and the fiscal burden of incomplete fuel subsidy reforms.
Integrated quantitative analysis showing that operational savings are outweighed by import volumes and subsidy fiscal costs; contextual fiscal data cited (fuel subsidy reform peak).
Evaluations that measure outcomes only via official-language channels risk underestimating impacts where vernacular mediation is central.
Argument based on the discrepancy between vernacular-mediated comprehension/adoption observed in the sample and the likely invisibility of those effects in official-language measurement channels; supported by questionnaire and qualitative data.
DPPs raise privacy and surveillance risks if personal data are linked to product use; economic regulation should incentivize privacy-preserving analytics (e.g., federated learning, differential privacy) and data minimality to maintain trust.
Risk assessment and governance recommendation grounded in stakeholder concerns and standard privacy literature; not empirically measured in the surveys.
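The privacy-preserving analytics recommended here can be illustrated with the Laplace mechanism from differential privacy. A minimal numpy sketch, with the query, epsilon value, and usage data as illustrative assumptions rather than anything measured in the surveys:

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a count query under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so Laplace noise with scale
    1/epsilon gives epsilon-DP.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
usage_minutes = [12, 45, 3, 60, 27, 8, 51]   # per-user product usage (illustrative)
noisy = laplace_count(usage_minutes, lambda m: m > 30, epsilon=1.0, rng=rng)
print(round(noisy, 2))                        # true count of heavy users is 3
```

Smaller epsilon adds more noise and stronger privacy; the released statistic never exposes any single user's product-use record exactly.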
Identified concrete training gaps in current models: delegation, scoped execution, and mode switching are skills absent from current training data and limit splitting models into manager/worker roles.
Authors' diagnosis based on experimental outcomes and qualitative reasoning about model training distributions; recommendation for future training focus.
Automated benchmarks dominate the evaluation of large language models, yet no systematic study has compared user satisfaction, adoption motivations, and frustrations across competing platforms using a consistent instrument.
Statement of the paper's motivation/background; implied literature review and identification of an empirical gap (no systematic, cross-platform user survey reported prior).
Interpretive, ad-hoc human-centered evaluation practices (e.g., “vibe checks”, team sense-making) are rational adaptations to LLM behavior rather than merely sloppy or inferior methodological choices.
Authors' interpretive argument based on interview evidence where practitioners explained why such practices persist and how they serve sense-making for unpredictable model behavior.
The possibility of strategic argument construction (gaming) motivates governance needs: standards for provenance, certification, and liability rules.
Policy recommendation based on anticipated incentive problems; no empirical governance evaluations.
Standard GDP statistics can mask AI-driven demand shortfalls; central banks and statistical agencies should therefore monitor labor-share–velocity links, distributional income measures, and consumption by income quantile in addition to headline GDP.
Theoretical Ghost GDP channel and calibration results showing divergence between measured GDP and consumption-relevant income; policy recommendation follows from those model results.
Health technology assessment (HTA) frameworks should be adapted to evaluate models trained on synthetic or hybrid data, incorporating metrics for fidelity, domain generalization, and economic impact (cost-effectiveness, budget impact, distributional effects).
Recommendation from the review synthesizing HTA literature and gaps identified when applying existing HTA to AI models trained on non-traditional data sources; based on policy analysis rather than empirical HTA trials of synthetic-data models.
Technical fixes alone are insufficient: governance, validation pipelines (e.g., health technology assessment), and capacity building are needed for safe, effective uptake of synthetic-data–trained AI.
Cross-disciplinary synthesis of governance analyses, health technology assessment literature, and implementation studies in the review arguing for combined technical and institutional interventions; recommendation-based evidence rather than new empirical trials.
AI changes the nature of capital (digital/algorithmic assets) and complicates productivity accounting; researchers should decompose firm-level productivity gains into AI technology, complementary organizational capital, and human capital effects.
Theoretical proposal grounded in productivity accounting literature and conceptual discussion; no single decomposition empirical result presented.
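The proposed decomposition can be illustrated with a simple growth-accounting identity, where output growth is split into factor contributions plus a residual. All shares and growth rates below are hypothetical, not estimates from any study:

```python
# Illustrative growth-accounting decomposition of a firm's output growth
# into AI capital, complementary organizational capital, and human capital
# contributions (share * factor growth), with the remainder as a TFP-style residual.
shares = {"ai_capital": 0.10, "org_capital": 0.15, "human_capital": 0.40}
growth = {"ai_capital": 0.30, "org_capital": 0.08, "human_capital": 0.02}
output_growth = 0.06

contributions = {k: shares[k] * growth[k] for k in shares}
residual_tfp = output_growth - sum(contributions.values())
for factor, c in contributions.items():
    print(f"{factor}: {c:.4f}")
print(f"residual (TFP-style) growth: {residual_tfp:.4f}")
```

In this toy example fast-growing AI capital accounts for half of measured output growth, which is the kind of attribution the claim argues researchers should make explicit.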
Conventional productivity statistics and standard evaluation methods may undercount benefits from conversational initiation assistance; new survey and administrative measures might be needed.
Policy and measurement recommendation based on the conceptual model; no empirical measurement validation provided.
Policy and governance issues become salient: liability, IP, security, and certification of AI-generated code require new standards for provenance, testing, and accountability.
Argument based on practitioner-raised concerns about security, IP, and provenance in the Netlight study; authors recommend policy attention; no legal/regulatory analysis or empirical policy evaluation provided.
Time-series metrics (e.g., derivatives like d/dt(student enrollment)) are useful monitoring signals for validation and system oversight.
Methodological suggestion in the paper proposing time-series analysis of enrollment and other administrative data; no empirical demonstration or threshold criteria provided.
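The suggested monitoring signal can be sketched as a finite-difference derivative over administrative enrollment counts; the series and alert threshold below are illustrative assumptions, since the paper provides no threshold criteria:

```python
import numpy as np

def enrollment_alerts(enrollment, threshold):
    """Flag periods where the finite-difference d/dt(enrollment)
    exceeds a monitoring threshold in absolute value.

    np.gradient uses centered differences in the interior and
    one-sided differences at the edges of the series.
    """
    rate = np.gradient(np.asarray(enrollment, dtype=float))
    return [i for i, r in enumerate(rate) if abs(r) > threshold], rate

# Illustrative monthly enrollment with a sudden drop in month 5
enrollment = [1000, 1010, 1025, 1030, 1035, 900, 905, 910]
alerts, rate = enrollment_alerts(enrollment, threshold=50)
print(alerts)  # → [4, 5]
```

The flagged indices bracket the structural break, which is the kind of oversight signal the claim has in mind.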
Rehabilitation was the most common research area with 336 publications (~18.35%), followed by Pediatrics (reported as 1,387 publications in the text, a figure inconsistent with Rehabilitation ranking first).
Results: WoS research area counts provided in the paper (listed values for Rehabilitation and Pediatrics).
Five interaction mechanisms were identified, with the majority propagating across the subsystem boundary.
Authors' thematic analysis and STS mapping identifying five cross- or within-subsystem interaction mechanisms; qualitative assessment that most propagate across subsystem boundary.
Undercontrolled workers exhibited minimal effects despite engaging with the frameworks.
Reported experimental observation: the undercontrolled cluster showed little to no measurable benefit from any of the interventions, despite engagement with the coaching frameworks.
The operative risk for legislators is not stable ideological bias in LLMs but contextual ignorance shaped by training data coverage.
Authors argue from observed model behavior on the 15 proposals (good performance on well-covered standardized templates; failures on idiosyncratic items) and interpret this as evidence that errors are driven by training-data coverage rather than consistent ideological bias.
Most action tools support medium-stakes tasks like editing files.
Classification of action tools by task consequentiality using O*NET mapping and inspection of tool functions (paper states majority are medium-stakes, e.g., file editing).
Mobile penetration reaches 84% in low-income countries, a statistic used to motivate RSI's potential reach.
Single numeric statistic reported in the paper as background context; source or empirical basis for the statistic not provided within the supplied text.
Many AI-assisted decision systems operate in competitive settings (e.g., admission or hiring) where only a fraction of candidates can succeed.
Authors' characterization of real-world contexts motivating the study (literature-based/contextual claim within the paper).
The authors assess system performance on JobSearch-XS across retrieval tasks.
Paper states that system performance is assessed on JobSearch-XS across retrieval tasks. The excerpt does not provide the tasks, metrics, sample sizes, or numerical results.
International shipping produces approximately 3% of global greenhouse gas emissions.
Contextual statement in the paper citing external estimates (specific source not provided in the excerpt).
Output quality saturates at approximately seven governed memories per entity.
Empirical analysis reported in the controlled experiments showing output quality vs. number of governed memories per entity, with saturation near seven memories.
The risk of endogeneity was avoided by using an instrumental approach to obtain causal estimates of the impact of technological diffusion on market opportunities.
Paper reports use of an instrumental variables approach to address endogeneity (instruments and diagnostics not described in the excerpt).
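The instrumental-variables strategy reported here can be sketched as manual two-stage least squares; the simulated diffusion, instrument, and outcome variables below are assumptions for illustration, not the paper's data or instruments:

```python
import numpy as np

def two_stage_least_squares(y, x_endog, z):
    """Manual 2SLS: regress the endogenous regressor on the instrument,
    then regress the outcome on the fitted (exogenous) values."""
    n = len(y)
    # Stage 1: x = a0 + a1*z + v
    Z = np.column_stack([np.ones(n), z])
    a, *_ = np.linalg.lstsq(Z, x_endog, rcond=None)
    x_hat = Z @ a
    # Stage 2: y = b0 + b1*x_hat + e, where b1 is the causal estimate
    X = np.column_stack([np.ones(n), x_hat])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[1]

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                  # instrument (e.g., an exogenous infrastructure shock)
u = rng.normal(size=n)                  # unobserved confounder
diffusion = 0.8 * z + u + rng.normal(size=n)
opportunity = 2.0 * diffusion + 3.0 * u + rng.normal(size=n)  # true effect = 2.0
print(round(two_stage_least_squares(opportunity, diffusion, z), 2))
```

A naive OLS of `opportunity` on `diffusion` would be biased upward by the confounder `u`; the instrumented estimate recovers a value near the true effect of 2.0.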
CAFTA spillovers stabilized import volumes from third countries (reduced volatility) for Chinese agricultural imports.
Analysis of import volume volatility metrics over 2000–2014 using customs data within DID framework; volatility/variance decline identified as an outcome in the mechanisms/secondary channel tests.
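The volatility outcome can be illustrated as a difference-in-differences on dispersion: the change in (log) import-volume variance for treated flows minus the same change for control flows. All series below are simulated assumptions, not CAFTA customs data:

```python
import numpy as np

def did_volatility(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on log variance: the change in (log)
    import-volume volatility for treated flows minus that for controls."""
    delta_treated = np.log(np.var(treated_post)) - np.log(np.var(treated_pre))
    delta_control = np.log(np.var(control_post)) - np.log(np.var(control_pre))
    return delta_treated - delta_control

rng = np.random.default_rng(2)
# Simulated annual import volumes: treated flows become less volatile post-treatment
treated_pre = 100 + rng.normal(0, 10, size=200)
treated_post = 105 + rng.normal(0, 5, size=200)
control_pre = 100 + rng.normal(0, 10, size=200)
control_post = 102 + rng.normal(0, 10, size=200)
print(round(did_volatility(treated_pre, treated_post, control_pre, control_post), 2))
```

A negative estimate indicates a treatment-associated decline in volatility beyond the control trend, matching the stabilization finding in the claim.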
The report provides scenario-based forecasts for HACCA emergence across near-, mid-, and long-term timelines, identifying capability thresholds to monitor.
Capability trajectory assessment combining trends in AI capabilities, automation of software tasks, computation availability, and diffusion dynamics; scenario and expert-judgment approach (qualitative forecasting).
An interpretable logistic-regression model, calibrated with isotonic regression, produces well-calibrated, individual-level attrition probabilities suitable for policy simulation.
Modeling pipeline: logistic regression for prediction, isotonic regression for calibration; authors report strong predictive performance and well-calibrated probabilities (specific performance metrics not included in the provided summary).
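The described pipeline, logistic regression for scores followed by isotonic calibration, can be sketched in plain numpy; the synthetic attrition features and the pool-adjacent-violators implementation below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Logistic regression by batch gradient descent; returns weights incl. intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def isotonic_fit(scores, y):
    """Pool-adjacent-violators: a non-decreasing step map from scores to probabilities."""
    order = np.argsort(scores)
    v = list(y[order].astype(float))
    w = [1.0] * len(v)
    i = 0
    while i < len(v) - 1:
        if v[i] > v[i + 1]:  # monotonicity violation: merge adjacent blocks
            v[i] = (v[i] * w[i] + v[i + 1] * w[i + 1]) / (w[i] + w[i + 1])
            w[i] += w[i + 1]
            del v[i + 1]
            del w[i + 1]
            i = max(i - 1, 0)
        else:
            i += 1
    return np.array(v), np.array(w)

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))                       # e.g., tenure and workload features
true_p = 1 / (1 + np.exp(-(X[:, 0] - 0.5)))
y = (rng.random(1000) < true_p).astype(float)
w = fit_logistic(X, y)
scores = 1 / (1 + np.exp(-(np.column_stack([np.ones(1000), X]) @ w)))
probs, block_sizes = isotonic_fit(scores, y)
print(len(probs), "monotone probability blocks")
```

The calibrated block values are monotone in the score, so each individual's raw score maps to a well-ordered attrition probability usable in policy simulation.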
A Sankey diagram of thematic evolution shows lexical convergence over time and indicates that a small set of authors has disproportionate influence in structuring the discourse.
Thematic evolution analysis visualized with a Sankey diagram; author influence inferred from performance trends (citations/publication counts) in the bibliometric data.
CID does not significantly mediate the relationship between SCD and strategic green innovation.
Mediation tests showing that while CID is related to substantive innovation, the indirect effect via CID on strategic green innovation was statistically insignificant.
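The mediation test described here corresponds to bootstrapping the indirect effect a*b, where a is the SCD-to-CID path and b is the CID-to-innovation path controlling for SCD. The simulated data below are illustrative assumptions, with the b path set to zero to mirror an insignificant indirect effect:

```python
import numpy as np

def indirect_effect_ci(x, m, y, n_boot=2000, seed=0):
    """Bootstrap percentile CI for the indirect effect a*b:
    a = slope of mediator m on x; b = slope of y on m, controlling for x."""
    rng = np.random.default_rng(seed)
    n = len(x)
    effects = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        xb, mb, yb = x[idx], m[idx], y[idx]
        a = np.polyfit(xb, mb, 1)[0]
        Z = np.column_stack([np.ones(n), mb, xb])
        b = np.linalg.lstsq(Z, yb, rcond=None)[0][1]
        effects[i] = a * b
    return np.percentile(effects, [2.5, 97.5])

rng = np.random.default_rng(4)
n = 500
scd = rng.normal(size=n)                                 # illustrative SCD measure
cid = 0.5 * scd + rng.normal(size=n)                     # mediator: a path = 0.5
strategic = 0.4 * scd + 0.0 * cid + rng.normal(size=n)   # b path = 0 (no mediation)
lo, hi = indirect_effect_ci(scd, cid, strategic)
print(f"95% bootstrap CI for a*b: [{lo:.3f}, {hi:.3f}]")
```

A confidence interval covering zero corresponds to the statistically insignificant indirect effect reported in the claim.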
This paper is one of the first systematic reviews focused specifically on NLP in bank marketing, organizing findings along the customer journey and the marketing mix to provide a practical taxonomy.
Authors' stated novelty claim based on the scoped literature search (2014–2024) and topical focus; novelty inferred from the small number of prior papers identified at the intersection.
There is a need to develop new trade statistics that capture AI‑enabled services and platform‑mediated cross‑border transactions.
Methodological gap identified across reviewed literature and statistical analyses; recommendation based on descriptive assessment (no development of such statistics in the paper).
Productivity gains from AI may be under- or mis-measured if national accounts and tax systems do not adjust for AI-driven quality changes in services.
Analytic observation in the paper's measurement and externalities discussion; not empirically tested within the study.
Distributed agency (Problem C) complicates classical principal–agent models; economists should develop models that capture multiple, overlapping agents and ambiguous attribution of outcomes.
Conceptual implication for economic modeling derived from the paper’s diagnosis of distributed agency; recommendation for formal modeling and simulations but none provided.
The paper documents production failure vignettes and operational lessons drawn from a real enterprise deployment integrated with a major cloud provider's MCP servers (client redacted).
Paper states empirical context is field lessons from an enterprise agent platform; failure vignettes are enumerated as deliverables.
ToM alignment matters less (i.e., misalignment has smaller effect) in settings with explicit coordination protocols, strong signaling, or standardized conventions.
Analyses and experiments described in the paper showing smaller performance differences between matched and mismatched ToM orders when explicit conventions or reliable signals are available; reported as part of robustness/conditional analyses.
Manipulating costs and benefits of observation versus action in experiments can probe the switching behavior driven by System M.
Proposed experimental manipulation; no empirical data presented.
Ablation studies disabling System M or decoupling Systems A and B will help test whether meta-control provides empirical benefits.
Suggested experimental design (ablation study) in the methods section; no results provided.
The authors will publicly release the benchmark, code, and pre-trained models.
Statement in the paper (release/availability section) announcing plans to publish benchmark, code, and pre-trained models.
Expert (per-expert) sizes and overall design are positioned between the GPT-OSS and Qwen3 MoE designs.
Architectural comparison asserted in the paper; claim is based on relative model-design choices (expert count/size) compared to public descriptions of GPT-OSS and Qwen3. The summary provides the positioning but not detailed layer-by-layer comparisons.
An orchestrator coordinates components with intent-aware routing and layered safety checks, enabling multi-step workflows and productized services.
Paper describes an agentic tool-calling framework and multi-layer orchestrator used for intent-aware routing, defense-in-depth safety validation, and multi-step workflows.
Aura is a long-form ASR system capable of handling hours-long audio.
Paper lists Aura in the product stack as 'long-form ASR handling hours-long audio.' Specific evaluation metrics or training data for ASR are not provided in the summary.
Arabic content comprises only about 0.5% of web data despite roughly 400 million native speakers.
Paper cites this data-point to motivate intentional data strategies for Arabic underrepresentation on the web; exact source of the web-proportion not specified in the summary.
Methods among the surveyed systems span token-level code generation to circuit-structure generation, and evaluation metrics are often task- and artifact-specific.
Surveyed system descriptions show diversity in generative approaches (token-level language models, graph/diffusion-based circuit generators, agentic optimizers) and corresponding tailored metrics noted in the review.
Overall employment in Albania has not fallen sharply; instead, changes are concentrated within occupational groups (i.e., occupational restructuring).
Official labor market statistics analyzed descriptively over the recent period, complemented by business survey and case-study evidence of within-occupation shifts. No causal identification; sample details not provided.
AI adoption in Albania is driving occupational restructuring rather than producing large net job losses.
Descriptive analysis of official labor market statistics, business surveys, and selected firm case studies comparing employment levels and occupational composition over the recent period; study notes limited causal identification. Sample size not specified in summary.
The study is the first empirical investigation of human–AI assistance in a live CTF setting with a direct comparison to autonomous AI agents on the same fresh challenges.
Authors' positioning of their work as novel; methodology involved a live onsite CTF, instrumentation of human–AI interactions (41 participants), and direct benchmarking of four autonomous agents on the same fresh challenge set.