Evidence (14055 claims)
Adoption (8570 claims)
Productivity (7631 claims)
Governance (6869 claims)
Human-AI Collaboration (6491 claims)
Org Design (4175 claims)
Innovation (4114 claims)
Labor Markets (3566 claims)
Skills & Training (2966 claims)
Inequality (2066 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 758 | 199 | 100 | 900 | 2007 |
| Governance & Regulation | 826 | 400 | 191 | 122 | 1563 |
| Organizational Efficiency | 777 | 193 | 124 | 84 | 1189 |
| Technology Adoption Rate | 635 | 233 | 124 | 97 | 1098 |
| Research Productivity | 422 | 128 | 57 | 336 | 954 |
| Output Quality | 476 | 179 | 59 | 47 | 761 |
| Decision Quality | 328 | 177 | 81 | 47 | 640 |
| Firm Productivity | 435 | 57 | 88 | 20 | 606 |
| AI Safety & Ethics | 218 | 277 | 65 | 33 | 599 |
| Market Structure | 180 | 170 | 123 | 24 | 502 |
| Task Allocation | 213 | 64 | 72 | 33 | 387 |
| Skill Acquisition | 170 | 61 | 61 | 17 | 309 |
| Innovation Output | 203 | 27 | 43 | 18 | 292 |
| Employment Level | 105 | 54 | 107 | 13 | 281 |
| Fiscal & Macroeconomic | 131 | 69 | 43 | 26 | 276 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 153 | 48 | 26 | 3 | 230 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 122 | 49 | 6 | 221 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 69 | 92 | 10 | 2 | 173 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 56 | 56 | 26 | 13 | 154 |
| Training Effectiveness | 94 | 21 | 13 | 19 | 149 |
| Wages & Compensation | 77 | 36 | 25 | 6 | 144 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 80 | 20 | 1 | 113 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 31 | 18 | 8 | 3 | 61 |
| Skill Obsolescence | 5 | 46 | 6 | 1 | 58 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
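
As a rough reading aid, the direction columns can be turned into shares. The sketch below is illustrative only: it copies three rows from the matrix, uses the sum of the four direction columns as the denominator (an assumption, since the Total column is sometimes larger than that sum, for example 606 versus 600 for Firm Productivity), and the function name `direction_shares` is hypothetical.

```python
# Illustrative only: rows copied from the matrix above as
# (positive, negative, mixed, null) claim counts.
ROWS = {
    "Firm Productivity": (435, 57, 88, 20),
    "Inequality Measures": (44, 122, 49, 6),
    "Job Displacement": (12, 80, 20, 1),
}

def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the four listed columns."""
    counted = positive + negative + mixed + null
    return {
        "positive": positive / counted,
        "negative": negative / counted,
        "mixed": mixed / counted,
        "null": null / counted,
    }

for outcome, counts in ROWS.items():
    shares = direction_shares(*counts)
    print(f"{outcome}: {shares['positive']:.1%} positive, "
          f"{shares['negative']:.1%} negative")
```

What counts as a "positive" finding for an outcome such as Job Displacement is not defined in this excerpt, so these shares describe the labels as given and not a welfare judgment.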
Market concentration and network effects create platform power that may squeeze smaller providers, raise costs, or lock users into ecosystems.
Platform economics literature and case examples reviewed in the paper; conceptual and theoretical support with illustrative empirical instances from secondary sources.
Infrastructure gaps (connectivity, electricity, identity systems) limit who benefits from digital finance.
Cross-country and development literature synthesized in the paper highlighting correlations between infrastructure availability and digital finance uptake; no primary empirical analysis in the paper.
Measurement problems (task-based output measurement, attributing output changes to AI) and selection into early adoption both push estimated productivity gains upward.
Methodological robustness checks reported in the paper: task-based measures, bounding exercises, placebo tests, and analysis of pre-trends; discussions of selection on unobservables and potential upward bias.
Implementing the governed hyperautomation pattern raises upfront costs (governance tooling, monitoring, validation, compliance processes).
Economic and cost-structure discussion in the paper, based on qualitative reasoning and industry experience; no quantified cost estimates or sample-based cost analysis provided.
Use of standardized (non-adaptive) dialogues limits ecological validity relative to live adaptive chatbots.
Limitations section acknowledges that standardized (non-adaptive) experimental dialogues reduce ecological validity compared with live/adaptive chatbot interactions.
Platform KPIs (e.g., eCPM) can diverge from social welfare metrics (consumer surplus, privacy harms), creating metric misalignment.
Conceptual critique with examples of common platform metrics versus welfare economics; not accompanied by a quantitative comparison dataset.
Privacy constraints reduce observability and necessitate privacy-preserving study designs that complicate estimation.
Methodological analysis referencing differential privacy, federated learning and their effects on statistical power/observability; no experimental power analyses with sample sizes presented here.
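To make the cost in statistical precision concrete, here is a minimal sketch of how Laplace noise inflates the standard error of a released mean. It assumes a textbook setup that is not taken from the paper: n values clipped to [0, 1], a single ε-differentially-private mean released with the Laplace mechanism (sensitivity 1/n, noise scale 1/(nε)), and hypothetical function names.

```python
import math

def sampling_se(std_dev, n):
    """Standard error of a plain (non-private) sample mean."""
    return std_dev / math.sqrt(n)

def dp_mean_se(std_dev, n, epsilon):
    """Standard error of a Laplace-noised mean of n values in [0, 1].

    Assumptions: each value lies in [0, 1], so changing one record
    moves the mean by at most 1/n (the sensitivity). The Laplace
    mechanism adds noise with scale b = 1 / (n * epsilon), whose
    variance is 2 * b**2. Sampling error and noise are independent,
    so their variances add.
    """
    scale = 1.0 / (n * epsilon)
    noise_var = 2.0 * scale ** 2
    return math.sqrt(std_dev ** 2 / n + noise_var)

if __name__ == "__main__":
    for n in (100, 10_000):
        plain = sampling_se(0.5, n)
        private = dp_mean_se(0.5, n, epsilon=0.1)
        print(f"n={n}: plain SE={plain:.4f}, private SE={private:.4f}")
```

Under these assumptions the privacy noise dominates at small n and fades as n grows, which is one reason privacy-preserving designs tend to need larger samples for the same precision.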
Data access asymmetries (platforms holding proprietary logs) limit external auditability and replication of advertising research.
Empirical and institutional observation about industry data practices; supported by calls for privacy-preserving shared datasets in the paper; no quantified survey sample included.
Attribution complexity — multi-touch, cross-device, and delayed conversions — confounds causal inference in advertising measurement.
Methodological discussion referencing causal inference challenges and standard problems in attribution; widely documented in the literature, though not re-measured in this paper.
Complex automated systems make attribution and responsibility harder when harms occur (Automation vs accountability trade-off).
Qualitative institutional analysis and case-study reasoning about multi-agent automated pipelines and opaque model decisions; no single empirical incident dataset provided.
Richer personalization depends on granular data and cross-device identity, creating privacy externalities and compliance risks (Personalization vs privacy trade-off).
Data source inventory and privacy literature review; supported by observational industry trends (move to first-party identity) rather than a quantified sample in the paper.
Federated infrastructures introduce adversarial risks (model/data poisoning, inference attacks on updates) that require robust aggregation, anomaly detection, and other defenses.
Threat modeling and taxonomy of adversarial/privacy threats with mapped mitigations (robust aggregation, anomaly detection, DP). Evidence is conceptual and based on standard threat frameworks; no empirical attack/defense experiments reported at scale.
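As one concrete instance of robust aggregation, the sketch below contrasts plain averaging with a coordinate-wise median over client updates. The median is a standard robust aggregator chosen here for illustration; the paper does not say which aggregator it has in mind, and the update vectors are made up.

```python
from statistics import mean, median

def aggregate_mean(updates):
    """Plain federated averaging: coordinate-wise mean of client updates."""
    return [mean(coord) for coord in zip(*updates)]

def aggregate_median(updates):
    """Coordinate-wise median: a minority of extreme updates cannot
    drag any coordinate arbitrarily far."""
    return [median(coord) for coord in zip(*updates)]

# Hypothetical updates: four honest clients and one poisoned client.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.19]]
poisoned = [[50.0, 50.0]]
updates = honest + poisoned

print("mean  :", aggregate_mean(updates))
print("median:", aggregate_median(updates))
```

The median only helps while poisoned clients are a minority, and it does nothing against inference attacks on individual updates, which need separate defenses such as differential privacy.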
Delayed and sparse feedback (clicks/conversions) in advertising complicates credit assignment and timely model updates, degrading learning unless specific methods for delayed/sparse signals are used.
Analytical discussion of learning dynamics with delayed/sparse labels; conceptual solutions suggested (credit assignment methods). No large-scale empirical evaluation presented.
Non-IID and heterogeneous data distributions across devices and publishers impair convergence and degrade personalization unless addressed with algorithmic adaptations.
Analytical modeling of convergence under non-IID conditions; threat/robustness discussion; prototype/simulation illustrations. This claim is supported by established literature and the paper's analytic treatment.
The cost of formalizing informal labor (CFIL) implies formalizing a worker costs on average 88% more than the informal wage in 2023.
New CFIL metric calculated for 19 countries (2023 baseline) by estimating the additional employer cost of hiring and formalizing an informal worker and reporting it relative to the informal wage, using compiled statutory obligations and informal wage benchmarks.
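Read as a ratio, the 88% figure says that the employer's cost for a formalized worker is about 1.88 times the informal wage. A minimal sketch of that arithmetic follows; the formula is inferred from the description above and is not quoted from the paper, and the wage and cost inputs are hypothetical.

```python
def cfil(formal_employer_cost, informal_wage):
    """Extra cost of formalizing a worker, as a share of the informal wage.

    Assumed definition (inferred, not quoted from the paper):
        CFIL = (formal employer cost - informal wage) / informal wage
    """
    if informal_wage <= 0:
        raise ValueError("informal_wage must be positive")
    return (formal_employer_cost - informal_wage) / informal_wage

# Hypothetical monthly figures in any single currency.
informal_wage = 300.0
formal_cost = 564.0  # wage plus statutory contributions and other obligations

print(f"CFIL = {cfil(formal_cost, informal_wage):.0%}")
```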
VIS inherits the limitations of input–output assumptions (fixed coefficients, no price feedbacks); AI-driven structural change may violate those assumptions, so dynamic extensions or calibration are needed.
Paper explicitly cautions about input–output model limitations and the need for dynamic extensions/calibration under structural/technological change.
There is sizable attrition in the pipeline from applicant admission through to direct employment of AI graduates, indicating leakages at multiple stages (application → admission → graduation → employment).
Quantification of human-resource losses across pipeline stages using the monitoring dataset for the 191 institutions; descriptive counts/percentages of entrants, admitted students, graduates, and those directly employed in AI roles (pipeline loss metrics reported in paper).
Graduates from Russian universities running AI-related educational programs together with alternative training routes (self-education and professional retraining) satisfy 43.9% of estimated national AI personnel demand.
Monitoring dataset of 191 Russian universities implementing AI-related programs; aggregated counts of university graduates plus estimated contributions from self-education and professional retraining compared to an estimated national AI personnel demand (coverage reported as 43.9%).
AI automates routine and some mid-skill tasks, reducing employment in those occupations.
Empirical task-based exposure measures mapping AI capabilities to occupational task content, microdata analyses of employment by occupation using household/employer/administrative datasets, and panel regressions/decompositions that document within-occupation declines and between-occupation shifts.
Relying on secondary literature limits the paper's ability to make causal inferences and constrains empirical generalizability to all sectors or countries.
Stated limitations in the paper's Data & Methods section acknowledging scope and inferential constraints.
Increases in K_T reduce employment levels in affected firms and industries even when aggregate productivity rises.
Panel econometric estimates at firm and industry levels relating K_T intensity to employment outcomes, controlling for demand, input prices, and firm characteristics; difference-in-differences specifications and instrumental-variable robustness checks; corroborated by sectoral case studies.
Rising technological capital (K_T) — proxied by robot/automation density, software and intangible capital accumulation, AI adoption surveys, and AI-related patenting — leads to a decline in labor’s share of output.
Firm- and industry-level panel regressions linking constructed K_T intensity measures to labor shares, supported by macro growth-accounting decompositions; robustness checks include difference-in-differences and instrumenting adoption with plausibly exogenous shocks (e.g., cross-border technology diffusion, trade shocks); validated with cross-country comparisons and case studies.
Fuel subsidy reform imposed an enormous fiscal burden that peaked at 2.8% of GDP in 2022, limiting the macroeconomic leverage of AI-driven efficiency gains.
Reported fiscal statistic in the paper (2.8% of GDP in 2022) and its role in analysis of why AI savings do not translate into large macro gains.
The oil and gas trade balance remained in deficit at -1.55 billion USD in May 2025 and -1.58 billion USD in July 2025 despite an overall national trade surplus.
Reported trade-balance figures in the paper (monthly trade statistics for May and July 2025).
We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.
Paper statement indicating inclusion of discussion sections on tradeoffs, failure modes, and operational lessons; descriptive/meta claim about paper content.
The core problem is not the absence of explanation but the absence of structured reasoning in the first place.
Conceptual argument/proposed reframing presented in the paper; no empirical test reported.
We ran a controlled three-arm ablation on a production valuation agent: A = plain web-only LLM analyst; B = adds public structured tools + a 14-dimension valuation playbook, verifier, objectivity policy and red-team; C = adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence.
Description of experimental arms and setup used in the study (methodological statement).
The evidence base was concentrated in system-facing applications that detect or shape inequities within recruitment, evaluation and exposure systems.
Synthesis result from the scoping review indicating thematic concentration across included studies (as reported in abstract).
ALE is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks.
Author-provided counts describing the benchmark taxonomy and task pool.
ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy).
Design specification described in the paper referencing O*NET / SOC 2018.
Agentic AI is best characterized as a continuum of autonomy and delegated authority, distinct from purely informational outputs and including systems capable of independently generating insured events through external actions.
Conceptual taxonomy and definitional argument presented in the paper distinguishing informational models from agentic systems with delegated authority; theoretical reasoning and classification.
We evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants.
Method description: evaluation of seven LLMs comparing zero-shot model estimates to self-reported skill ratings; 27 participants provided self-reports.
At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining.
Method description and algorithmic claim in the paper (selection rule maximizing predicted correctness with cost penalty). No empirical sample size required for algorithmic description.
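The selection rule can be sketched as an argmax over configurations of predicted correctness minus a weighted cost. Everything named below (`select_config`, `cost_weight`, the configuration names, the probabilities and costs) is hypothetical; the exact penalty form is not given here, so a linear penalty is assumed.

```python
def select_config(predicted_correctness, cost, cost_weight):
    """Pick the configuration with the best cost-penalized score.

    score(c) = predicted_correctness[c] - cost_weight * cost[c]
    A linear penalty is an assumption made for this sketch.
    """
    return max(
        predicted_correctness,
        key=lambda c: predicted_correctness[c] - cost_weight * cost[c],
    )

# Hypothetical per-query predictions and per-query costs.
predicted = {"small": 0.62, "medium": 0.80, "large": 0.88}
cost = {"small": 0.1, "medium": 0.5, "large": 2.0}

for weight in (0.0, 0.1, 1.0):
    print(weight, select_config(predicted, cost, weight))
```

Because `cost_weight` enters only at selection time, sweeping it traces out a cost-quality tradeoff without touching the trained predictors, which is the property the claim describes.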
We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly.
Method description in the paper: BRANE architecture and training procedure (LLM-based feature extraction + per-configuration correctness predictor). No numeric sample size reported for method description.
AI deployment should be evaluated not only by average task speed, but by its overall effects on congestion, rework, and the robustness of human oversight under load.
Policy/recommendation based on the paper's theoretical results and derived implications from the queueing model (conceptual/prescriptive conclusion; no empirical testing reported).
The divergence between mean task speed and system-level delay caused by AI assistance is labeled the 'variance wedge'.
Definition/terminology introduced in the paper as part of its conceptual framing; supported by the analytic model description.
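One way to see how mean task speed and system-level delay can diverge is the Pollaczek-Khinchine formula for an M/G/1 queue, where the mean wait depends on the second moment of service time and not only on its mean. The paper's own queueing model is not reproduced here; the M/G/1 setting, the function names, and all numbers are assumptions for illustration.

```python
def mg1_mean_wait(arrival_rate, service_mean, service_var):
    """Mean time in queue for an M/G/1 system (Pollaczek-Khinchine).

    W_q = arrival_rate * E[S^2] / (2 * (1 - rho)),
    with rho = arrival_rate * E[S] and E[S^2] = Var[S] + E[S]^2.
    """
    rho = arrival_rate * service_mean
    if rho >= 1:
        raise ValueError("queue is unstable: utilization must be below 1")
    second_moment = service_var + service_mean ** 2
    return arrival_rate * second_moment / (2 * (1 - rho))

def mean_time_in_system(arrival_rate, service_mean, service_var):
    """Mean wait in queue plus mean service time."""
    return mg1_mean_wait(arrival_rate, service_mean, service_var) + service_mean

# Hypothetical scenarios with the same arrival rate.
baseline = mean_time_in_system(0.8, service_mean=1.0, service_var=0.25)
assisted = mean_time_in_system(0.8, service_mean=0.9, service_var=2.0)
print(f"baseline: {baseline:.2f}, assisted: {assisted:.2f}")
```

In this made-up example the assisted service is 10% faster on average, yet total time in system rises because the higher service-time variance (for instance from occasional rework) lengthens the queue.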
The benchmark probes 18 mainstream LLMs across four prompting strategies.
Benchmark experiments described in the paper evaluate 18 mainstream LLMs using four different prompting strategies applied to the collected dataset.
Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic developed in the theory.
Methodological statement that the paper uses structured cross-domain illustrations to ground and discipline the theoretical claims; no empirical sample reported.
There are three accountability-boundary strategies in agentic ecosystems: component, integrated, and dual-track.
Theoretical categorization introduced by the authors as part of the capability-level theory; illustrated with cross-domain examples rather than empirical testing.
GENSTRAT generates a distribution of two-player zero-sum imperfect-information card games.
Design specification in paper; reported generated pool size of 2,000 games (abstract).
The study used standard scientific methods, employing a comparative approach and inductive and deductive methods to identify patterns of interaction between legal regulation and technological development.
Methodology section of the paper explicitly states the use of comparative, inductive and deductive methods and theoretical synthesis.
The paper develops a theoretical and legal model that treats law as an integral part of the economic system influencing income distribution, labour relations, market structure and productivity dynamics.
Model construction through synthesis of theoretical perspectives using inductive and deductive methods and comparative legal analysis (methodology described in the paper).
The paper provides a taxonomy of minimum input artifacts for agentic software, firmware, and hardware work; a conversation-to-contract gate; risk-adaptive workflows; and an evidence-bundle acceptance model for agent-generated artifacts.
Declared contributions in the paper (deliverables/artefacts produced by the research; no empirical validation provided in the abstract).
The central problem for agentic engineering is no longer prompt engineering; it is engineering process control.
Argument and synthesis presented by the paper (conceptual claim based on reviewed evidence).
The results define three operating regimes.
Summary claim in results/conclusions indicating categorization of outcomes into three regimes.
Few benchmarks achieve widespread use (examples given include GPQA Diamond, LiveCodeBench, AIME 2025).
Empirical observation from the dataset showing that only a small number of benchmarks are highlighted across multiple builders/releases; specific named benchmarks are cited as relatively widely used.
We performed a large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries.
Study design and reported sample sizes and model counts provided in the paper.
Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline.
Methodological description in the paper describing the two experimental conditions (with/without Causely) and two scenarios (active incident, healthy baseline).
Experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends).
Methodological description listing the four agent configurations used in experiments.
We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application.
Methodological description in the paper specifying a controlled benchmark with an OpenTelemetry demo application composed of 24 microservices.