Evidence (8625 claims)
Adoption
8625 claims
Productivity
7686 claims
Governance
6917 claims
Human-AI Collaboration
6574 claims
Org Design
4189 claims
Innovation
4131 claims
Labor Markets
3588 claims
Skills & Training
2985 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
Adoption
Remove filter
Cluster reliability should be validated (e.g., bootstrap, perturbations) and automatic labels complemented with expert human validation for critical analyses.
Caveat and recommended validation steps provided in summary; suggests bootstrap/perturbation and manual validation as best practices. No empirical stability metrics provided in summary.
Results are sensitive to model and prompt choice; researchers should perform robustness checks across LLMs, soft prompts, and embedding models.
Caveat explicitly stated in the paper summary noting model and prompt sensitivity; recommended validation steps include robustness checks across models and prompts.
Empirical validation is concentrated on the Agora-12 corpus; generalizability to other architectures, scales, or deployment contexts is unproven and identified as a limitation.
Authors' own limitations section and scope of empirical tests (analyses limited to Agora-12 and four clinical cases).
Higher complaint volume is significantly associated with near-term stock price declines.
Fixed-effects panel path models estimated on monthly data for 261 financial firms (2018–2023) report statistically significant negative associations between firm–month complaint volume and subsequent abnormal returns.
Consumer complaints—measured by monthly volume, topic composition, and VADER sentiment of complaint narratives—contain behavioral signals that predict short-term abnormal stock returns in U.S. financial firms.
CFPB complaint records matched to 261 publicly traded U.S. financial firms (monthly observations, 2018–2023); analyses use fixed-effects panel path models to link firm–month complaint features (volume, LDA topic prevalences, aggregated VADER sentiment) to firm-level abnormal returns; complementary machine-learning models evaluate out-of-sample predictive performance.
Measurement issues (task-based output measurement, attributing output changes to AI) and selection into early adoption bias estimated productivity gains upward.
Methodological robustness checks reported in the paper: task-based measures, bounding exercises, placebo tests, and analysis of pre-trends; discussions of selection on unobservables and potential upward bias.
Implementing the governed hyperautomation pattern raises upfront costs (governance tooling, monitoring, validation, compliance processes).
Economic and cost-structure discussion in the paper, based on qualitative reasoning and industry experience; no quantified cost estimates or sample-based cost analysis provided.
Use of standardized (non-adaptive) dialogues limits ecological validity relative to live adaptive chatbots.
Limitations section acknowledges that standardized (non-adaptive) experimental dialogues reduce ecological validity compared with live/adaptive chatbot interactions.
Platform KPIs (e.g., eCPM) can diverge from social welfare metrics (consumer surplus, privacy harms), creating metric misalignment.
Conceptual critique with examples of common platform metrics versus welfare economics; not accompanied by a quantitative comparison dataset.
Privacy constraints reduce observability and necessitate privacy-preserving study designs that complicate estimation.
Methodological analysis referencing differential privacy, federated learning and their effects on statistical power/observability; no experimental power analyses with sample sizes presented here.
Data access asymmetries (platforms holding proprietary logs) limit external auditability and replication of advertising research.
Empirical and institutional observation about industry data practices; supported by calls for privacy-preserving shared datasets in the paper; no quantified survey sample included.
Attribution complexity — multi-touch, cross-device, and delayed conversions — confounds causal inference in advertising measurement.
Methodological discussion referencing causal inference challenges and standard problems in attribution; widely-documented in the literature though not re-measured in this paper.
Complex automated systems make attribution and responsibility harder when harms occur (Automation vs accountability trade-off).
Qualitative institutional analysis and case-study reasoning about multi-agent automated pipelines and opaque model decisions; no single empirical incident dataset provided.
Richer personalization depends on granular data and cross-device identity, creating privacy externalities and compliance risks (Personalization vs privacy trade-off).
Data source inventory and privacy literature review; supported by observational industry trends (move to first-party identity) rather than a quantified sample in the paper.
The cost of formalizing informal labor (CFIL) implies formalizing a worker costs on average 88% more than the informal wage in 2023.
New CFIL metric calculated for 19 countries (2023 baseline) by estimating the additional employer cost of hiring and formalizing an informal worker and reporting it relative to the informal wage, using compiled statutory obligations and informal wage benchmarks.
There is sizable attrition in the pipeline from applicant admission through to direct employment of AI graduates, indicating leakages at multiple stages (application → admission → graduation → employment).
Quantification of human-resource losses across pipeline stages using the monitoring dataset for the 191 institutions; descriptive counts/percentages of entrants, admitted students, graduates, and those directly employed in AI roles (pipeline loss metrics reported in paper).
Graduates from Russian universities running AI-related educational programs together with alternative training routes (self-education and professional retraining) satisfy 43.9% of estimated national AI personnel demand.
Monitoring dataset of 191 Russian universities implementing AI-related programs; aggregated counts of university graduates plus estimated contributions from self-education and professional retraining compared to an estimated national AI personnel demand (coverage reported as 43.9%).
AI automates routine and some mid-skill tasks, reducing employment in those occupations.
Empirical task-based exposure measures mapping AI capabilities to occupational task content, microdata analyses of employment by occupation using household/employer/administrative datasets, and panel regressions/decompositions that document within-occupation declines and between-occupation shifts.
Relying on secondary literature limits the paper's ability to make causal inferences and constrains empirical generalizability to all sectors or countries.
Stated limitations in the paper's Data & Methods section acknowledging scope and inferential constraints.
Increases in K_T reduce employment levels in affected firms and industries even when aggregate productivity rises.
Panel econometric estimates at firm and industry levels relating K_T intensity to employment outcomes, controlling for demand, input prices, and firm characteristics; difference-in-differences specifications and instrumental-variable robustness checks; corroborated by sectoral case studies.
Rising technological capital (K_T) — proxied by robot/automation density, software and intangible capital accumulation, AI adoption surveys, and AI-related patenting — leads to a decline in labor’s share of output.
Firm- and industry-level panel regressions linking constructed K_T intensity measures to labor shares, supported by macro growth-accounting decompositions; robustness checks include difference-in-differences and instrumenting adoption with plausibly exogenous shocks (e.g., cross-border technology diffusion, trade shocks); validated with cross-country comparisons and case studies.
We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario.
Reported experimental design: two controlled human-subject studies conducted by the authors (details and comparisons described in paper).
The pipeline's components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence (single-party pipeline).
Implementation detail described in the paper (system architecture specification).
ALE is organized around a task taxonomy with 55 subfields grouped into 13 industry clusters covering 1K+ tasks.
Author-provided counts describing the benchmark taxonomy and task pool.
ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy).
Design specification described in the paper referencing O*NET / SOC 2018.
At inference time, BRANE selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining.
Method description and algorithmic claim in the paper (selection rule maximizing predicted correctness with cost penalty). No empirical sample size required for algorithmic description.
We propose BRANE, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly.
Method description in the paper: BRANE architecture and training procedure (LLM-based feature extraction + per-configuration correctness predictor). No numeric sample size reported for method description.
AI deployment should be evaluated not only by average task speed, but by its overall effects on congestion, rework, and the robustness of human oversight under load.
Policy/recommendation based on the paper's theoretical results and derived implications from the queueing model (conceptual/prescriptive conclusion; no empirical testing reported).
The divergence between mean task speed and system-level delay caused by AI assistance is labeled the 'variance wedge'.
Definition/terminology introduced in the paper as part of its conceptual framing; supported by the analytic model description.
GENSTRAT generates a distribution of two-player zero-sum imperfect-information card games.
Design specification in paper; reported generated pool size of 2,000 games (abstract).
Few benchmarks achieve widespread use (examples given include GPQA Diamond, LiveCodeBench, AIME 2025).
Empirical observation from the dataset showing that only a small number of benchmarks are highlighted across multiple builders/releases; specific named benchmarks are cited as relatively widely used.
We introduce a taxonomy organized by influence tier, corresponding to interventions on progressively more latent variables: product mentions, information framing, behavioral redirection, and long-term preference shaping.
Paper contribution: authors present a four-tier taxonomy as a conceptual framework; this is a descriptive/constructive claim about the content of the paper itself.
Regulatory technology is viewed as a governance arrangement that organizes relations between firms, banks, insurers, logistics actors, buyers, and regulators.
Conceptual framing developed through the interpretive synthesis of multiple literature streams in the paper.
Primary evaluation uses real SDK tool-use across nine models from three providers (N=30 per model), where models autonomously invoke a graph query tool and reason from results.
Experimental setup reported by authors: 9 models from 3 providers, with 30 trials per model using real SDK tool-use and autonomous graph queries.
Oracle Poisoning manipulates the data agents reason over, not their instructions, distinguishing it from prompt injection.
Theoretical distinction and definitional comparison made by the authors (conceptual argument in the paper).
A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering.
Methodological claim describing a six-gate producer audit procedure in the paper to diagnose engineering failures vs. commercial steering.
Holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template (paired counterfactual).
Method description of the paired counterfactual experimental design used by TourMart.
We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance, driven by two governance levers — lambda (gain on message-induced perception) and kappa (budget-normalized cap on how far the message can shift perceived welfare).
Methodological proposal described in paper: design of an audit instrument and two formal levers (lambda, kappa).
Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice.
Descriptive assertion in paper about product/industry UI change; no empirical sample or formal measurement reported in excerpt.
We construct an evaluation framework on five function-calling benchmarks and train a DistilBERT-based classifier, deployed under a latency budget.
Methods / experimental setup reported in the paper: five function-calling benchmarks and a DistilBERT classifier trained and deployed under latency constraints.
We show that ρ ≥ 1 is the no-excess-crowding parity condition and connect Δ to an adoption game with exposure-dependent redundancy costs.
Theoretical result derived in the paper linking the human-relative diversity ratio ρ to a parity condition and relating the excess-crowding coefficient Δ to an adoption-game model with exposure-dependent redundancy costs.
The paper's contribution is a reproducible failure-and-repair recipe for agentic systems where scalar rewards are easy to game and the intended behavior is only visible in traces (not a new optimizer or a hotel-pricing leaderboard).
Authors' framing and explicit statements of intended contribution; supported by the failure diagnosis, diagnostic protocol, and Trace-Prior RL repair demonstrated in simulator experiments.
We position DAO-governed decentralized physical infrastructure networks (DePIN) within a vertically integrated stack that links energy and sensing to connectivity, storage/compute, models, and robots.
Architectural/framework description in the paper that maps DePIN elements into a vertically integrated stack; conceptual/mapping method without empirical measurement.
The same curated index is exposed as a Gosset MCP server that any frontier model can call as a tool.
System description in paper noting that the curated index is available via a Gosset MCP server for external models to call.
All five systems receive the same natural-language query and the same JSON output schema.
Methodological detail reported in paper describing controlled inputs across systems.
We benchmark Gosset ... against four frontier systems with web access (Claude Opus 4.7, GPT 5.5, Gemini 3.1 Pro, Perplexity sonar-pro) on ten niche oncology/immunology targets.
Experimental benchmark described in paper: direct comparison of Gosset versus four named models on 10 targets; methodological statement.
Weight-based memory generalizes by applying abstract rules to inputs never seen before.
Conceptual claim grounded in the paper's theoretical distinction between weight-based learning and retrieval; references Complementary Learning Systems theory; no empirical sample in abstract.
Retrieval generalizes by similarity to stored cases.
Conceptual claim stated in paper (distinction between retrieval-based and weight-based generalization); supported by theoretical characterization, not empirical data in abstract.
Many practical machine learning applications are online and sequential, meaning prior decisions inform future ones — a setting in which fairness challenges differ from standard supervised learning.
Background claim in the paper motivating the work; literature context and conceptual discussion rather than new empirical data.
The paper establishes a taxonomy of forgetting mechanisms: passive decay-based, active deletion-based, safety-triggered, and adaptive reinforcement-based.
Explicit taxonomy presented in paper (listed in abstract).