Evidence (13661 claims)
Adoption (8339 claims)
Productivity (7479 claims)
Governance (6715 claims)
Human-AI Collaboration (6267 claims)
Org Design (4098 claims)
Innovation (3987 claims)
Labor Markets (3488 claims)
Skills & Training (2888 claims)
Inequality (2016 claims)
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 740 | 192 | 95 | 871 | 1945 |
| Governance & Regulation | 796 | 388 | 185 | 119 | 1512 |
| Organizational Efficiency | 765 | 186 | 123 | 82 | 1166 |
| Technology Adoption Rate | 610 | 227 | 121 | 95 | 1061 |
| Research Productivity | 409 | 121 | 56 | 331 | 928 |
| Output Quality | 464 | 174 | 58 | 47 | 743 |
| Decision Quality | 318 | 173 | 75 | 42 | 615 |
| Firm Productivity | 432 | 55 | 88 | 20 | 601 |
| AI Safety & Ethics | 214 | 273 | 65 | 33 | 589 |
| Market Structure | 175 | 165 | 120 | 24 | 489 |
| Task Allocation | 206 | 64 | 70 | 31 | 376 |
| Skill Acquisition | 161 | 57 | 57 | 16 | 291 |
| Innovation Output | 201 | 27 | 41 | 18 | 288 |
| Fiscal & Macroeconomic | 130 | 69 | 43 | 26 | 275 |
| Employment Level | 104 | 50 | 105 | 13 | 274 |
| Consumer Welfare | 116 | 62 | 42 | 11 | 231 |
| Firm Revenue | 149 | 45 | 26 | 3 | 223 |
| Inequality Measures | 43 | 120 | 49 | 6 | 218 |
| Task Completion Time | 164 | 29 | 8 | 12 | 214 |
| Worker Satisfaction | 89 | 60 | 20 | 12 | 181 |
| Error Rate | 69 | 89 | 9 | 2 | 169 |
| Regulatory Compliance | 74 | 67 | 14 | 4 | 159 |
| Training Effectiveness | 91 | 19 | 13 | 19 | 144 |
| Wages & Compensation | 77 | 33 | 25 | 6 | 141 |
| Team Performance | 86 | 17 | 27 | 9 | 140 |
| Automation Exposure | 49 | 50 | 22 | 12 | 136 |
| Developer Productivity | 91 | 17 | 14 | 5 | 128 |
| Job Displacement | 12 | 80 | 19 | 1 | 112 |
| Hiring & Recruitment | 51 | 7 | 8 | 3 | 69 |
| Creative Output | 31 | 16 | 7 | 2 | 57 |
| Skill Obsolescence | 5 | 43 | 6 | 1 | 55 |
| Social Protection | 27 | 16 | 8 | 2 | 53 |
| Labor Share of Income | 17 | 17 | 17 | — | 51 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
In a user study where 12 participants created slide decks, MindTrellis outperformed retrieval-only baselines in knowledge organization and cognitive load, as measured by expert ratings of content coverage and structural quality.
Controlled user study reported in the paper: N = 12 participants performing slide-deck creation tasks; outcomes assessed via expert ratings of content coverage and structural quality (comparison to retrieval-only baseline).
MindTrellis is an interactive visual system where users and AI collaboratively build a dynamic knowledge graph; users can query the graph for document-grounded information and contribute by introducing new concepts, modifying relationships, and reorganizing the hierarchy.
System design and implementation described in the paper (feature description and demonstration).
The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.
Statement of data and code release policy in the paper.
AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking.
Conceptual and empirical argument in the paper supported by the analyses described (correlations with external adoption proxies and divergence from benchmark-only rankings).
The Benchmark+Sentiment sub-composite correlates with VS Code installs (ρ_s=0.44, p<0.05), reported as illustrative given that only 11 of 35 agents have non-zero installs.
Spearman correlation between Benchmark+Sentiment sub-composite and VS Code installs on the 35-agent sample, with a caveat that installs are non-zero for only 11 agents; reported correlation and p-value.
The Benchmark+Sentiment sub-composite predicts Stack Overflow question volume (ρ_s=0.49, p<0.01) in the circularity-controlled test (n=35).
Circularity-controlled Spearman correlation between Benchmark+Sentiment sub-composite and Stack Overflow question volume on 35 agents; reported correlation and p-value.
A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars (ρ_s=0.52, p<0.01).
Circularity-controlled correlation test (Spearman) between Benchmark+Sentiment sub-composite and GitHub stars on a 35-agent sample; reported Spearman correlation and p-value.
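The rank correlations reported above (ρ_s) can be illustrated with a minimal sketch of Spearman's coefficient, computed as the Pearson correlation of rank vectors with tied values sharing their average rank. The function names and the six-agent data are hypothetical and are not taken from the paper; the sketch does not compute a p-value, and many zero installs (as in the VS Code caveat) show up here as a large block of tied ranks.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical composite scores and install counts for six agents
# (illustrative only, not data from the paper).
composite_score = [0.91, 0.40, 0.75, 0.62, 0.33, 0.58]
installs = [12000, 0, 0, 3000, 0, 500]
rho = spearman_rho(composite_score, installs)
```

The sketch assumes neither input is constant; a constant input has zero rank variance and would raise a division error.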
We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards.
Methodological description in the paper; reported sample of 50 agents and use of 18 signals from enumerated sources.
A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration.
Follow-up experimental intervention reported in the paper: augmenting model context with prior capability information and measuring calibration change.
Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly.
Conceptual/theoretical argument presented in the paper (no empirical test reported in the excerpt).
These results provide concrete tier-selection guidance across deployment scales from a single seminar to a university-wide rollout.
Concluding claim in paper based on empirical latency and cost comparisons across throughput tiers and concurrency up to 50 users from a live deployment.
Provisioned Throughput, expensive under continuous provisioning, becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization.
Cost modeling and comparison across provisioning modes in the paper showing trade-offs conditional on utilization; based on the instrumented tiers and assumed usage patterns.
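The utilization trade-off described above can be sketched as a break-even calculation: provisioned capacity is billed at a flat rate whether or not it is used, while pay-per-token cost scales with traffic. Every name and number below is a hypothetical placeholder, not a price or capacity from the paper or from any provider.

```python
def break_even_utilization(provisioned_cost_per_hour,
                           capacity_mtok_per_hour,
                           paygo_price_per_mtok):
    """Fraction of provisioned capacity that must be used for the flat
    hourly fee to equal the pay-per-token bill for the same traffic."""
    paygo_cost_at_full_capacity = capacity_mtok_per_hour * paygo_price_per_mtok
    return provisioned_cost_per_hour / paygo_cost_at_full_capacity


def cheaper_mode(utilization, provisioned_cost_per_hour,
                 capacity_mtok_per_hour, paygo_price_per_mtok):
    """Return which billing mode costs less at a given utilization (0 to 1)."""
    paygo_cost = utilization * capacity_mtok_per_hour * paygo_price_per_mtok
    return "provisioned" if provisioned_cost_per_hour < paygo_cost else "pay-per-token"


# Hypothetical figures: a flat fee of 40 per hour buys capacity for
# 10 million tokens per hour; pay-per-token costs 5 per million tokens.
threshold = break_even_utilization(40.0, 10.0, 5.0)
low_traffic = cheaper_mode(0.30, 40.0, 10.0, 5.0)
concentrated_traffic = cheaper_mode(0.95, 40.0, 10.0, 5.0)
```

Under these assumed figures the flat fee only pays off once traffic is concentrated above the break-even fraction, which is the pattern the claim describes for institutions with predictable load.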
Cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester under a worst-case usage ceiling.
Cost analysis presented in the paper comparing per-student per-semester costs under a stated worst-case usage ceiling to the price of a STEM textbook; exact numeric assumptions not provided in the excerpt.
Priority PayGo maintains flat sub-4-second response times across the full load range.
Empirical latency measurements from the ITAS instrumentation described in the paper (over 3,000 requests across concurrency levels up to 50 and three throughput tiers).
Multi-agent LLM tutoring systems improve response quality through agent specialization.
Statement in paper describing design rationale; no quantitative quality comparison or metrics provided in the excerpt.
LLMs show an ability to approximate adaptive equilibria that goes beyond static mechanism design.
Interpretive claim based on simulation results showing LLM bidders adapt and achieve favorable outcomes when mechanism assumptions fail (e.g., static budgets). This is an inferential claim drawn from comparative experiments described in the paper; no numerical quantification provided in the excerpt.
When theoretical assumptions break, such as under static budget constraints, LLM bidders sustain longer participation and achieve higher utilities than truthful and heuristic strategies.
Reported simulation results under scenarios violating VCG truthfulness assumptions (example: static budget constraints). The paper states LLM bidders maintained participation for longer and obtained higher utility than truthful and heuristic strategies in these scenarios. No numeric sample sizes or quantified effect sizes provided in the abstract.
Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically.
Method description and reported behavioral difference: the paper states LLM bidders incorporate prior auction outcomes and prompt engineering to inform bids, contrasted with static heuristic strategies. Based on simulation implementation rather than field deployment; no sample size provided.
Generative artificial intelligence (genAI) is rapidly reshaping how knowledge and culture are produced and consumed.
Author's descriptive statement based on observed changes in production/consumption patterns (no empirical sample reported in paper abstract).
The paper provides a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories.
Methodological contribution described in the paper (framework/overview component).
The FETS benchmark collects and analyzes 54 datasets across 9 data categories guided by typical stakeholder interests.
Descriptive claim about the benchmark dataset compilation reported in the paper (dataset count and category count).
Foundation models show improved performance at higher aggregation levels such as national load, district heating, and power grid data.
Subset analyses within the benchmark comparing performance across aggregation levels (e.g., national load, district heating, power grid) showing better results for aggregated data.
Foundation models outperform classical machine learning approaches despite the latter having seen the full historic target data during training.
Benchmark setup where classical ML baselines had access to full historic target data during training while foundation models were pretrained/generalized; empirical comparison across the dataset collection.
Covariate-informed foundation models achieve the strongest performance.
Benchmark experiments that compare foundation model variants, including those that incorporate covariates, across the collected datasets; reported as a comparative finding in the benchmark results.
Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories.
Empirical benchmark comparing foundation models vs classical dataset-specific ML approaches across multiple forecasting settings and data categories; reported analysis uses the collection of datasets assembled for the FETS benchmark.
This paper presents the first systematic study of token consumption patterns in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified and evaluating models' ability to predict their own token costs before task execution.
Stated scope and methodology in the paper: dataset is SWE-bench Verified, eight frontier LLMs were analyzed, and experiments included model self-prediction evaluation.
Technological capability (AI) and board diversity are complementary in strengthening corporate governance and fiscal discipline in developing economies.
Synthesis/interpretation of empirical results (main effects and interaction effects) from panel regressions and robustness analyses on 1,586 firms across 2009–2023.
The main findings are robust to alternative tax avoidance measures, alternative BGD specifications, heterogeneity analyses, and selection-bias corrections (Heckman, propensity score matching, and instrumental-variable 2SLS approaches).
Reported robustness checks in the paper applying multiple alternative variable specifications and methods for selection-bias correction on the primary sample.
AI capability significantly strengthens the relationship between BGD and effective tax rates; firms with higher AI adoption exhibit a stronger governance effect of gender-diverse boards on tax compliance.
Interaction models estimated on the same balanced panel (1,586 firms, 2009–2023) using lagged AI capability specification; estimated with firm FE and dynamic two-step System GMM, with reported statistically significant interaction effects.
Board gender diversity (BGD) is positively associated with effective tax rates, implying lower levels of corporate tax avoidance.
Empirical analysis using a balanced panel of 1,586 non-financial firms from developing economies over 2009–2023; firm fixed effects models and dynamic two-step System GMM estimations used to address unobserved heterogeneity, endogeneity, and persistence of corporate tax behavior.
The system reduced variability in solder-joint quality and cycle time.
Abstract statement that variability in solder-joint quality and cycle time was reduced during the deployment (no quantitative variability metrics provided in the abstract).
The system maintained near-human takt time.
Abstract claim comparing the system's cycle/takt time to human performance during the deployment (no numeric takt-time comparison provided in the abstract).
The system achieved a 99.4% pass rate on product-level quality-control tests.
Reported QC pass rate from the production run in the abstract (presumably based on the produced motors).
The system operated without physical fencing.
Abstract statement that the run occurred "without physical fencing" (implying operation around people without traditional fences).
The system produced 108 motors.
Count of products produced during the continuous run reported in the abstract.
The system operated continuously for 5 h 10 min.
Reported continuous operation duration from the production run described in the abstract.
The system required less than 20 min of real-world data per task.
Reported training data requirement for the deployed tasks in the authors' field experiment (abstract statement).
With less than 20 min of real-world data per task, the system operated continuously for 5 h 10 min, producing 108 motors without physical fencing and achieving a 99.4% pass rate on product-level quality-control tests.
Single field deployment / production run reported in the paper; numbers reported in the abstract (training data time, continuous operation duration, number of motors produced, fencing status, QC pass rate).
We deployed the system on an electric-motor production line to automate deformable cable insertion and soldering under real manufacturing constraints, a step previously performed manually by human workers.
Field deployment on an actual electric-motor production line described by the authors (deployment + task specification).
We present Learning-Augmented Robotic Automation, a hybrid system that integrates learned task controllers and a neural 3D safety monitor into conventional industrial workflows.
Description of the system developed by the authors (system design/development reported in the paper).
Self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.
Synthesis of theoretical framing (Markov model and diagnostic inequality) and empirical results across multiple models/datasets showing thresholds and promptability of EIR.
A 'verify-first' prompt ablation on GPT-4o-mini reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4).
A prompt-ablation experiment reported for GPT-4o-mini showing EIR dropping from 2% to 0% and the observed accuracy change flipping from -6.2 percentage points to +0.2 percentage points; statistical significance assessed with a paired McNemar test (p < 10^-4).
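A paired McNemar test of the kind cited above compares two conditions on the same items and uses only the discordant pairs: items that were correct under one condition and incorrect under the other. The sketch below uses the exact binomial form; whether the paper used the exact or the chi-square variant is not stated in the excerpt, and the counts are hypothetical.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: items correct under condition A but incorrect under condition B.
    c: items incorrect under condition A but correct under condition B.
    Under the null hypothesis each discordant pair falls in either cell
    with probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts (not taken from the paper).
p_one_sided_flip = mcnemar_exact_p(0, 20)  # every flip goes the same way
p_balanced = mcnemar_exact_p(5, 5)         # flips cancel out
```

Items that are correct, or incorrect, under both conditions do not enter the statistic, which is why the test suits before/after comparisons of the same prompts.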
In this framework, EIR functions as a stability margin and prompting functions as lightweight controller design.
Conceptual framing in the paper (cybernetic feedback loop where the same language model is controller and plant), supported by associated experiments showing prompt changes affect EIR and outcomes.
Iterate only when ECR/EIR > Acc/(1 - Acc).
The paper frames self-correction as a two-state Markov model over {Correct, Incorrect} and derives this deployment diagnostic analytically from that model.
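The diagnostic above can be checked with a minimal sketch of one step of a two-state chain. The sketch assumes that EIR is the probability that a correct answer becomes incorrect after a correction round and that ECR is the probability that an incorrect answer becomes correct; the excerpt does not expand either abbreviation, so these readings and the function names are assumptions. Under them, accuracy after one round is Acc(1 - EIR) + (1 - Acc)ECR, which exceeds Acc exactly when ECR(1 - Acc) > EIR·Acc, the cross-multiplied form of the stated inequality.

```python
def accuracy_after_one_round(acc, ecr, eir):
    """One step of the two-state chain over {Correct, Incorrect}:
    correct answers survive with probability 1 - eir, and incorrect
    answers are repaired with probability ecr (assumed definitions)."""
    return acc * (1 - eir) + (1 - acc) * ecr


def should_iterate(acc, ecr, eir):
    """Cross-multiplied form of ECR/EIR > Acc/(1 - Acc).

    Multiplying through avoids a division by zero when eir == 0
    or acc == 1.
    """
    return ecr * (1 - acc) > eir * acc


# Hypothetical rates (not taken from the paper).
strong_model = should_iterate(acc=0.9, ecr=0.1, eir=0.02)   # high baseline accuracy
weaker_model = should_iterate(acc=0.6, ecr=0.1, eir=0.02)   # more errors left to fix
```

With the same correction and introduction rates, the higher-accuracy case fails the check because there are more correct answers to break than incorrect ones to repair.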
In the research context, context-specific training and support measures are decisive for the success of the Copilot rollout.
Authors' conclusion drawn from the findings on how research staff's ratings developed over time and on differing perceptions of benefit (stated in the abstract).
The study shows that Microsoft 365 Copilot enables efficiency gains particularly in administrative work.
Self-reported assessments by employees (specifically administrative staff) in the repeated cross-sectional survey; the authors infer practical relevance for administrative work from these (abstract).
The findings underscore the importance of context-specific rollout, role-based training, and governance for sustained acceptance of generative AI in organizations.
Authors' interpretation/conclusion based on the survey results and on observed differences between roles and developments over time (stated in the abstract).
Copilot's greatest added value lies in clearly structured, text-based tasks.
Survey results on perceived benefit for typical knowledge-work activities, as summarized in the abstract (preferred task types: structured, text-based tasks).
Microsoft 365 Copilot is predominantly perceived as user-friendly and technically reliable.
Self-reported ratings of user-friendliness and technical reliability in the repeated cross-sectional survey (stated in the abstract).
Over time, research staff develop more positive assessments, particularly regarding productivity and reduced workload through Copilot.
Quasi-longitudinal observation across the repeated cross-sectional surveys; change over time in research staff's self-assessments, as described in the abstract.