Evidence (13661 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	740	192	95	871	1945
Governance & Regulation	796	388	185	119	1512
Organizational Efficiency	765	186	123	82	1166
Technology Adoption Rate	610	227	121	95	1061
Research Productivity	409	121	56	331	928
Output Quality	464	174	58	47	743
Decision Quality	318	173	75	42	615
Firm Productivity	432	55	88	20	601
AI Safety & Ethics	214	273	65	33	589
Market Structure	175	165	120	24	489
Task Allocation	206	64	70	31	376
Skill Acquisition	161	57	57	16	291
Innovation Output	201	27	41	18	288
Fiscal & Macroeconomic	130	69	43	26	275
Employment Level	104	50	105	13	274
Consumer Welfare	116	62	42	11	231
Firm Revenue	149	45	26	3	223
Inequality Measures	43	120	49	6	218
Task Completion Time	164	29	8	12	214
Worker Satisfaction	89	60	20	12	181
Error Rate	69	89	9	2	169
Regulatory Compliance	74	67	14	4	159
Training Effectiveness	91	19	13	19	144
Wages & Compensation	77	33	25	6	141
Team Performance	86	17	27	9	140
Automation Exposure	49	50	22	12	136
Developer Productivity	91	17	14	5	128
Job Displacement	12	80	19	1	112
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	16	7	2	57
Skill Obsolescence	5	43	6	1	55
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

In a user study where 12 participants created slide decks, MindTrellis outperformed retrieval-only baselines in knowledge organization and cognitive load, as measured by expert ratings of content coverage and structural quality.

Controlled user study reported in the paper: N = 12 participants performing slide-deck creation tasks; outcomes assessed via expert ratings of content coverage and structural quality (comparison to retrieval-only baseline).

high positive MindTrellis: Co-Creating Knowledge Structures with AI throug... knowledge organization and cognitive load (operationalized via expert ratings of...

MindTrellis is an interactive visual system where users and AI collaboratively build a dynamic knowledge graph; users can query the graph for document-grounded information and contribute by introducing new concepts, modifying relationships, and reorganizing the hierarchy.

System design and implementation described in the paper (feature description and demonstration).

high positive MindTrellis: Co-Creating Knowledge Structures with AI throug... system capability to support collaborative construction and manipulation of a dy...

The framework, all collected signals, scoring outputs, and evaluation harness are released under CC BY 4.0.

Statement of data and code release policy in the paper.

high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... availability/license of framework and data

AgentPulse surfaces deployment signal absent from benchmarks; it is a methodology, not a ground-truth ranking.

Conceptual and empirical argument in the paper supported by the analyses described (correlations with external adoption proxies and divergence from benchmark-only rankings).

high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... presence of deployment/adoption signals not captured by standard benchmarks

The Benchmark+Sentiment sub-composite correlates with VS Code installs (ρ_s=0.44, p<0.05), reported as illustrative given that only 11 of 35 agents have non-zero installs.

Spearman correlation between Benchmark+Sentiment sub-composite and VS Code installs on the 35-agent sample, with a caveat that installs are non-zero for only 11 agents; reported correlation and p-value.

high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... VS Code installs (IDE install counts as adoption proxy)

The Benchmark+Sentiment sub-composite predicts Stack Overflow question volume (ρ_s=0.49, p<0.01) in the circularity-controlled test (n=35).

Circularity-controlled Spearman correlation between Benchmark+Sentiment sub-composite and Stack Overflow question volume on 35 agents; reported correlation and p-value.

high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... Stack Overflow question volume (external adoption/engagement proxy)

A circularity-controlled test (n=35) shows the Benchmark+Sentiment sub-composite, which contains no GitHub-derived signals, predicts external adoption proxies it does not aggregate: GitHub stars (ρ_s=0.52, p<0.01).

Circularity-controlled correlation test (Spearman) between Benchmark+Sentiment sub-composite and GitHub stars on a 35-agent sample; reported Spearman correlation and p-value.

high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... GitHub stars (external adoption proxy)
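
The three correlations above are Spearman rank correlations. As a minimal sketch of how such a coefficient can be computed when many observations tie (for example, agents with zero installs), the following pure-Python function uses average ranks. The function names and toy data are illustrative assumptions, not AgentPulse code or data.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Hypothetical toy data: three of five "agents" have zero installs (ties).
installs = [0, 0, 0, 5, 9]
scores = [0.1, 0.4, 0.2, 0.3, 0.9]
rho = spearman_rho(scores, installs)
```

With many zero-valued observations, most of the ranking information comes from the few non-zero cases, which is consistent with the paper reporting the VS Code result as illustrative only.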

We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload categories along four factors (Benchmark Performance, Adoption Signals, Community Sentiment, and Ecosystem Health) aggregated from 18 real-time signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards.

Methodological description in the paper; reported sample of 50 agents and use of 18 signals from enumerated sources.

high positive AgentPulse: A Continuous Multi-Signal Framework for Evaluati... AgentPulse composite and factor scores (Benchmark Performance, Adoption Signals,...

A follow-up intervention where we add information about capabilities from prior experiments to the context improves calibration.

Follow-up experimental intervention reported in the paper: augmenting model context with prior capability information and measuring calibration change.

high positive MarketBench: Evaluating AI Agents as Market Participants change in calibration of predicted success probability and token usage after add...

Markets are a promising way to coordinate AI agent activity for similar reasons to those used to justify markets more broadly.

Conceptual/theoretical argument presented in the paper (no empirical test reported in the excerpt).

high positive MarketBench: Evaluating AI Agents as Market Participants suitability of markets for coordinating AI agents (theoretical promise)

These results provide concrete tier-selection guidance across deployment scales from a single seminar to a university-wide rollout.

Concluding claim in paper based on empirical latency and cost comparisons across throughput tiers and concurrency up to 50 users from a live deployment.

high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... tier-selection guidance for deployment scale decision-making

Provisioned Throughput, expensive under continuous provisioning, becomes cost-competitive for institutions that can predict and concentrate their traffic toward high utilization.

Cost modeling and comparison across provisioning modes in the paper showing trade-offs conditional on utilization; based on the instrumented tiers and assumed usage patterns.

high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... cost competitiveness (cost per unit of usage vs utilization)
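
The utilization argument above can be illustrated with a break-even calculation. This is a sketch under stated assumptions: the cost model (a fixed hourly fee for provisioned capacity versus a flat per-request price) and every number below are hypothetical placeholders, not figures from the paper.

```python
def break_even_utilization(provisioned_cost_per_hour,
                           capacity_requests_per_hour,
                           paygo_cost_per_request):
    """Utilization (0..1) above which provisioned capacity is cheaper per request.

    Assumed model: provisioned capacity costs a fixed amount per hour however
    much of it is used, while pay-per-use charges a flat price per request.
    """
    return provisioned_cost_per_hour / (
        capacity_requests_per_hour * paygo_cost_per_request
    )


def provisioned_cost_per_request(provisioned_cost_per_hour,
                                 capacity_requests_per_hour,
                                 utilization):
    """Effective per-request cost of provisioned capacity at a given utilization."""
    return provisioned_cost_per_hour / (capacity_requests_per_hour * utilization)


# Hypothetical prices: 10 currency units per hour for 1000 requests per hour
# of capacity, against 0.02 per request on a pay-per-use tier.
threshold = break_even_utilization(10.0, 1000, 0.02)
```

Under this assumed model, an institution whose traffic keeps utilization above `threshold` pays less per request on the provisioned tier, and one below it pays more.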

Cost analysis places both pay-per-token tiers well below the price of a STEM textbook per student per semester under a worst-case usage ceiling.

Cost analysis presented in the paper comparing per-student per-semester costs under a stated worst-case usage ceiling to the price of a STEM textbook; exact numeric assumptions not provided in the excerpt.

high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... cost per student per semester

Priority PayGo maintains flat sub-4-second response times across the full load range.

Empirical latency measurements from the ITAS instrumentation described in the paper (over 3,000 requests across concurrency levels up to 50 and three throughput tiers).

high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response time (latency)

Multi-agent LLM tutoring systems improve response quality through agent specialization.

Statement in paper describing design rationale; no quantitative quality comparison or metrics provided in the excerpt.

high positive Latency and Cost of Multi-Agent Intelligent Tutoring at Scal... response quality

LLMs reveal their ability to approximate adaptive equilibria beyond static mechanism design.

Interpretive claim based on simulation results showing LLM bidders adapt and achieve favorable outcomes when mechanism assumptions fail (e.g., static budgets). This is an inferential claim drawn from comparative experiments described in the paper; no numerical quantification provided in the excerpt.

high positive Strategic Bidding in 6G Spectrum Auctions with Large Languag... ability to approximate adaptive equilibria (strategic adaptation capability)

When theoretical assumptions break—such as under static budget constraints—LLMs sustain longer participation and achieve higher utilities.

Reported simulation results under scenarios violating VCG truthfulness assumptions (example: static budget constraints). The paper states LLM bidders maintained participation for longer and obtained higher utility than truthful and heuristic strategies in these scenarios. No numeric sample sizes or quantified effect sizes provided in the abstract.

high positive Strategic Bidding in 6G Spectrum Auctions with Large Languag... participation duration and agent utility

Unlike heuristics, LLMs leverage historical outcomes and prompt-based reasoning to adapt their bidding behavior dynamically.

Method description and reported behavioral difference: the paper states LLM bidders incorporate prior auction outcomes and prompt engineering to inform bids, contrasted with static heuristic strategies. Based on simulation implementation rather than field deployment; no sample size provided.

high positive Strategic Bidding in 6G Spectrum Auctions with Large Languag... bidding behavior adaptability (dynamic adaptation using history and prompts)

Generative artificial intelligence (genAI) is rapidly reshaping how knowledge and culture are produced and consumed.

Author's descriptive statement based on observed changes in production/consumption patterns (no empirical sample reported in paper abstract).

high positive Generative artificial intelligence reduces social welfare th... production and consumption of knowledge and culture

The paper provides a structured overview of energy forecasting use cases along three main dimensions: stakeholders, attributes, and data categories.

Methodological contribution described in the paper (framework/overview component).

high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... existence of a structured overview framework

The FETS benchmark collects and analyzes 54 datasets across 9 data categories guided by typical stakeholder interests.

Descriptive claim about the benchmark dataset compilation reported in the paper (dataset count and category count).

high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... breadth of dataset coverage (count and categories)

Foundation models show improved performance at higher aggregation levels such as national load, district heating, and power grid data.

Subset analyses within the benchmark comparing performance across aggregation levels (e.g., national load, district heating, power grid) showing better results for aggregated data.

high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... forecast accuracy stratified by aggregation level

Foundation models outperform classical machine learning approaches despite the latter having seen the full historic target data during training.

Benchmark setup where classical ML baselines had access to full historic target data during training while foundation models were pretrained/generalized; empirical comparison across the dataset collection.

high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... forecasting accuracy when classical ML had access to full historic targets

Covariate-informed foundation models achieve the strongest performance.

Benchmark experiments that compare foundation model variants, including those that incorporate covariates, across the collected datasets; reported as a comparative finding in the benchmark results.

high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... predictive performance of covariate-informed vs non-covariate models

Foundation models consistently outperform dataset-specific optimized machine learning approaches across all settings and data categories.

Empirical benchmark comparing foundation models vs classical dataset-specific ML approaches across multiple forecasting settings and data categories; reported analysis uses the collection of datasets assembled for the FETS benchmark.

high positive FETS Benchmark: Foundation Models Outperform Dataset-specifi... predictive performance of time series forecasts (forecast accuracy/output qualit...

This paper presents the first systematic study of token consumption patterns in agentic coding tasks, analyzing trajectories from eight frontier LLMs on SWE-bench Verified and evaluating models' ability to predict their own token costs before task execution.

Stated scope and methodology in the paper: dataset is SWE-bench Verified, eight frontier LLMs were analyzed, and experiments included model self-prediction evaluation.

high positive How Do AI Agents Spend Your Money? Analyzing and Predicting ... scope of study (presence of systematic analysis and self-prediction evaluation)

Technological capability (AI) and board diversity are complementary in strengthening corporate governance and fiscal discipline in developing economies.

Synthesis/interpretation of empirical results (main effects and interaction effects) from panel regressions and robustness analyses on 1,586 firms across 2009–2023.

high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... corporate governance effectiveness and fiscal discipline (proxied by tax complia...

The main findings are robust to alternative tax avoidance measures, alternative BGD specifications, heterogeneity analyses, and selection-bias corrections (Heckman, propensity score matching, and instrumental-variable 2SLS approaches).

Reported robustness checks in the paper applying multiple alternative variable specifications and methods for selection-bias correction on the primary sample.

high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... stability/robustness of the association between BGD (and its interaction with AI...

AI capability significantly strengthens the relationship between BGD and effective tax rates; firms with higher AI adoption exhibit a stronger governance effect of gender-diverse boards on tax compliance.

Interaction models estimated on the same balanced panel (1,586 firms, 2009–2023) using lagged AI capability specification; estimated with firm FE and dynamic two-step System GMM, with reported statistically significant interaction effects.

high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... effective tax rate / tax compliance

Board gender diversity (BGD) is positively associated with effective tax rates, implying lower levels of corporate tax avoidance.

Empirical analysis using a balanced panel of 1,586 non-financial firms from developing economies over 2009–2023; firm fixed effects models and dynamic two-step System GMM estimations used to address unobserved heterogeneity, endogeneity, and persistence of corporate tax behavior.

high positive AI-Enabled Governance: Board Gender Diversity and Corporate ... effective tax rate (ETR) / level of corporate tax avoidance

The system reduced variability in solder-joint quality and cycle time.

Abstract statement that variability in solder-joint quality and cycle time was reduced during the deployment (no quantitative variability metrics provided in the abstract).

high positive Learning-augmented robotic automation for real-world manufac... variability of solder-joint quality; variability of cycle time

The system maintained near-human takt time.

Abstract claim comparing the system's cycle/takt time to human performance during the deployment (no numeric takt-time comparison provided in the abstract).

high positive Learning-augmented robotic automation for real-world manufac... takt time (cycle time) relative to human workers

The system achieved a 99.4% pass rate on product-level quality-control tests.

Reported QC pass rate from the production run in the abstract (presumably based on the produced motors).

high positive Learning-augmented robotic automation for real-world manufac... product-level quality-control pass rate

The system operated without physical fencing.

Abstract statement that the run occurred "without physical fencing" (implying operation around people without traditional fences).

high positive Learning-augmented robotic automation for real-world manufac... use of physical fences for safety (absent)

The system produced 108 motors during the run.

Count of products produced during the continuous run reported in the abstract.

high positive Learning-augmented robotic automation for real-world manufac... number of motors produced during the run

The system operated continuously for 5 h 10 min.

Reported continuous operation duration from the production run described in the abstract.

high positive Learning-augmented robotic automation for real-world manufac... continuous operational time without interruption

The system required less than 20 min of real-world data per task.

Reported training data requirement for the deployed tasks in the authors' field experiment (abstract statement).

high positive Learning-augmented robotic automation for real-world manufac... amount of real-world training data per task

With less than 20 min of real-world data per task, the system operated continuously for 5 h 10 min, producing 108 motors without physical fencing and achieving a 99.4% pass rate on product-level quality-control tests.

Single field deployment / production run reported in the paper; numbers reported in the abstract (training data time, continuous operation duration, number of motors produced, fencing status, QC pass rate).

high positive Learning-augmented robotic automation for real-world manufac... training data required; continuous operational duration; production quantity; pr...

We deployed the system on an electric-motor production line to automate deformable cable insertion and soldering under real manufacturing constraints, a step previously performed manually by human workers.

Field deployment on an actual electric-motor production line described by the authors (deployment + task specification).

high positive Learning-augmented robotic automation for real-world manufac... automation of previously manual deformable cable insertion and soldering tasks

We present Learning-Augmented Robotic Automation, a hybrid system that integrates learned task controllers and a neural 3D safety monitor into conventional industrial workflows.

Description of the system developed by the authors (system design/development reported in the paper).

high positive Learning-augmented robotic automation for real-world manufac... integration of learned controllers and 3D safety monitoring

Self-correction should be treated not as a default behavior, but as a control decision governed by measurable error dynamics.

Synthesis of theoretical framing (Markov model and diagnostic inequality) and empirical results across multiple models/datasets showing thresholds and promptability of EIR.

high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... policy/recommendation about when to enable iterative self-correction to improve ...

A 'verify-first' prompt ablation on GPT-4o-mini reduces EIR from 2% to 0% and turns -6.2 pp degradation into +0.2 pp (paired McNemar p < 10^-4).

A prompt-ablation experiment reported for GPT-4o-mini showing EIR dropping from 2% to 0% and the observed accuracy change flipping from -6.2 percentage points to +0.2 percentage points; statistical significance assessed with a paired McNemar test (p < 10^-4).

high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... EIR and accuracy change from self-correction after prompt modification

In this framework, EIR functions as a stability margin and prompting functions as lightweight controller design.

Conceptual framing in the paper (cybernetic feedback loop where the same language model is controller and plant), supported by associated experiments showing prompt changes affect EIR and outcomes.

high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... stability of iterative refinement (EIR) and resulting accuracy

Iterate only when ECR/EIR > Acc/(1 - Acc).

The paper frames self-correction as a two-state Markov model over {Correct, Incorrect} and derives this deployment diagnostic analytically from that model.

high positive When Does LLM Self-Correction Help? A Control-Theoretic Mark... whether iterative self-correction is expected to improve accuracy
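
The diagnostic above follows from one step of a two-state Markov chain. The sketch below assumes that EIR is the per-round probability that a correct answer becomes incorrect and ECR the per-round probability that an incorrect answer becomes correct; these readings, the function names, and the example numbers are assumptions for illustration, not definitions quoted from the paper.

```python
def accuracy_after_round(acc, ecr, eir):
    """Expected accuracy after one self-correction round.

    Correct answers stay correct with probability (1 - eir);
    incorrect answers become correct with probability ecr.
    """
    return acc * (1.0 - eir) + (1.0 - acc) * ecr


def should_iterate(acc, ecr, eir):
    """True when ECR/EIR > Acc/(1 - Acc).

    The inequality is cross-multiplied so that eir == 0 or acc == 1
    does not cause a division by zero.
    """
    return ecr * (1.0 - acc) > eir * acc


# Hypothetical rates: the same ECR and EIR help at 80% accuracy
# but hurt at 90% accuracy.
helps = should_iterate(0.80, ecr=0.10, eir=0.02)
hurts = should_iterate(0.90, ecr=0.10, eir=0.02)
```

Under these assumptions, `should_iterate` returns True exactly when `accuracy_after_round` exceeds the starting accuracy, because Acc(1 - EIR) + (1 - Acc)ECR > Acc rearranges to (1 - Acc)ECR > Acc·EIR.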

In the research context, context-specific training and support measures are decisive for the success of the Copilot rollout.

Authors' conclusion from the findings on how research staff's assessments changed over time and on differing perceptions of benefit (stated in the abstract).

high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... importance of training and support measures for success/adoption

The study shows that Microsoft 365 Copilot enables efficiency gains, particularly in administrative work.

Self-reported assessments by employees (specifically administrative staff) in the repeated cross-sectional survey; the authors infer practical relevance for administrative work (abstract).

high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived efficiency gains in administrative work

The findings underscore the importance of context-specific rollout, role-based training, and governance for sustained acceptance of generative AI in organizations.

Authors' interpretation/conclusion based on the survey results, observed differences between roles, and changes over time (stated in the abstract).

high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... recommended implementation measures (context adaptation, training, governance) for...

Copilot's greatest added value lies in clearly structured, text-based tasks.

Survey results on perceived benefit for typical knowledge-work activities, as summarized in the abstract (preferred task types: structured, text-based tasks).

high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived benefit by task type (text-based, structured tasks)

Microsoft 365 Copilot is predominantly perceived as user-friendly and technically reliable.

Self-reported assessments of usability and technical reliability in the repeated cross-sectional survey (stated in the abstract).

high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived usability and technical reliability

Over time, research staff develop more positive assessments, particularly regarding productivity and the easing of work that Copilot provides.

Quasi-longitudinal observation across the repeated cross-sectional surveys; the change over time in research staff's self-assessments is described in the abstract.

high positive Generative KI in der Wissensarbeit: Wahrnehmung, Nutzen und ... perceived productivity and easing of work (self-assessment over time...
