Evidence (13870 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	196	98	892	1984
Governance & Regulation	817	394	188	121	1544
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	627	233	123	96	1088
Research Productivity	411	123	56	332	933
Output Quality	467	178	59	47	751
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	167	122	24	496
Task Allocation	207	64	71	32	379
Skill Acquisition	165	59	60	17	301
Innovation Output	203	27	43	18	292
Employment Level	105	52	107	13	279
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	150	48	26	3	227
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	63	20	12	184
Error Rate	69	92	10	2	173
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	93	21	13	19	148
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Creative Output	31	17	7	3	59
Skill Obsolescence	5	46	6	1	58
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
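
As a worked illustration of reading the matrix, the sketch below computes the share of positive findings for a few rows copied from the table. The helper names are hypothetical. In some rows the Total exceeds the sum of the four direction columns (for Other, 749 + 196 + 98 + 892 = 1935 against a Total of 1984), and the page does not say why, so the sketch divides by the directional sum rather than by Total.

```python
# Rows copied from the table above (tab-separated, as on the page).
# "\u2014" is the dash the table uses for an empty cell; it is read as 0 here.
ROWS = [
    "Other\t749\t196\t98\t892\t1984",
    "AI Safety & Ethics\t214\t276\t65\t33\t593",
    "Job Displacement\t12\t80\t20\t1\t113",
    "Labor Share of Income\t17\t17\t17\t\u2014\t51",
]
EMPTY_CELL = "\u2014"


def parse_row(line):
    """Split one table row into (outcome, [positive, negative, mixed, null], total)."""
    name, *cells = line.split("\t")
    numbers = [0 if cell == EMPTY_CELL else int(cell) for cell in cells]
    return name, numbers[:4], numbers[4]


def positive_share(line):
    """Positive claims divided by the sum of the four direction columns."""
    _, directions, _ = parse_row(line)
    directional_sum = sum(directions)
    return directions[0] / directional_sum if directional_sum else 0.0


if __name__ == "__main__":
    for row in ROWS:
        outcome, directions, total = parse_row(row)
        print(f"{outcome}: {positive_share(row):.1%} positive "
              f"(directional sum {sum(directions)}, Total {total})")
```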

Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.

Paper describes and releases an open-source harness for orchestrating LLMs/agents and provides standardized scenarios and evaluation tools intended to support reproducibility.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... availability of open-source orchestration code and standardized evaluation scena...

Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.

Paper reports organization/validation via a NeurIPS 2025 competition, states participation of 100+ teams, and includes documentation/analyses of top submissions.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... number of competing teams (100+), availability of competition analyses/winning s...

The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.

Paper/documentation notes a live leaderboard for Battling and provides self-contained evaluation pipelines/orchestration for Speedrunning intended to support reproducible runs.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... presence of live leaderboard and self-contained evaluation pipelines

Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.

Paper presents baseline implementations and experiments spanning heuristic, RL, and LLM-based agents and describes training procedures and architectures used for each baseline category.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... presence and types of baseline agents (heuristic, RL, LLM)

The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).

Paper structure and dataset descriptions specify two tracks, their scopes, and the inclusion of a multi-agent orchestration system for the Speedrunning Track.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... benchmark partitioning (presence of Battling and Speedrunning tracks)

The Battling Track dataset contains more than 20 million recorded battle trajectories.

Paper reports a Battling Track dataset of >20M recorded battle trajectories collected from simulated/match play; size reported explicitly in dataset and methods section.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... number of recorded battle trajectories (>20,000,000)

PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.

Paper describes design and motivation of the benchmark, detailing two tracks (Battling and Speedrunning) intended to capture partial observability, adversarial/game-theoretic interactions, and long-horizon sequential planning; benchmark implementation built on Pokemon simulator and described task specifications.

high positive The PokeAgent Challenge: Competitive and Long-Context Learni... benchmark task characteristics (partial observability, game-theoretic complexity...

iDaVIE's modular architecture supports extensibility (planned features include subcube loading, advanced render modes, video scripting, and collaborative VR sessions).

Paper describes modular architecture and lists planned/possible future features; this is a software design claim rather than an empirical result.

high positive iDaVIE v1.0: A virtual reality tool for interactive analysis... software extensibility and planned feature set

Because iDaVIE is open-source and extensible, software licensing costs are low and marginal adoption costs fall over time.

Paper states iDaVIE is open-source and designed for community-driven enhancements; economic claim based on general properties of open-source software rather than empirical cost accounting.

high positive iDaVIE v1.0: A virtual reality tool for interactive analysis... licensing cost implication and marginal adoption costs

iDaVIE includes interaction features such as selection, cropping/subcube tools, catalogue overlays, and export back to existing pipelines.

Feature list in paper describing selection, cropping, overlays, in-VR metrics and export functionality; demonstrated integration to export edited masks/subcubes.

high positive iDaVIE v1.0: A virtual reality tool for interactive analysis... availability and functionality of in-VR interaction and export tools

Streaming and downsampling pipelines implemented as Unity plug-ins make large volumes interactively viewable in VR while preserving needed detail for inspection.

Technical description of custom Unity plug-ins for streaming/downsampling and on-the-fly statistics; tested on HI cubes (telescopes listed) per the paper.

high positive iDaVIE v1.0: A virtual reality tool for interactive analysis... interactive rendering performance and retention of inspection-relevant detail

iDaVIE (v1.0) is a working VR software suite that lets astronomers import, render, inspect, and interactively edit very large 3D data cubes in real time.

Described implementation of iDaVIE v1.0 built on Unity/SteamVR with custom plug-ins for parsing/downsampling and real-time rendering; tested on large 3D spectral (HI) cubes from radio telescopes (MeerKAT, ASKAP, APERTIF) as reported in the paper.

high positive iDaVIE v1.0: A virtual reality tool for interactive analysis... ability to import/render/inspect/edit large 3D data cubes in real time (interact...

Personalized LLM coaching produced a statistically significant increase in alignment with the normative empathic taxonomy relative to both the video-based non-personalized feedback and control arms.

Pre-registered randomized experiment with three arms; pre-registered analysis reported statistically significant differences favoring personalized coaching on the primary alignment outcome.

high positive Practicing with Language Models Cultivates Human Empathic Co... statistical difference in alignment to normative empathic patterns (primary outc...

A brief, personalized coaching intervention delivered by a large language model significantly improves participants' alignment with normative, idiomatic empathic communication patterns.

Pre-registered randomized controlled trial with three arms (personalized LLM coaching, video-based non-personalized feedback, control). Outcome measured as alignment to a data-driven normative taxonomy via coding/automated measures. Overall corpus and sample context: 968 participants, 2,904 conversations, 33,938 messages used in the study.

high positive Practicing with Language Models Cultivates Human Empathic Co... alignment with normative empathic patterns (coding/automated alignment metrics)

HindSight reveals a large, real difference between systems that is missed by LLM-based judging (i.e., HindSight detects the retrieval-augmentation advantage while LLM-judged metrics do not).

Combined empirical results: HindSight shows a 2.5× advantage (p < 0.001) for retrieval augmentation while LLM-as-Judge reports no significant difference (p = 0.584).

high positive HindSight: Evaluating LLM-Generated Research Ideas via Futur... Detection of performance difference between retrieval-augmented and vanilla gene...

Experiments in the paper cover 10 AI/ML research topics and use a 30-month forward evaluation window.

Experimental setup reported in the paper: scope explicitly stated as 10 AI/ML topics and a 30-month forward window after cutoff T.

high positive HindSight: Evaluating LLM-Generated Research Ideas via Futur... Scope parameters (number of topics = 10; forward window length = 30 months)

Generated ideas can be algorithmically compared to future publications and matched items can be assigned scores reflecting downstream impact (citation counts and venue acceptance).

Method section: description of algorithmic matching procedure and scoring rules that use citation counts and venue acceptance as impact proxies.

high positive HindSight: Evaluating LLM-Generated Research Ideas via Futur... Match indicators and downstream-impact scores (citations, venue acceptance) for ...

A retrieval-augmented idea generator produces 2.5× higher-scoring ideas than a vanilla generator according to HindSight (p < 0.001).

Empirical comparison reported in the paper across the specified experiments (10 AI/ML topics, time-split at T, 30-month forward window); statistical test reporting a 2.5× difference with p < 0.001.

high positive HindSight: Evaluating LLM-Generated Research Ideas via Futur... HindSight score (downstream-impact-based score for generated ideas)

HindSight is a time-split, retrospective evaluation that (1) restricts idea generation to pre-cutoff literature (time T), (2) compares generated ideas to papers published in the following 30 months, and (3) scores matches by downstream impact (citation counts and venue acceptance).

Method described in paper: time-split protocol with a temporal cutoff T, a 30-month forward window, algorithmic matching of generated ideas to later publications, and scoring based on downstream impact metrics (citations and venue acceptance).

high positive HindSight: Evaluating LLM-Generated Research Ideas via Futur... HindSight match score computed from matches to later publications weighted by ci...
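
The time-split protocol described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Paper` record, the token-overlap matcher, the 0.5 match threshold, and the impact formula (log of citations plus a venue bonus) are all assumptions made here. Only the cutoff T, the 30-month forward window, and the use of citations and venue acceptance come from the summary.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    title: str
    published: date
    citations: int
    accepted_at_venue: bool


def add_months(d, months):
    """Shift a date by whole months (day clamped to 28 to stay valid)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, min(d.day, 28))


def token_overlap(a, b):
    """Hypothetical matcher: Jaccard overlap of lower-cased word sets."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def hindsight_score(idea, papers, cutoff, window_months=30, match_threshold=0.5):
    """Score an idea by the best-matching paper published after the cutoff.

    Papers outside (cutoff, cutoff + window] are ignored, so nothing the
    generator could have seen before T contributes to the score.
    """
    window_end = add_months(cutoff, window_months)
    best = 0.0
    for paper in papers:
        if not (cutoff < paper.published <= window_end):
            continue
        if token_overlap(idea, paper.title) < match_threshold:
            continue
        # Assumed impact proxy: damped citation count plus a venue bonus.
        impact = math.log1p(paper.citations) + (1.0 if paper.accepted_at_venue else 0.0)
        best = max(best, impact)
    return best
```

An idea that matches nothing in the forward window scores 0.0 under this sketch, however plausible it reads, which is the property that separates this style of evaluation from judge-based scoring.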

The paper introduces a Multi-Object Decoder (MOD) that extends SAM 3D to jointly reconstruct multiple objects from a single image, targeting physically plausible, non-penetrating object configurations and realistic contacts.

Method section: MOD is described as an extension of the single-object SAM 3D architecture to jointly decode multiple object shapes and poses from a monocular image; the method explicitly aims to reduce inter-object penetration and model contacts.

high positive MessyKitchens: Contact-rich object-level 3D scene reconstruc... methodological capability: joint multi-object monocular 3D reconstruction, objec...

LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.

Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.

high positive Internalizing Agency from Reflective Experience Pass@128 (absolute percentage point improvement)

Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.

Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.

high positive Internalizing Agency from Reflective Experience Pass@1 and Pass@k (fraction of problems solved among k candidate runs)

LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.

Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.

high positive Internalizing Agency from Reflective Experience Long-horizon agentic performance measured by Pass@k (Pass@1, Pass@k, Pass@128)

The RL fine-tuned Qwen2.5-Coder-7B improves by 33.1% over the same base 7B model without RL fine-tuning.

Head-to-head comparison between the tuned model and its untuned base across the 48 evaluation briefs; reported improvement of +33.1%.

high positive Learning to Present: Inverse Specification Rewards for Agent... Absolute or relative quality improvement (%) of tuned vs. untuned Qwen2.5-Coder-...

Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality.

Empirical evaluation on 48 diverse business briefs comparing six models; reported relative quality score of tuned Qwen2.5-Coder-7B = 91.2% of Claude Opus 4.6.

high positive Learning to Present: Inverse Specification Rewards for Agent... Relative slide-generation quality (percent of Claude Opus 4.6 quality) across 48...

Managing captures, traces, and replay sessions from a unified single design database ensures consistency across replay targets and sessions.

Method description emphasizes a single design database coordinating captures and replays across simulation and emulation for the demonstrator system. (Operational claim demonstrated in the implementation; no metrics on error reduction provided.)

high positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... consistency of trace/replay data and configuration across targets

The captured traces can be deterministically replayed across different execution targets (software/hardware simulation and hardware emulation), reducing cross-platform setup complexity and discrepancies.

The same captured waveforms/traces were replayed on both simulation and emulation environments for the ODIN demonstrator; cross-target replay was part of the described method. (Demonstrated on the single reported system; no broad cross-toolchain study provided.)

high positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... consistency of reproduced behavior across simulator and emulator targets

Using the proportional veto core provides formal protection for minority blocs by giving them proportional blocking power, thus encoding a proportional fairness guarantee that simple majoritarian rules lack.

Definition and properties of the proportional veto core presented in the paper; conceptual discussion comparing veto/proportionality guarantees to majoritarian outcomes.

high positive Finding Common Ground in a Sea of Alternatives existence of proportional blocking power / protection for minority groups as for...

The paper characterizes the information cost of aggregating preferences when AI can generate essentially unlimited candidate alternatives by providing sample-complexity upper bounds and matching lower bounds.

The combination of sampling-model formalization, sample-complexity upper bounds, and matching lower bounds constitutes a formal characterization of the information (sample) requirements.

high positive Finding Common Ground in a Sea of Alternatives sample/query complexity as the measure of information cost

The authors prove an upper bound on the number of samples/queries required by their algorithm as a function of accuracy, confidence, and problem parameters.

Theoretical analysis in the paper deriving explicit sample-complexity upper bounds (stated as functions of accuracy/confidence and relevant parameters).

high positive Finding Common Ground in a Sea of Alternatives sample/query complexity required for the algorithm to achieve specified accuracy...

Under only query (sampling) access to the unknown joint distribution of voters and alternatives, there is an efficient sampling-based algorithm that, with high probability, returns an alternative in the approximate proportional veto core.

Constructive algorithm and correctness proof in the paper showing the algorithm returns an approximate core alternative with high probability under the sampling access model.

high positive Finding Common Ground in a Sea of Alternatives probability that the algorithm's output lies in the approximate proportional vet...

The paper formalizes the proportional veto core for settings with an infinite alternative space and voters whose preferences are drawn from an unknown distribution.

Formal model and definitions presented in the paper: extension of the proportional veto core to an infinite alternative space and definitions for sampling-appropriate approximate proportional veto core.

high positive Finding Common Ground in a Sea of Alternatives formal definition / existence of an appropriate approximate proportional veto-co...

Temporally grounding model inputs (constraining models to contemporaneous public information at each node) substantially reduces the risk of training-data leakage and hindsight bias.

Study design enforced node-specific contemporaneous evidence constraints for each of the 11 nodes; methodological rationale and comparison to unconstrained settings described as reducing retrospective information contamination.

high positive When AI Navigates the Fog of War presence/absence or reduction of training-data leakage/hindsight bias (procedura...

BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization.

Comparative experiments against rotation-based PTQ techniques and other existing PTQ baselines on the described multimodal and language tasks; improvements shown in benchmark metrics and recovery percentages in the paper's experimental section.

high positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Task-specific accuracy/quality metrics and percent recovery relative to full-pre...

BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs.

Empirical evaluation reported in the paper: experiments on multiple multimodal large language models (MLLMs) and standard LLMs using an aggressive W4A4KV16 quantization setup; performance reported as percentage of full-precision performance recovered (specific models, benchmark names, and exact sample sizes not enumerated in the summary).

high positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Percentage of full-precision performance recovered (model quality/accuracy on mu...
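
To make the format concrete, here is a minimal sketch of plain MXFP4 round-trip quantization for one block, following the general microscaling scheme: a block of values shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float. It is not BATQuant's method. Nearest-value rounding with ties toward the smaller magnitude is a simplification chosen here, and the function name is hypothetical. Reading W4A4KV16 as 4-bit weights, 4-bit activations, and a 16-bit KV cache follows the usual naming convention and is not spelled out in the summary.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
ELEMENT_EMAX = 2   # exponent of the largest E2M1 value: 6 = 1.5 * 2**2
BLOCK_SIZE = 32    # elements that share one scale in MXFP4


def quantize_block(block):
    """Quantize and dequantize one block; returns (shared_scale, values)."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest element fits the grid.
    shared_exp = math.floor(math.log2(max_abs)) - ELEMENT_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        magnitude = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # saturate at 6
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - magnitude))
        dequantized.append(math.copysign(nearest, v) * scale)
    return scale, dequantized
```

A single large value sets the scale for the whole block: `quantize_block([96.0, 1.0])` picks a scale of 16 and rounds the 1.0 to zero. That sensitivity to outliers is the failure mode the paper's title says BATQuant targets.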

The paper provides concrete, regulation-inspired policy examples (e.g., content prohibition, sensitive data exfiltration) showing how they map into the Policy function.

Worked, illustrative examples included in the paper mapping regulatory constraints to the Policy(agent_id, partial_path, proposed_action, org_state) formalism.

high positive Runtime Governance for AI Agents: Policies on Paths representability of regulation-inspired policies in the formalism (yes/no; examp...

Runtime policy evaluation can intercept, score, log, allow/modify/block actions, and update organizational state as part of an agent's execution loop (reference implementation architecture).

Reference implementation design described in the paper (runtime policy evaluator hooks, logging, enforcement actions); architectural reasoning and pseudo-workflows provided; no production deployment data.

high positive Runtime Governance for AI Agents: Policies on Paths feasibility of integrating runtime policy evaluator into agent loops (architectu...
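
The intercept, score, log, decide, and update cycle described above might look like the following. This is a sketch under stated assumptions rather than the paper's reference implementation: the `OrgState` fields, the thresholds 0.4 and 0.8, the `sanitized(...)` rewrite, and `toy_policy` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OrgState:
    flagged_actions: int = 0  # hypothetical organization-wide counter


def toy_policy(agent_id, partial_path, proposed_action, org_state):
    """Stand-in policy: treat deletions as likely violations."""
    return 0.9 if proposed_action.startswith("delete") else 0.1


def govern_step(agent_id, partial_path, proposed_action, org_state, policy,
                audit_log, modify_at=0.4, block_at=0.8):
    """Evaluate one proposed action before it runs.

    Returns the action that was actually executed, or None if it was blocked.
    """
    # Intercept and score the proposed next action given the path so far.
    p_violation = policy(agent_id, partial_path, proposed_action, org_state)

    # Decide: allow, modify, or block (thresholds are assumptions).
    if p_violation >= block_at:
        decision, executed = "block", None
    elif p_violation >= modify_at:
        decision, executed = "modify", f"sanitized({proposed_action})"
    else:
        decision, executed = "allow", proposed_action

    # Log every evaluation, including allowed actions.
    audit_log.append({"agent": agent_id, "action": proposed_action,
                      "p_violation": p_violation, "decision": decision})

    # Update organizational state and extend the execution path.
    if decision != "allow":
        org_state.flagged_actions += 1
    if executed is not None:
        partial_path.append(executed)
    return executed
```

Blocked actions never enter `partial_path`, so later evaluations see only what the agent actually did, while the audit log keeps what it attempted.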

Policies can be formalized as deterministic functions p_violation = Policy(agent_id, partial_path, proposed_action, org_state) that return a probability or score of violation for a proposed next action.

Formal definition and mapping in the paper; worked examples showing how regulatory-style constraints map into this function; no large-scale empirical validation.

high positive Runtime Governance for AI Agents: Policies on Paths expressiveness of policy formalism (ability to represent targeted constraints)

Effective governance for agentic LLM systems requires treating the execution path as the central object and performing runtime evaluation of proposed next actions given the partial path.

Theoretical argument and formal proposal of runtime policy evaluator that takes (agent_id, partial_path, proposed_action, org_state) and returns a violation probability; reference architecture described; illustrative examples.

high positive Runtime Governance for AI Agents: Policies on Paths governance effectiveness for path-dependent policies (qualitative/coverage)
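
A small example shows why the path, rather than the single action, is the object being governed. The sketch below is a hypothetical instance of the `Policy(agent_id, partial_path, proposed_action, org_state)` signature for a sensitive-data exfiltration rule. The action naming scheme (`read_sensitive:`, `send_external:`), the `approved_destinations` key, and the returned scores are assumptions for illustration only.

```python
def exfiltration_policy(agent_id, partial_path, proposed_action, org_state):
    """Return a violation score for sending data out after touching sensitive data.

    Deterministic: the same four inputs always produce the same score.
    """
    if not proposed_action.startswith("send_external:"):
        return 0.0  # this rule only concerns outbound transfers

    destination = proposed_action.split(":", 1)[1]
    if destination in org_state.get("approved_destinations", ()):
        return 0.05  # organization has pre-approved this destination

    # Path dependence: the score depends on what happened earlier in the run.
    touched_sensitive = any(step.startswith("read_sensitive:")
                            for step in partial_path)
    return 0.95 if touched_sensitive else 0.2


if __name__ == "__main__":
    action = "send_external:mail.example"
    print(exfiltration_policy("a1", ["read_public:docs"], action, {}))
    print(exfiltration_policy("a1", ["read_sensitive:payroll"], action, {}))
```

The proposed action is identical in both calls; only the preceding path differs, and a rule that inspected actions in isolation could not tell the two runs apart.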

Multiple off-the-shelf vision-language models (closed-source and open-source) representative of current state-of-the-art architectures were benchmarked.

Paper reports experiments across a mix of closed-source and open-source VLMs; exact model names provided in the released materials.

high positive V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... models evaluated (variety and representativeness)

Evaluation targets include correctness, consistency, and update efficacy, operationalized via quantitative metrics (accuracy, consistency rates, update success rate).

Methods section describing evaluation metrics and how correctness, consistency, and update efficacy are measured across experiments.

high positive V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... metrics used: accuracy, consistency rate, update success rate

A curated set of time-sensitive factual items (e.g., officeholders, company statuses, recent awards/results) was used to construct the benchmark.

Benchmark composition description listing categories of time-sensitive facts and methodology for curation of items used in experiments.

high positive V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... composition of benchmark item set

The authors release the V-DyKnow benchmark, code, and evaluation data for community use.

Statement in paper and accompanying release materials indicating benchmark, code, and evaluation data are publicly available.

high positive V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... availability of benchmark, code, and data

V-DyKnow is a benchmark specifically designed to evaluate time-sensitive factual knowledge in vision-language models across both text and image modalities.

Release and description of the benchmark in the paper: curated set of time-sensitive factual items, paired multimodal stimuli (text + images), input perturbations, and evaluation scripts. Methodological description of benchmark composition and tasks.

high positive V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... benchmark existence / capability to evaluate time-sensitive multimodal factual k...

Ethical handling: the study involved sensitive material (self-harm, trauma), and the authors applied validation and careful handling consistent with research ethics.

Ethics section and methods describing sensitivity of material and precautions taken in data handling and validation.

high positive Characterizing Delusional Spirals through Human-LLM Chat Log... ethical procedures applied to sensitive data

Selected coded items (for example, suicidal messages) were validated by the authors to increase reliability of certain critical annotations.

Methods section describing validation procedures applied to selected items such as suicidal ideation.

high positive Characterizing Delusional Spirals through Human-LLM Chat Log... validation status of coded items (e.g., number of validated suicidal messages)

The authors developed and applied a manual codebook of 28 behavioral/phenomenological codes (e.g., delusional thinking, suicidal ideation, chatbot sentience claims, romantic interest) across the full corpus.

Method section describing construction of a 28-code inventory and manual coding applied to entire dataset.

high positive Characterizing Delusional Spirals through Human-LLM Chat Log... existence and application of a 28-code annotation scheme

The surrogate-driven inverse-design pipeline transfers to physical hardware — designs produced by the CNN+GA pipeline were realized and validated experimentally.

Two fabricated prototypes implemented the optimized pixelated combiners and GaN HEMT Doherty PAs; measured performance metrics correspond to the designs, demonstrating transfer from surrogate-driven design to hardware.

high positive Deep Learning-Driven Black-Box Doherty Power Amplifier with ... consistency between surrogate-driven design outputs and measured prototype perfo...

Under a 20 MHz 5G-NR-like waveform (9 dB PAPR) with digital predistortion (DPD), each prototype reached average PAE greater than 51% while meeting ACLR ≤ −60.8 dBc.

Realistic waveform testing described: a 20 MHz 5G‑NR-like signal with 9 dB PAPR was applied to the prototypes, DPD was used, and measurements reported average PAE > 51% and ACLR ≤ −60.8 dBc for each prototype.

high positive Deep Learning-Driven Black-Box Doherty Power Amplifier with ... average power-added efficiency (PAE %) and adjacent channel leakage ratio (ACLR,...

Each prototype demonstrated drain efficiency greater than 52% at 9 dB back-off.

Back-off efficiency measurements reported for the fabricated prototypes showing drain efficiency > 52% at 9 dB back-off.

high positive Deep Learning-Driven Black-Box Doherty Power Amplifier with ... drain efficiency at 9 dB back-off (%)
