Evidence (8542 claims)
- Adoption: 5831 claims
- Productivity: 5063 claims
- Governance: 4582 claims
- Human-AI Collaboration: 3625 claims
- Labor Markets: 2749 claims
- Innovation: 2704 claims
- Org Design: 2667 claims
- Skills & Training: 2126 claims
- Inequality: 1429 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 448 | 118 | 70 | 511 | 1163 |
| Governance & Regulation | 458 | 217 | 125 | 67 | 884 |
| Research Productivity | 274 | 103 | 35 | 303 | 720 |
| Organizational Efficiency | 444 | 106 | 78 | 43 | 675 |
| Technology Adoption Rate | 347 | 130 | 76 | 45 | 603 |
| Firm Productivity | 324 | 39 | 73 | 13 | 454 |
| Output Quality | 273 | 76 | 27 | 30 | 406 |
| AI Safety & Ethics | 122 | 188 | 46 | 27 | 385 |
| Market Structure | 119 | 134 | 86 | 14 | 358 |
| Decision Quality | 182 | 79 | 41 | 20 | 326 |
| Fiscal & Macroeconomic | 95 | 58 | 34 | 22 | 216 |
| Employment Level | 78 | 37 | 80 | 9 | 206 |
| Skill Acquisition | 104 | 37 | 41 | 9 | 191 |
| Innovation Output | 127 | 12 | 26 | 14 | 180 |
| Firm Revenue | 101 | 38 | 24 | — | 163 |
| Task Allocation | 95 | 18 | 36 | 8 | 159 |
| Consumer Welfare | 77 | 38 | 37 | 7 | 159 |
| Inequality Measures | 29 | 81 | 33 | 6 | 149 |
| Regulatory Compliance | 54 | 61 | 13 | 3 | 131 |
| Task Completion Time | 92 | 8 | 4 | 3 | 107 |
| Worker Satisfaction | 49 | 36 | 13 | 8 | 106 |
| Error Rate | 45 | 53 | 6 | — | 104 |
| Training Effectiveness | 60 | 13 | 12 | 16 | 102 |
| Wages & Compensation | 56 | 16 | 20 | 5 | 97 |
| Team Performance | 51 | 13 | 15 | 8 | 88 |
| Automation Exposure | 28 | 29 | 12 | 7 | 79 |
| Job Displacement | 7 | 45 | 13 | — | 65 |
| Hiring & Recruitment | 42 | 4 | 7 | 3 | 56 |
| Developer Productivity | 38 | 5 | 4 | 3 | 50 |
| Social Protection | 22 | 12 | 7 | 2 | 43 |
| Creative Output | 17 | 8 | 6 | 1 | 32 |
| Skill Obsolescence | 3 | 26 | 2 | — | 31 |
| Labor Share of Income | 12 | 7 | 10 | — | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
- The RL fine-tuned Qwen2.5-Coder-7B improves by 33.1% over the same 7B base model without RL fine-tuning.
  - Evidence: Head-to-head comparison between the tuned model and its untuned base across the 48 evaluation briefs; reported improvement of +33.1%.
- Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6's quality.
  - Evidence: Empirical evaluation on 48 diverse business briefs comparing six models; reported relative quality score of the tuned Qwen2.5-Coder-7B = 91.2% of Claude Opus 4.6.
- Managing captures, traces, and replay sessions from a single, unified design database ensures consistency across replay targets and sessions.
  - Evidence: The method description emphasizes a single design database coordinating captures and replays across simulation and emulation for the demonstrator system. (Operational claim demonstrated in the implementation; no metrics on error reduction provided.)
- Captured traces can be deterministically replayed across different execution targets (software/hardware simulation and hardware emulation), reducing cross-platform setup complexity and discrepancies.
  - Evidence: The same captured waveforms/traces were replayed in both simulation and emulation environments for the ODIN demonstrator; cross-target replay was part of the described method. (Demonstrated on the single reported system; no broad cross-toolchain study provided.)
- Using the proportional veto core provides formal protection for minority blocs by giving them proportional blocking power, thus encoding a proportional fairness guarantee compared to simple majoritarian rules.
  - Evidence: Definition and properties of the proportional veto core presented in the paper; conceptual discussion comparing veto/proportionality guarantees to majoritarian outcomes.
- The paper characterizes the information cost of aggregating preferences when AI can generate essentially unlimited candidate alternatives, providing tight sample-complexity upper and lower bounds.
  - Evidence: The combination of sampling-model formalization, sample-complexity upper bounds, and matching lower bounds constitutes a formal characterization of the information (sample) requirements.
- The authors prove an upper bound on the number of samples/queries required by their algorithm as a function of accuracy, confidence, and problem parameters.
  - Evidence: Theoretical analysis deriving explicit sample-complexity upper bounds (stated as functions of accuracy/confidence and relevant parameters).
- Under only query (sampling) access to the unknown joint distribution of voters and alternatives, there is an efficient sampling-based algorithm that, with high probability, returns an alternative in the approximate proportional veto core.
  - Evidence: Constructive algorithm and correctness proof showing the algorithm returns an approximate core alternative with high probability under the sampling-access model.
- The paper formalizes the proportional veto core for settings with an infinite alternative space and voters whose preferences are drawn from an unknown distribution.
  - Evidence: Formal model and definitions presented in the paper: extension of the proportional veto core to an infinite alternative space, plus definitions of a sampling-appropriate approximate proportional veto core.
- Temporally grounding model inputs (constraining models to contemporaneous public information at each node) substantially reduces the risk of training-data leakage and hindsight bias.
  - Evidence: The study design enforced node-specific contemporaneous evidence constraints for each of the 11 nodes; the methodological rationale and comparison to unconstrained settings are described as reducing retrospective information contamination.
- BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization.
  - Evidence: Comparative experiments against rotation-based PTQ techniques and other existing PTQ baselines on the described multimodal and language tasks; improvements shown in benchmark metrics and recovery percentages in the paper's experimental section.
- BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs.
  - Evidence: Empirical evaluation reported in the paper: experiments on multiple multimodal large language models (MLLMs) and standard LLMs using an aggressive W4A4KV16 quantization setup; performance reported as the percentage of full-precision performance recovered (specific models, benchmark names, and exact sample sizes not enumerated in the summary).
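For intuition, W4A4-style setups quantize weights to 4-bit codes with small per-group scales. The sketch below is a generic symmetric per-group 4-bit scheme, not BATQuant's method; the group size and data are illustrative assumptions:

```python
import numpy as np

def quantize_w4_per_group(w, group_size=32):
    """Symmetric per-group 4-bit quantization (generic PTQ baseline sketch).

    Each group of weights shares one floating-point scale; weights are
    rounded to the signed 4-bit grid [-8, 7], then dequantized back.
    """
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # map max |w| to code 7
    q = np.clip(np.round(groups / scale), -8, 7)             # 4-bit integer codes
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_hat = quantize_w4_per_group(w)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)  # typically a few percent
```

Actual MXFP4 microscaling differs (4-bit floating-point elements with a shared per-block scale), and BATQuant adds its own transformations on top; the sketch only conveys the per-group scaling idea behind such formats.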
- The paper provides concrete, regulation-inspired policy examples (e.g., content prohibition, sensitive-data exfiltration) showing how they map into the Policy function.
  - Evidence: Worked, illustrative examples mapping regulatory constraints to the Policy(agent_id, partial_path, proposed_action, org_state) formalism.
- Runtime policy evaluation can intercept, score, log, allow/modify/block actions, and update organizational state as part of an agent's execution loop (reference implementation architecture).
  - Evidence: Reference implementation design described in the paper (runtime policy-evaluator hooks, logging, enforcement actions); architectural reasoning and pseudo-workflows provided; no production deployment data.
- Policies can be formalized as deterministic functions p_violation = Policy(agent_id, partial_path, proposed_action, org_state) that return a probability or score of violation for a proposed next action.
  - Evidence: Formal definition and mapping in the paper; worked examples showing how regulatory-style constraints map into this function; no large-scale empirical validation.
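A minimal sketch of what such a policy function could look like, assuming a dict-based representation of actions and path steps. This is not the paper's implementation: the identifiers and both rules are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class OrgState:
    """Illustrative organizational state consulted at evaluation time."""
    blocked_hosts: set = field(default_factory=lambda: {"drop.example.com"})

def policy(agent_id, partial_path, proposed_action, org_state):
    """Return p_violation in [0, 1] for a proposed next action, given the
    partial execution path so far. Rules are toy stand-ins for the
    regulation-inspired examples (content prohibition, data exfiltration)."""
    score = 0.0
    # Rule 1: outbound calls to explicitly blocked hosts are a certain violation.
    if proposed_action.get("type") == "http_post" and \
       proposed_action.get("host") in org_state.blocked_hosts:
        score = 1.0
    # Rule 2: path-dependent check: secrets were read earlier in the path and
    # the next action sends data out, so likely exfiltration even to an
    # otherwise allowed host.
    if proposed_action.get("type") == "http_post" and \
       any(step.get("tool") == "read_secrets" for step in partial_path):
        score = max(score, 0.9)
    return score

p = policy("agent-7",
           [{"tool": "read_secrets"}],
           {"type": "http_post", "host": "api.example.com"},
           OrgState())
```

The second rule shows why the partial path, not just the proposed action, must be an argument: the same outbound call is benign or suspicious depending on what preceded it.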
- Effective governance for agentic LLM systems requires treating the execution path as the central object and performing runtime evaluation of proposed next actions given the partial path.
  - Evidence: Theoretical argument and formal proposal of a runtime policy evaluator that takes (agent_id, partial_path, proposed_action, org_state) and returns a violation probability; reference architecture described; illustrative examples.
- Multiple off-the-shelf vision-language models (closed-source and open-source) representative of current state-of-the-art architectures were benchmarked.
  - Evidence: The paper reports experiments across a mix of closed-source and open-source VLMs; exact model names are provided in the released materials.
- Evaluation targets include correctness, consistency, and update efficacy, operationalized via quantitative metrics (accuracy, consistency rates, update success rate).
  - Evidence: Methods section describing the evaluation metrics and how correctness, consistency, and update efficacy are measured across experiments.
- A curated set of time-sensitive factual items (e.g., officeholders, company statuses, recent awards/results) was used to construct the benchmark.
  - Evidence: Benchmark-composition description listing the categories of time-sensitive facts and the curation methodology for items used in experiments.
- The authors release the V-DyKnow benchmark, code, and evaluation data for community use.
  - Evidence: Statement in the paper and accompanying release materials indicating the benchmark, code, and evaluation data are publicly available.
- V-DyKnow is a benchmark specifically designed to evaluate time-sensitive factual knowledge in vision-language models across both text and image modalities.
  - Evidence: Release and description of the benchmark in the paper: a curated set of time-sensitive factual items, paired multimodal stimuli (text + images), input perturbations, and evaluation scripts; methodological description of benchmark composition and tasks.
- Ethical handling: the study involved sensitive material (self-harm, trauma), and the authors applied validation and careful handling consistent with research ethics.
  - Evidence: Ethics section and methods describing the sensitivity of the material and the precautions taken in data handling and validation.
- Selected coded items (for example, suicidal messages) were validated by the authors to increase the reliability of critical annotations.
  - Evidence: Methods section describing validation procedures applied to selected items such as suicidal ideation.
- The authors developed and applied a manual codebook of 28 behavioral/phenomenological codes (e.g., delusional thinking, suicidal ideation, chatbot-sentience claims, romantic interest) across the full corpus.
  - Evidence: Methods section describing construction of the 28-code inventory and manual coding applied to the entire dataset.
- The surrogate-driven inverse-design pipeline transfers to physical hardware: designs produced by the CNN+GA pipeline were realized and validated experimentally.
  - Evidence: Two fabricated prototypes implemented the optimized pixelated combiners and GaN HEMT Doherty PAs; measured performance metrics correspond to the designs, demonstrating transfer from surrogate-driven design to hardware.
- Under a 20 MHz 5G-NR-like waveform (9 dB PAPR) with digital predistortion (DPD), each prototype reached average PAE greater than 51% while meeting ACLR ≤ −60.8 dBc.
  - Evidence: Realistic waveform testing: a 20 MHz 5G-NR-like signal with 9 dB PAPR was applied to the prototypes, DPD was used, and measurements reported average PAE > 51% and ACLR ≤ −60.8 dBc for each prototype.
- Each prototype demonstrated drain efficiency greater than 52% at 9 dB back-off.
  - Evidence: Back-off efficiency measurements for the fabricated prototypes showing drain efficiency > 52% at 9 dB back-off.
- Each prototype produced output power exceeding 44.1 dBm at 2.75 GHz.
  - Evidence: Measured output power from RF characterization of the two fabricated prototypes; reported value > 44.1 dBm at the test frequency.
- Each fabricated prototype achieved peak drain efficiency greater than 74%.
  - Evidence: Measured RF characterization of the two prototypes showing peak drain efficiency > 74%; measurements conducted on fabricated hardware at 2.75 GHz.
- A genetic-algorithm (GA) blackbox optimizer paired with the CNN surrogate can effectively search the discrete multi-port pixel-layout space to synthesize output combiners for Doherty amplifiers.
  - Evidence: Method description: the CNN surrogate is embedded in a blackbox Doherty framework and used within a GA to select pixelated combiner layouts; successful designs were produced and taken to fabrication.
- The parallel associative scan enables the reductions required by Newton-style updates across time steps, thereby enabling parallelism across sequence length.
  - Evidence: Algorithmic construction and implementation details in the thesis showing how associative-scan operations aggregate intermediate Jacobian/update information across time; examples provided in the implementation section.
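The scan primitive is easy to illustrate on an affine recurrence h_t = a_t·h_{t-1} + b_t, the kind of linearized subproblem a Newton step produces. Because the pairwise combine below is associative, all prefixes can be computed by a parallel tree reduction; the sequential loop here is only a stand-in for that schedule, and the coefficients are arbitrary illustrative values:

```python
# Composing (a1, b1) then (a2, b2) on h gives a2*(a1*h + b1) + b2,
# i.e. (a2*a1, a2*b1 + b2): an associative operation, so prefix results
# can be combined in any bracketing (enabling a parallel tree reduction).
def combine(x, y):
    a1, b1 = x
    a2, b2 = y
    return (a2 * a1, a2 * b1 + b2)

def inclusive_scan(elems):
    """Inclusive scan under `combine` (sequential stand-in for a parallel scan)."""
    out, acc = [], None
    for e in elems:
        acc = e if acc is None else combine(acc, e)
        out.append(acc)
    return out

coeffs = [(0.5, 1.0), (2.0, -1.0), (1.0, 0.25)]
h0 = 0.0

# Sequential reference: h_t = a_t * h_{t-1} + b_t
h, ref = h0, []
for a, b in coeffs:
    h = a * h + b
    ref.append(h)

# Scan-based: apply each prefix operator directly to h0
scanned = [a * h0 + b for a, b in inclusive_scan(coeffs)]
```

The scan result matches the sequential recurrence at every time step while being computable in O(log T) parallel depth.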
- The thesis proves linear convergence rates for a family of fixed-point/Newton-like solvers, with rates depending on the approximation accuracy and stability properties of the chosen method.
  - Evidence: Mathematical proofs and convergence theorems in the theoretical-analysis section establishing linear rates under stated assumptions (bounds on approximation error, stability metrics).
- Evaluation of dynamical systems can be cast as solving a system of nonlinear equations, enabling parallel solution methods.
  - Evidence: Methodological framing and derivation in the thesis showing that recurrent updates and Markov transitions can be represented as a global nonlinear root-finding problem; algorithmic constructions follow from this representation.
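The recasting can be made concrete on a toy nonlinear recurrence: stack all hidden states into one trajectory H and seek the root of G(H)_t = h_t − f(h_{t−1}, u_t). A plain Jacobi-style fixed-point sweep, used here as a simple stand-in for the thesis's Newton-type solvers, then refreshes every time step simultaneously (the weight, initial state, and nonlinearity are arbitrary choices):

```python
import math

def parallel_fixed_point(u, w=0.5, h0=0.0, iters=50):
    """Solve h_t = tanh(w*h_{t-1} + u_t) for all t at once by iterating on
    the whole trajectory: each sweep updates every time step in parallel.
    (Toy illustration, not the thesis's algorithm.)"""
    H = [0.0] * len(u)
    for _ in range(iters):
        prev = [h0] + H[:-1]                               # shifted trajectory
        H = [math.tanh(w * p + ut) for p, ut in zip(prev, u)]
    return H

u = [1.0, -0.5, 0.25, 0.8]
H = parallel_fixed_point(u)

# Sequential reference recurrence
h, ref = 0.0, []
for ut in u:
    h = math.tanh(0.5 * h + ut)
    ref.append(h)
```

Each sweep propagates correct information one step forward, so at most len(u) sweeps recover the sequential solution here; the point of Newton-style variants with proven linear rates is to converge in far fewer parallel steps.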
- Explicit enforcement of signal constraints in DeePC provides a safety/operational advantage over many pure learning approaches that do not explicitly enforce hard constraints.
  - Evidence: The algorithmic formulation includes constraints in the optimization; the paper contrasts this with unconstrained learning-based controllers and demonstrates constrained, feasible actuation in simulation.
- DeePC can compute traffic-light actuation sequences that respect hard operational and safety constraints (e.g., phasing, minimum/maximum green times).
  - Evidence: Formulation of DeePC as a constrained optimization problem with explicit constraint terms for signal phasing and safety; implemented in simulation experiments where the constraints are enforced in the controller optimization.
- Reframing urban traffic dynamics with behavioral systems theory allows system evolution to be learned and predicted directly from measured input–output data (no explicit model identification).
  - Evidence: Theoretical exposition showing that traffic trajectories can be represented as linear combinations of past measured trajectories via Hankel/data matrices; this is the basis for predictive control (DeePC).
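The Hankel-matrix idea can be checked on a toy LTI system (the system, sizes, and inputs below are invented for illustration; the paper applies this at city scale to traffic dynamics): any fresh length-L input–output trajectory lies in the column span of Hankel matrices built from one recorded, persistently exciting trajectory.

```python
import numpy as np

def hankel(v, L):
    """Depth-L Hankel matrix whose columns are sliding windows of v."""
    return np.column_stack([v[i:i + L] for i in range(len(v) - L + 1)])

rng = np.random.default_rng(1)
a, T, L = 0.8, 60, 6                     # toy scalar system: x+ = a*x + u, y = x

u = rng.normal(size=T)                   # recorded, persistently exciting input
x, y = 0.0, []
for ut in u:                             # simulate the recorded trajectory
    y.append(x)
    x = a * x + ut

H = np.vstack([hankel(u, L), hankel(np.array(y), L)])

# A fresh trajectory from the same system, different input and initial state
u2 = rng.normal(size=L)
x, y2 = 0.3, []
for ut in u2:
    y2.append(x)
    x = a * x + ut
traj = np.concatenate([u2, y2])

g, *_ = np.linalg.lstsq(H, traj, rcond=None)
residual = np.linalg.norm(H @ g - traj)  # ~0: traj lies in the span of H
```

DeePC then optimizes over the combination weights g, subject to cost and constraints on the future portion of the trajectory, instead of first identifying a parametric model and predicting with it.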
- Applying DeePC yields measurable improvements in system-level outcomes (reduced total travel time and CO2 emissions) in a very large, high-fidelity microscopic simulation of Zürich.
  - Evidence: Simulation experiments in a city-scale, high-fidelity microscopic closed-loop simulator of Zürich comparing DeePC-controlled signals against baseline controllers (e.g., fixed-time or standard adaptive schemes); reported reductions in aggregate metrics (total travel time and CO2 emissions).
- A model-free traffic-control approach (DeePC) can steer urban traffic via dynamic traffic-light control without building explicit traffic models.
  - Evidence: Algorithmic/theoretical development (behavioral systems theory + DeePC) and controller-in-the-loop experiments in a high-fidelity microscopic closed-loop simulator of Zürich demonstrating closed-loop control using only input–output trajectory data (Hankel matrices) rather than parametric model identification.
- The model weights will be open (open-weight release) to support European sovereignty and adoption.
  - Evidence: The authors state an intent to publish open weights and position the model as an open-weight European alternative; the summary reports this as a declared objective. The paper likely includes a licensing/availability statement.
- Calibration data must be representative of deployment data to preserve conformal statistical guarantees in practice.
  - Evidence: The theoretical requirement of exchangeability for conformal guarantees, combined with empirical results where mismatched calibration caused guarantee violations or degraded factuality.
- The paper introduces informativeness-aware metrics to measure task utility under conformal filtering, going beyond pure factuality rates.
  - Evidence: Methodological contribution: new metrics that penalize vacuous outputs and quantify retained task utility after filtering.
- Decomposing generated outputs into atomic claims and calibrating a verifier-score threshold on held-out data yields a statistically valid guarantee (under exchangeability) that claims passing the threshold meet a target factuality level.
  - Evidence: Method description and theoretical use of conformal calibration applied to per-claim scores, with a held-out calibration set used to set the threshold; conforms to standard conformal-prediction methodology as presented in the paper.
- Conformal factuality provides distribution-free statistical guarantees for claim-level correctness in retrieval-augmented LLM outputs.
  - Evidence: The paper applies conformal calibration to atomic claims: decompose outputs into atomic claims, score each claim with a verifier, and calibrate a score threshold on held-out (exchangeable) data to guarantee a target claim-level factuality rate. This is a theoretical property of conformal methods described and implemented in the paper.
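A minimal split-conformal sketch of the calibrate-then-filter recipe (synthetic verifier scores and a simplified guarantee construction; the paper's exact scoring and metrics differ). The nonconformity score for each calibration output is the highest verifier score among its incorrect claims, since keeping only claims above that value removes every error in that output:

```python
import math
import random

def calibrate_threshold(cal_outputs, alpha=0.1):
    """Conformal threshold: with probability >= 1 - alpha over exchangeable
    data, every retained claim of a fresh output is correct."""
    scores = []
    for claims in cal_outputs:               # claims: list of (score, is_correct)
        wrong = [s for s, ok in claims if not ok]
        scores.append(max(wrong) if wrong else 0.0)
    scores.sort()
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)   # conformal quantile index
    return scores[k - 1]

def keep_claims(claims, tau):
    return [c for c in claims if c[0] > tau]

random.seed(0)
def synthetic_output():
    # Synthetic verifier: correct claims tend to score high, incorrect ones low.
    return [(random.betavariate(5, 2), True) for _ in range(8)] + \
           [(random.betavariate(2, 5), False) for _ in range(2)]

calibration = [synthetic_output() for _ in range(500)]
tau = calibrate_threshold(calibration, alpha=0.1)
kept = keep_claims(synthetic_output(), tau)
```

The exchangeability requirement in this construction is exactly why calibration data that misrepresents deployment data (the first claim above) can void the guarantee.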
- Traditional machine-learning baselines were included for comparison in the benchmarks.
  - Evidence: The paper explicitly states that traditional ML baselines were used alongside TSFMs in the benchmarking experiments; the summary does not list which baselines or their quantitative results.
- The dataset's sampling resolution is at the millisecond level, enabling forecasting horizons from 1 step (100 ms) up to 96 steps (9.6 s).
  - Evidence: The paper states that sampling resolution is millisecond-level and defines forecasting tasks spanning 1 to 96 steps (100 ms to 9.6 s). This is a methodological description rather than an experimental metric.
- The paper introduces a new millisecond-resolution dataset of wireless-channel and traffic-condition measurements from an operational 5G deployment.
  - Evidence: The paper describes collection of operational 5G telemetry at millisecond sampling resolution; the dataset is presented as a novel domain addition to TSFM pretraining corpora. The exact number of records/sessions is not specified in the provided summary.
- Under pathological label heterogeneity (mutually exclusive local labels), FederatedFactory restores CIFAR-10 classification accuracy from a collapsed baseline of 11.36% to 90.57%.
  - Evidence: Empirical experiment on CIFAR-10 configured as a pathological-heterogeneity stress test; the paper reports a collapsed baseline accuracy of 11.36% and a FederatedFactory result of 90.57%. (Specific sample sizes/client counts not provided in the summary.)
- A single communication round of generative-module exchange suffices for clients to synthesize class-balanced datasets locally and align their training data.
  - Evidence: The paper reports a single exchange of generative modules across clients (one communication round), which each client uses to synthesize a globally class-balanced training set; experiments (CIFAR-10, MedMNIST, ISIC2019) are run under this one-round regime.
- Convergence of the three complementary methods (lexical, paraphrase, behavioral) strengthens confidence that contamination is real and systematically inflates scores.
  - Evidence: Triangulation across Experiment 1 (lexical detection on public corpora), Experiment 2 (paraphrase robustness on a 100-question subset), and Experiment 3 (TS-Guessing on all items); consistent patterns observed across methods.
- All 13 surveyed generative systems report addressing syntactic validity (Layer 1).
  - Evidence: For each of the 13 systems, the review reports syntactic/parse/compile checks or token-level validity tests under Layer 1 in the systematic application of the evaluation framework.
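As a concrete, hypothetical instance of such a Layer-1 check: many systems simply attempt to parse the generated artifact. For Python output this reduces to one standard-library call:

```python
import ast

def is_syntactically_valid(src: str) -> bool:
    """Layer-1 style check: does the generated Python source parse at all?
    (Illustrative; the surveyed systems target various languages and grammars.)"""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

ok = is_syntactically_valid("def f(x):\n    return x + 1\n")
bad = is_syntactically_valid("def f(x: return x\n")
```

Equivalent checks for other targets would swap in the relevant parser or compiler front end.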