Evidence (8542 claims)
- Adoption: 5831 claims
- Productivity: 5063 claims
- Governance: 4582 claims
- Human-AI Collaboration: 3625 claims
- Labor Markets: 2749 claims
- Innovation: 2704 claims
- Org Design: 2667 claims
- Skills & Training: 2126 claims
- Inequality: 1429 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 448 | 118 | 70 | 511 | 1163 |
| Governance & Regulation | 458 | 217 | 125 | 67 | 884 |
| Research Productivity | 274 | 103 | 35 | 303 | 720 |
| Organizational Efficiency | 444 | 106 | 78 | 43 | 675 |
| Technology Adoption Rate | 347 | 130 | 76 | 45 | 603 |
| Firm Productivity | 324 | 39 | 73 | 13 | 454 |
| Output Quality | 273 | 76 | 27 | 30 | 406 |
| AI Safety & Ethics | 122 | 188 | 46 | 27 | 385 |
| Market Structure | 119 | 134 | 86 | 14 | 358 |
| Decision Quality | 182 | 79 | 41 | 20 | 326 |
| Fiscal & Macroeconomic | 95 | 58 | 34 | 22 | 216 |
| Employment Level | 78 | 37 | 80 | 9 | 206 |
| Skill Acquisition | 104 | 37 | 41 | 9 | 191 |
| Innovation Output | 127 | 12 | 26 | 14 | 180 |
| Firm Revenue | 101 | 38 | 24 | — | 163 |
| Task Allocation | 95 | 18 | 36 | 8 | 159 |
| Consumer Welfare | 77 | 38 | 37 | 7 | 159 |
| Inequality Measures | 29 | 81 | 33 | 6 | 149 |
| Regulatory Compliance | 54 | 61 | 13 | 3 | 131 |
| Task Completion Time | 92 | 8 | 4 | 3 | 107 |
| Worker Satisfaction | 49 | 36 | 13 | 8 | 106 |
| Error Rate | 45 | 53 | 6 | — | 104 |
| Training Effectiveness | 60 | 13 | 12 | 16 | 102 |
| Wages & Compensation | 56 | 16 | 20 | 5 | 97 |
| Team Performance | 51 | 13 | 15 | 8 | 88 |
| Automation Exposure | 28 | 29 | 12 | 7 | 79 |
| Job Displacement | 7 | 45 | 13 | — | 65 |
| Hiring & Recruitment | 42 | 4 | 7 | 3 | 56 |
| Developer Productivity | 38 | 5 | 4 | 3 | 50 |
| Social Protection | 22 | 12 | 7 | 2 | 43 |
| Creative Output | 17 | 8 | 6 | 1 | 32 |
| Skill Obsolescence | 3 | 26 | 2 | — | 31 |
| Labor Share of Income | 12 | 7 | 10 | — | 29 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
- The RL fine-tuned Qwen2.5-Coder-7B improves by 33.1% over the same 7B base model without RL fine-tuning.
  - Evidence: Head-to-head comparison between the tuned model and its untuned base across the 48 evaluation briefs; reported improvement of +33.1%.
- Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6's quality.
  - Evidence: Empirical evaluation on 48 diverse business briefs comparing six models; reported relative quality score of the tuned Qwen2.5-Coder-7B = 91.2% of Claude Opus 4.6.
- Managing captures, traces, and replay sessions from a single, unified design database ensures consistency across replay targets and sessions.
  - Evidence: The method description emphasizes a single design database coordinating captures and replays across simulation and emulation for the demonstrator system. (Operational claim demonstrated in the implementation; no metrics on error reduction provided.)
- Captured traces can be deterministically replayed across different execution targets (software/hardware simulation and hardware emulation), reducing cross-platform setup complexity and discrepancies.
  - Evidence: The same captured waveforms/traces were replayed in both simulation and emulation environments for the ODIN demonstrator; cross-target replay was part of the described method. (Demonstrated on the single reported system; no broad cross-toolchain study provided.)
- Using the proportional veto core provides formal protection for minority blocs by giving them proportional blocking power, thus encoding a proportional fairness guarantee compared to simple majoritarian rules.
  - Evidence: Definition and properties of the proportional veto core presented in the paper; conceptual discussion comparing veto/proportionality guarantees to majoritarian outcomes.
- The paper characterizes the information cost of aggregating preferences when AI can generate essentially unlimited candidate alternatives, providing tight sample-complexity upper and lower bounds.
  - Evidence: The combination of sampling-model formalization, sample-complexity upper bounds, and matching lower bounds constitutes a formal characterization of the information (sample) requirements.
- The authors prove an upper bound on the number of samples/queries required by their algorithm as a function of accuracy, confidence, and problem parameters.
  - Evidence: Theoretical analysis deriving explicit sample-complexity upper bounds (stated as functions of accuracy/confidence and relevant parameters).
- Under only query (sampling) access to the unknown joint distribution of voters and alternatives, there is an efficient sampling-based algorithm that, with high probability, returns an alternative in the approximate proportional veto core.
  - Evidence: Constructive algorithm and correctness proof showing the algorithm returns an approximate core alternative with high probability under the sampling-access model.
- The paper formalizes the proportional veto core for settings with an infinite alternative space and voters whose preferences are drawn from an unknown distribution.
  - Evidence: Formal model and definitions presented in the paper: extension of the proportional veto core to an infinite alternative space, plus definitions of a sampling-appropriate approximate proportional veto core.
- Temporally grounding model inputs (constraining models to contemporaneous public information at each node) substantially reduces the risk of training-data leakage and hindsight bias.
  - Evidence: The study design enforced node-specific contemporaneous evidence constraints for each of the 11 nodes; the methodological rationale and comparison to unconstrained settings are described as reducing retrospective information contamination.
- BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization.
  - Evidence: Comparative experiments against rotation-based PTQ techniques and other existing PTQ baselines on the described multimodal and language tasks; improvements shown in benchmark metrics and recovery percentages in the paper's experimental section.
- BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs.
  - Evidence: Empirical evaluation reported in the paper: experiments on multiple multimodal large language models (MLLMs) and standard LLMs using an aggressive W4A4KV16 quantization setup; performance reported as the percentage of full-precision performance recovered (specific models, benchmark names, and exact sample sizes not enumerated in the summary).
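For intuition, W4A4-style setups quantize weights to 4-bit codes with small per-group scales. The sketch below is a generic symmetric per-group 4-bit scheme, not BATQuant's method; the group size and data are illustrative assumptions:

```python
import numpy as np

def quantize_w4_per_group(w, group_size=32):
    """Symmetric per-group 4-bit quantization (generic PTQ baseline sketch).

    Each group of weights shares one floating-point scale; weights are
    rounded to the signed 4-bit grid [-8, 7], then dequantized back.
    """
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # map max |w| to code 7
    q = np.clip(np.round(groups / scale), -8, 7)             # 4-bit integer codes
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
w_hat = quantize_w4_per_group(w)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)  # typically a few percent
```

Actual MXFP4 microscaling differs (4-bit floating-point elements with a shared per-block scale), and BATQuant adds its own transformations on top; the sketch only conveys the per-group scaling idea behind such formats.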
- The paper provides concrete, regulation-inspired policy examples (e.g., content prohibition, sensitive-data exfiltration) showing how they map into the Policy function.
  - Evidence: Worked, illustrative examples mapping regulatory constraints to the Policy(agent_id, partial_path, proposed_action, org_state) formalism.
- Runtime policy evaluation can intercept, score, log, allow/modify/block actions, and update organizational state as part of an agent's execution loop (reference implementation architecture).
  - Evidence: Reference implementation design described in the paper (runtime policy-evaluator hooks, logging, enforcement actions); architectural reasoning and pseudo-workflows provided; no production deployment data.
- Policies can be formalized as deterministic functions p_violation = Policy(agent_id, partial_path, proposed_action, org_state) that return a probability or score of violation for a proposed next action.
  - Evidence: Formal definition and mapping in the paper; worked examples showing how regulatory-style constraints map into this function; no large-scale empirical validation.
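A minimal sketch of what such a policy function could look like, assuming a dict-based representation of actions and path steps. This is not the paper's implementation: the identifiers and both rules are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class OrgState:
    """Illustrative organizational state consulted at evaluation time."""
    blocked_hosts: set = field(default_factory=lambda: {"drop.example.com"})

def policy(agent_id, partial_path, proposed_action, org_state):
    """Return p_violation in [0, 1] for a proposed next action, given the
    partial execution path so far. Rules are toy stand-ins for the
    regulation-inspired examples (content prohibition, data exfiltration)."""
    score = 0.0
    # Rule 1: outbound calls to explicitly blocked hosts are a certain violation.
    if proposed_action.get("type") == "http_post" and \
       proposed_action.get("host") in org_state.blocked_hosts:
        score = 1.0
    # Rule 2: path-dependent check: secrets were read earlier in the path and
    # the next action sends data out, so likely exfiltration even to an
    # otherwise allowed host.
    if proposed_action.get("type") == "http_post" and \
       any(step.get("tool") == "read_secrets" for step in partial_path):
        score = max(score, 0.9)
    return score

p = policy("agent-7",
           [{"tool": "read_secrets"}],
           {"type": "http_post", "host": "api.example.com"},
           OrgState())
```

The second rule shows why the partial path, not just the proposed action, must be an argument: the same outbound call is benign or suspicious depending on what preceded it.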
- Effective governance for agentic LLM systems requires treating the execution path as the central object and performing runtime evaluation of proposed next actions given the partial path.
  - Evidence: Theoretical argument and formal proposal of a runtime policy evaluator that takes (agent_id, partial_path, proposed_action, org_state) and returns a violation probability; reference architecture described; illustrative examples.
- Multiple off-the-shelf vision-language models (closed-source and open-source) representative of current state-of-the-art architectures were benchmarked.
  - Evidence: The paper reports experiments across a mix of closed-source and open-source VLMs; exact model names are provided in the released materials.
- Evaluation targets include correctness, consistency, and update efficacy, operationalized via quantitative metrics (accuracy, consistency rates, update success rate).
  - Evidence: Methods section describing the evaluation metrics and how correctness, consistency, and update efficacy are measured across experiments.
- A curated set of time-sensitive factual items (e.g., officeholders, company statuses, recent awards/results) was used to construct the benchmark.
  - Evidence: Benchmark-composition description listing the categories of time-sensitive facts and the curation methodology for items used in experiments.
- The authors release the V-DyKnow benchmark, code, and evaluation data for community use.
  - Evidence: Statement in the paper and accompanying release materials indicating the benchmark, code, and evaluation data are publicly available.
- V-DyKnow is a benchmark specifically designed to evaluate time-sensitive factual knowledge in vision-language models across both text and image modalities.
  - Evidence: Release and description of the benchmark in the paper: a curated set of time-sensitive factual items, paired multimodal stimuli (text + images), input perturbations, and evaluation scripts; methodological description of benchmark composition and tasks.
- Ethical handling: the study involved sensitive material (self-harm, trauma), and the authors applied validation and careful handling consistent with research ethics.
  - Evidence: Ethics section and methods describing the sensitivity of the material and the precautions taken in data handling and validation.
- Selected coded items (for example, suicidal messages) were validated by the authors to increase the reliability of critical annotations.
  - Evidence: Methods section describing validation procedures applied to selected items such as suicidal ideation.
- The authors developed and applied a manual codebook of 28 behavioral/phenomenological codes (e.g., delusional thinking, suicidal ideation, chatbot-sentience claims, romantic interest) across the full corpus.
  - Evidence: Methods section describing construction of the 28-code inventory and manual coding applied to the entire dataset.
- The surrogate-driven inverse-design pipeline transfers to physical hardware: designs produced by the CNN+GA pipeline were realized and validated experimentally.
  - Evidence: Two fabricated prototypes implemented the optimized pixelated combiners and GaN HEMT Doherty PAs; measured performance metrics correspond to the designs, demonstrating transfer from surrogate-driven design to hardware.
- Under a 20 MHz 5G-NR-like waveform (9 dB PAPR) with digital predistortion (DPD), each prototype reached average PAE greater than 51% while meeting ACLR ≤ −60.8 dBc.
  - Evidence: Realistic waveform testing: a 20 MHz 5G-NR-like signal with 9 dB PAPR was applied to the prototypes, DPD was used, and measurements reported average PAE > 51% and ACLR ≤ −60.8 dBc for each prototype.
- Each prototype demonstrated drain efficiency greater than 52% at 9 dB back-off.
  - Evidence: Back-off efficiency measurements for the fabricated prototypes showing drain efficiency > 52% at 9 dB back-off.
- Each prototype produced output power exceeding 44.1 dBm at 2.75 GHz.
  - Evidence: Measured output power from RF characterization of the two fabricated prototypes; reported value > 44.1 dBm at the test frequency.
- Each fabricated prototype achieved peak drain efficiency greater than 74%.
  - Evidence: Measured RF characterization of the two prototypes showing peak drain efficiency > 74%; measurements conducted on fabricated hardware at 2.75 GHz.
- A genetic-algorithm (GA) blackbox optimizer paired with the CNN surrogate can effectively search the discrete multi-port pixel-layout space to synthesize output combiners for Doherty amplifiers.
  - Evidence: Method description: the CNN surrogate is embedded in a blackbox Doherty framework and used within a GA to select pixelated combiner layouts; successful designs were produced and taken to fabrication.
- The parallel associative scan enables the reductions required by Newton-style updates across time steps, thereby enabling parallelism across sequence length.
  - Evidence: Algorithmic construction and implementation details in the thesis showing how associative-scan operations aggregate intermediate Jacobian/update information across time; examples provided in the implementation section.
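The scan primitive is easy to illustrate on an affine recurrence h_t = a_t·h_{t-1} + b_t, the kind of linearized subproblem a Newton step produces. Because the pairwise combine below is associative, all prefixes can be computed by a parallel tree reduction; the sequential loop here is only a stand-in for that schedule, and the coefficients are arbitrary illustrative values:

```python
# Composing (a1, b1) then (a2, b2) on h gives a2*(a1*h + b1) + b2,
# i.e. (a2*a1, a2*b1 + b2): an associative operation, so prefix results
# can be combined in any bracketing (enabling a parallel tree reduction).
def combine(x, y):
    a1, b1 = x
    a2, b2 = y
    return (a2 * a1, a2 * b1 + b2)

def inclusive_scan(elems):
    """Inclusive scan under `combine` (sequential stand-in for a parallel scan)."""
    out, acc = [], None
    for e in elems:
        acc = e if acc is None else combine(acc, e)
        out.append(acc)
    return out

coeffs = [(0.5, 1.0), (2.0, -1.0), (1.0, 0.25)]
h0 = 0.0

# Sequential reference: h_t = a_t * h_{t-1} + b_t
h, ref = h0, []
for a, b in coeffs:
    h = a * h + b
    ref.append(h)

# Scan-based: apply each prefix operator directly to h0
scanned = [a * h0 + b for a, b in inclusive_scan(coeffs)]
```

The scan result matches the sequential recurrence at every time step while being computable in O(log T) parallel depth.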
- The thesis proves linear convergence rates for a family of fixed-point/Newton-like solvers, with rates depending on the approximation accuracy and stability properties of the chosen method.
  - Evidence: Mathematical proofs and convergence theorems in the theoretical-analysis section establishing linear rates under stated assumptions (bounds on approximation error, stability metrics).
- Evaluation of dynamical systems can be cast as solving a system of nonlinear equations, enabling parallel solution methods.
  - Evidence: Methodological framing and derivation in the thesis showing that recurrent updates and Markov transitions can be represented as a global nonlinear root-finding problem; algorithmic constructions follow from this representation.
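The recasting can be made concrete on a toy nonlinear recurrence: stack all hidden states into one trajectory H and seek the root of G(H)_t = h_t − f(h_{t−1}, u_t). A plain Jacobi-style fixed-point sweep, used here as a simple stand-in for the thesis's Newton-type solvers, then refreshes every time step simultaneously (the weight, initial state, and nonlinearity are arbitrary choices):

```python
import math

def parallel_fixed_point(u, w=0.5, h0=0.0, iters=50):
    """Solve h_t = tanh(w*h_{t-1} + u_t) for all t at once by iterating on
    the whole trajectory: each sweep updates every time step in parallel.
    (Toy illustration, not the thesis's algorithm.)"""
    H = [0.0] * len(u)
    for _ in range(iters):
        prev = [h0] + H[:-1]                               # shifted trajectory
        H = [math.tanh(w * p + ut) for p, ut in zip(prev, u)]
    return H

u = [1.0, -0.5, 0.25, 0.8]
H = parallel_fixed_point(u)

# Sequential reference recurrence
h, ref = 0.0, []
for ut in u:
    h = math.tanh(0.5 * h + ut)
    ref.append(h)
```

Each sweep propagates correct information one step forward, so at most len(u) sweeps recover the sequential solution here; the point of Newton-style variants with proven linear rates is to converge in far fewer parallel steps.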
- Explicit enforcement of signal constraints in DeePC provides a safety/operational advantage over many pure learning approaches that do not explicitly enforce hard constraints.
  - Evidence: The algorithmic formulation includes constraints in the optimization; the paper contrasts this with unconstrained learning-based controllers and demonstrates constrained, feasible actuation in simulation.
- DeePC can compute traffic-light actuation sequences that respect hard operational and safety constraints (e.g., phasing, minimum/maximum green times).
  - Evidence: Formulation of DeePC as a constrained optimization problem with explicit constraint terms for signal phasing and safety; implemented in simulation experiments where the constraints are enforced in the controller optimization.
- Reframing urban traffic dynamics with behavioral systems theory allows system evolution to be learned and predicted directly from measured input–output data (no explicit model identification).
  - Evidence: Theoretical exposition showing that traffic trajectories can be represented as linear combinations of past measured trajectories via Hankel/data matrices; this is the basis for predictive control (DeePC).
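The Hankel-matrix idea can be checked on a toy LTI system (the system, sizes, and inputs below are invented for illustration; the paper applies this at city scale to traffic dynamics): any fresh length-L input–output trajectory lies in the column span of Hankel matrices built from one recorded, persistently exciting trajectory.

```python
import numpy as np

def hankel(v, L):
    """Depth-L Hankel matrix whose columns are sliding windows of v."""
    return np.column_stack([v[i:i + L] for i in range(len(v) - L + 1)])

rng = np.random.default_rng(1)
a, T, L = 0.8, 60, 6                     # toy scalar system: x+ = a*x + u, y = x

u = rng.normal(size=T)                   # recorded, persistently exciting input
x, y = 0.0, []
for ut in u:                             # simulate the recorded trajectory
    y.append(x)
    x = a * x + ut

H = np.vstack([hankel(u, L), hankel(np.array(y), L)])

# A fresh trajectory from the same system, different input and initial state
u2 = rng.normal(size=L)
x, y2 = 0.3, []
for ut in u2:
    y2.append(x)
    x = a * x + ut
traj = np.concatenate([u2, y2])

g, *_ = np.linalg.lstsq(H, traj, rcond=None)
residual = np.linalg.norm(H @ g - traj)  # ~0: traj lies in the span of H
```

DeePC then optimizes over the combination weights g, subject to cost and constraints on the future portion of the trajectory, instead of first identifying a parametric model and predicting with it.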
- Applying DeePC yields measurable improvements in system-level outcomes (reduced total travel time and CO2 emissions) in a very large, high-fidelity microscopic simulation of Zürich.
  - Evidence: Simulation experiments in a city-scale, high-fidelity microscopic closed-loop simulator of Zürich comparing DeePC-controlled signals against baseline controllers (e.g., fixed-time or standard adaptive schemes); reported reductions in aggregate metrics (total travel time and CO2 emissions).
- A model-free traffic-control approach (DeePC) can steer urban traffic via dynamic traffic-light control without building explicit traffic models.
  - Evidence: Algorithmic/theoretical development (behavioral systems theory + DeePC) and controller-in-the-loop experiments in a high-fidelity microscopic closed-loop simulator of Zürich demonstrating closed-loop control using only input–output trajectory data (Hankel matrices) rather than parametric model identification.
- The model weights will be open (open-weight release) to support European sovereignty and adoption.
  - Evidence: The authors state an intent to publish open weights and position the model as an open-weight European alternative; the summary reports this as a declared objective. The paper likely includes a licensing/availability statement.
- Calibration data must be representative of deployment data to preserve conformal statistical guarantees in practice.
  - Evidence: The theoretical requirement of exchangeability for conformal guarantees, combined with empirical results where mismatched calibration caused guarantee violations or degraded factuality.
- The paper introduces informativeness-aware metrics to measure task utility under conformal filtering, going beyond pure factuality rates.
  - Evidence: Methodological contribution: new metrics that penalize vacuous outputs and quantify retained task utility after filtering.
- Decomposing generated outputs into atomic claims and calibrating a verifier-score threshold on held-out data yields a statistically valid guarantee (under exchangeability) that claims passing the threshold meet a target factuality level.
  - Evidence: Method description and theoretical use of conformal calibration applied to per-claim scores, with a held-out calibration set used to set the threshold; conforms to standard conformal-prediction methodology as presented in the paper.
- Conformal factuality provides distribution-free statistical guarantees for claim-level correctness in retrieval-augmented LLM outputs.
  - Evidence: The paper applies conformal calibration to atomic claims: decompose outputs into atomic claims, score each claim with a verifier, and calibrate a score threshold on held-out (exchangeable) data to guarantee a target claim-level factuality rate. This is a theoretical property of conformal methods described and implemented in the paper.
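A minimal split-conformal sketch of the calibrate-then-filter recipe (synthetic verifier scores and a simplified guarantee construction; the paper's exact scoring and metrics differ). The nonconformity score for each calibration output is the highest verifier score among its incorrect claims, since keeping only claims above that value removes every error in that output:

```python
import math
import random

def calibrate_threshold(cal_outputs, alpha=0.1):
    """Conformal threshold: with probability >= 1 - alpha over exchangeable
    data, every retained claim of a fresh output is correct."""
    scores = []
    for claims in cal_outputs:               # claims: list of (score, is_correct)
        wrong = [s for s, ok in claims if not ok]
        scores.append(max(wrong) if wrong else 0.0)
    scores.sort()
    n = len(scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)   # conformal quantile index
    return scores[k - 1]

def keep_claims(claims, tau):
    return [c for c in claims if c[0] > tau]

random.seed(0)
def synthetic_output():
    # Synthetic verifier: correct claims tend to score high, incorrect ones low.
    return [(random.betavariate(5, 2), True) for _ in range(8)] + \
           [(random.betavariate(2, 5), False) for _ in range(2)]

calibration = [synthetic_output() for _ in range(500)]
tau = calibrate_threshold(calibration, alpha=0.1)
kept = keep_claims(synthetic_output(), tau)
```

The exchangeability requirement in this construction is exactly why calibration data that misrepresents deployment data (the first claim above) can void the guarantee.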
- Traditional machine-learning baselines were included for comparison in the benchmarks.
  - Evidence: The paper explicitly states that traditional ML baselines were used alongside TSFMs in the benchmarking experiments; the summary does not list which baselines or their quantitative results.
- The dataset's sampling resolution is at the millisecond level, enabling forecasting horizons from 1 step (100 ms) up to 96 steps (9.6 s).
  - Evidence: The paper states that sampling resolution is millisecond-level and defines forecasting tasks spanning 1 to 96 steps (100 ms to 9.6 s). This is a methodological description rather than an experimental metric.
- The paper introduces a new millisecond-resolution dataset of wireless-channel and traffic-condition measurements from an operational 5G deployment.
  - Evidence: The paper describes collection of operational 5G telemetry at millisecond sampling resolution; the dataset is presented as a novel domain addition to TSFM pretraining corpora. The exact number of records/sessions is not specified in the provided summary.
- Under pathological label heterogeneity (mutually exclusive local labels), FederatedFactory restores CIFAR-10 classification accuracy from a collapsed baseline of 11.36% to 90.57%.
  - Evidence: Empirical experiment on CIFAR-10 configured as a pathological-heterogeneity stress test; the paper reports a collapsed baseline accuracy of 11.36% and a FederatedFactory result of 90.57%. (Specific sample sizes/client counts not provided in the summary.)
- A single communication round of generative-module exchange suffices for clients to synthesize class-balanced datasets locally and align their training data.
  - Evidence: The paper reports a single exchange of generative modules across clients (one communication round), which each client uses to synthesize a globally class-balanced training set; experiments (CIFAR-10, MedMNIST, ISIC2019) are run under this one-round regime.
- Convergence of the three complementary methods (lexical, paraphrase, behavioral) strengthens confidence that contamination is real and systematically inflates scores.
  - Evidence: Triangulation across Experiment 1 (lexical detection on public corpora), Experiment 2 (paraphrase robustness on a 100-question subset), and Experiment 3 (TS-Guessing on all items); consistent patterns observed across methods.
- All 13 surveyed generative systems report addressing syntactic validity (Layer 1).
  - Evidence: For each of the 13 systems, the review reports syntactic/parse/compile checks or token-level validity tests under Layer 1 in the systematic application of the evaluation framework.
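As a concrete, hypothetical instance of such a Layer-1 check: many systems simply attempt to parse the generated artifact. For Python output this reduces to one standard-library call:

```python
import ast

def is_syntactically_valid(src: str) -> bool:
    """Layer-1 style check: does the generated Python source parse at all?
    (Illustrative; the surveyed systems target various languages and grammars.)"""
    try:
        ast.parse(src)
        return True
    except SyntaxError:
        return False

ok = is_syntactically_valid("def f(x):\n    return x + 1\n")
bad = is_syntactically_valid("def f(x: return x\n")
```

Equivalent checks for other targets would swap in the relevant parser or compiler front end.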