Evidence (7953 claims)

Claim counts by topic (topics overlap, so per-topic counts sum to more than the total):
- Adoption: 5539 claims
- Productivity: 4793 claims
- Governance: 4333 claims
- Human-AI Collaboration: 3326 claims
- Labor Markets: 2657 claims
- Innovation: 2510 claims
- Org Design: 2469 claims
- Skills & Training: 2017 claims
- Inequality: 1378 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 402 | 112 | 67 | 480 | 1076 |
| Governance & Regulation | 402 | 192 | 122 | 62 | 790 |
| Research Productivity | 249 | 98 | 34 | 311 | 697 |
| Organizational Efficiency | 395 | 95 | 70 | 40 | 603 |
| Technology Adoption Rate | 321 | 126 | 73 | 39 | 564 |
| Firm Productivity | 306 | 39 | 70 | 12 | 432 |
| Output Quality | 256 | 66 | 25 | 28 | 375 |
| AI Safety & Ethics | 116 | 177 | 44 | 24 | 363 |
| Market Structure | 107 | 128 | 85 | 14 | 339 |
| Decision Quality | 177 | 76 | 38 | 20 | 315 |
| Fiscal & Macroeconomic | 89 | 58 | 33 | 22 | 209 |
| Employment Level | 77 | 34 | 80 | 9 | 202 |
| Skill Acquisition | 92 | 33 | 40 | 9 | 174 |
| Innovation Output | 120 | 12 | 23 | 12 | 168 |
| Firm Revenue | 98 | 34 | 22 | — | 154 |
| Consumer Welfare | 73 | 31 | 37 | 7 | 148 |
| Task Allocation | 84 | 16 | 33 | 7 | 140 |
| Inequality Measures | 25 | 77 | 32 | 5 | 139 |
| Regulatory Compliance | 54 | 63 | 13 | 3 | 133 |
| Error Rate | 44 | 51 | 6 | — | 101 |
| Task Completion Time | 88 | 5 | 4 | 3 | 100 |
| Training Effectiveness | 58 | 12 | 12 | 16 | 99 |
| Worker Satisfaction | 47 | 32 | 11 | 7 | 97 |
| Wages & Compensation | 53 | 15 | 20 | 5 | 93 |
| Team Performance | 47 | 12 | 15 | 7 | 82 |
| Automation Exposure | 24 | 22 | 9 | 6 | 62 |
| Job Displacement | 6 | 38 | 13 | — | 57 |
| Hiring & Recruitment | 41 | 4 | 6 | 3 | 54 |
| Developer Productivity | 34 | 4 | 3 | 1 | 42 |
| Social Protection | 22 | 10 | 6 | 2 | 40 |
| Creative Output | 16 | 7 | 5 | 1 | 29 |
| Labor Share of Income | 12 | 5 | 9 | — | 26 |
| Skill Obsolescence | 3 | 20 | 2 | — | 25 |
| Worker Turnover | 10 | 12 | — | 3 | 25 |
LEAFE converts rich environment feedback into actionable corrective supervision rather than optimizing only final success signals, which drives performance gains.
Algorithmic description: LEAFE summarizes error messages/intermediate observations into experience items, backtracks to causal decision points, explores corrective branches, and distills corrected trajectories via supervised fine-tuning. Empirical comparisons show improved Pass@k relative to reward-only/outcome-driven baselines.
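The loop described above can be sketched roughly as follows. All function names, data shapes, and the dict-based observation format are illustrative assumptions, not the paper's implementation:

```python
# Hedged sketch of a LEAFE-style correction loop: summarize feedback,
# backtrack to a causal decision point, explore corrective branches.

def summarize_feedback(observations):
    """Condense error messages / intermediate observations into experience items."""
    return [o for o in observations if o.get("error")]

def backtrack(trajectory, experience):
    """Locate the earliest decision step implicated by the experience items."""
    implicated = {e["step"] for e in experience}
    return min(implicated) if implicated else len(trajectory)

def explore_corrections(trajectory, branch_point, n_branches=3):
    """Re-roll alternative actions from the causal decision point onward."""
    prefix = trajectory[:branch_point]
    return [prefix + [f"alt_action_{i}"] for i in range(n_branches)]

def leafe_step(trajectory, observations):
    experience = summarize_feedback(observations)
    branch_point = backtrack(trajectory, experience)
    # Successful corrected branches would then be distilled via supervised
    # fine-tuning rather than rewarded only on final task success.
    return explore_corrections(trajectory, branch_point)

trajectory = ["a0", "a1", "a2"]
observations = [{"step": 1, "error": "tool call failed"}]
branches = leafe_step(trajectory, observations)
print(len(branches), branches[0])
```

The key contrast with outcome-only training is visible in the structure: the feedback is turned into a localized correction target rather than a scalar end-of-episode reward.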
Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies.
Release of SlideRL dataset (288 rollouts) and code repository; general statement about reproducibility benefits.
Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling.
Observed empirical gain of +33.1% for the tuned 7B over its untuned base, with performance reaching 91.2% of Claude Opus 4.6; the implication drawn is that tuning a small fraction of parameters can be more cost-effective than scaling model size.
The inverse-specification reward—where an LLM attempts to recover the original brief from generated slides—provides a holistic fidelity signal.
Reward design: inverse-specification component implemented and used as part of composite reward; claimed to measure fidelity via recovery accuracy.
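A minimal sketch of such an inverse-specification reward, assuming it is scored as token overlap (F1) between the original brief and a brief reconstructed from the generated slides. `recover_brief` is a hypothetical stand-in for the judge-LLM call; none of these names come from the paper:

```python
# Sketch: reward = similarity(original brief, brief recovered from slides).

def recover_brief(slides):
    # A real system would prompt an LLM with the rendered slides here.
    return " ".join(s["title"] for s in slides)

def token_f1(reference, candidate):
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    overlap = len(ref & cand)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

def inverse_spec_reward(brief, slides):
    return token_f1(brief, recover_brief(slides))

slides = [{"title": "Quarterly revenue growth"}, {"title": "Regional breakdown"}]
print(inverse_spec_reward("revenue growth by region", slides))
```

The design intent is that a deck from which the brief cannot be recovered earns a low fidelity score, regardless of how polished the slides look.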
Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.
Cross-model comparison across six models on the 48-task benchmark, with analyses showing instruction adherence and tool-use compliance better predict agent performance than parameter count.
Adoption will shift labor demand toward expertise in deterministic capture/replay tooling, trace analytics, and integration automation.
Economic/organizational implication discussed in the summary; no employment-data analysis provided—stated as an expected change in skill demand.
The approach improves utilization and ROI of expensive emulation/simulation resources by enabling reuse of deterministic traces across platforms.
Implication drawn from being able to replay identical traces on both simulator and emulator; no direct financial ROI calculation or utilization metrics provided in the summary.
Using replay-driven validation markedly shortens integration and debug cycles for the demonstrated chiplet subsystem, enabling end-to-end system boot and workload execution within a single quarter.
Reported outcome for the ODIN SoC building block: authors state they were able to reach full system boot and run workloads within one quarter of integration using the methodology. (Single-case timeline reported; no control/comparison group or statistical analysis provided.)
Replay-driven validation made previously hard-to-reproduce interactions and bugs deterministic and repeatable at system level, enabling more focused and efficient debug.
Authors report that deterministic capture/replay converted non-deterministic protocol interactions and transient bugs into repeatable traces that could be inspected and debugged; examples include complex GPU workloads and protocol sequences reproduced end-to-end. (Qualitative/process-level evidence from the demonstrator; no numerical bug-count reduction provided.)
A replay-driven validation methodology using deterministic waveform capture and replay from a single design database enables reliable, repeatable system-level reproduction of complex GPU workloads and protocol sequences for tightly coupled CPU–GPU chiplet subsystems.
Applied to a demonstrator SoC building block (ODIN chiplet architecture) integrating a CPU subsystem, multiple Intel Xe GPU cores, and a configurable NoC; deterministic waveform capture during execution and deterministic replay of those waveforms across targets was performed; same design database used to manage captures, traces, and replay sessions. (No large-sample statistical evaluation reported; demonstration limited to the described system.)
Overall conclusion: forecast-then-execute (anticipatory trajectory reasoning) is an effective principle for building multimodal agents capable of reasoning, planning, and acting in complex environments.
Paper's Conclusion in the provided summary asserts this, based on the reported experimental comparisons and the two-stage TraceR1 framework.
The paper reports improvements in planning stability (consistency of multi-step plans), execution robustness (success under environment/tool variability), and generalization (out-of-distribution tasks and unseen tool/environment states).
Reported outcomes in the summary explicitly list these three improvement categories; the specific metrics and magnitudes are not provided in the summary.
Compared to reactive agents that optimize actions stepwise without trajectory anticipation, TraceR1 yields better multi-step planning and execution.
Baselines & comparisons described in the summary include reactive agents; the paper reports improvements of TraceR1 relative to these baselines across the benchmarks (no numeric values in the provided text).
Explicit anticipatory (trajectory-level) reasoning is a crucial design principle for reliable multi-step task performance in complex real-world environments.
Paper reports comparisons between anticipatory (trajectory-forecasting) agents and reactive / single-stage baselines, concluding the anticipatory design yields better multi-step reliability; exact experimental details and statistics not included in the provided summary.
TraceR1 materially improves planning coherence, execution robustness, and generalization in multimodal, tool-using agents versus reactive or single-stage baselines.
Reported evaluation across seven benchmarks (online and offline computer-use, multimodal tool-use reasoning) comparing TraceR1 to reactive agents and single-stage RL baselines; summary states 'substantial gains' though no numerical results are provided in the provided text.
The proposed algorithm's performance is robust to heterogeneous populations in the synthetic experiments (i.e., it continues to find core alternatives under varying degrees of population heterogeneity).
Empirical robustness checks reported in the experiments where population heterogeneity is varied and performance (core-attainment frequency) is evaluated.
The authors compare their sampling algorithm against classical social-choice rules and LLM-based heuristics and report superior core-attainment frequency for their method.
Experimental comparisons described in the paper between the proposed algorithm and baseline methods (classical social-choice rules, LLM-based heuristics) on the synthetic dataset; results summarized in the experiments section.
On a synthetic text-preference dataset, the proposed algorithm reliably finds alternatives that lie in the proportional veto core.
Empirical experiments reported in the paper using a synthetic dataset of text preferences; evaluation metric reported as frequency (proportion) of runs where the returned alternative is in the proportional veto core.
Temporal grounding (restricting models to contemporaneous information) should be adopted as a methodological best practice in economic research using LLMs to avoid leakage and produce more realistic assessments of model forecasting ability.
Study methodology and rationale emphasize temporal grounding; authors recommend it as best practice based on the observed benefits in reducing retrospective contamination.
Because the conflict unfolded after the training cutoffs of contemporary frontier LLMs, the dataset and analyses provide an archival, hindsight-free benchmark for studying model reasoning.
Case selection rationale: the 2026 Middle East conflict was deliberately chosen because it occurred after the training cutoffs of the evaluated frontier models; dataset preserves contemporaneous queries and model outputs.
Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric).
Evaluation across 11 temporally defined nodes during the early 2026 Middle East conflict, using 42 node-specific verifiable questions and 5 exploratory prompts; outputs from contemporary frontier LLMs, constrained to contemporaneous information, were assessed via verifiability checks and qualitative coding for strategic reasoning.
BATQuant establishes new state-of-the-art results across multimodal benchmarks for MXFP4-aware PTQ under aggressive quantization.
Comparative benchmark results reported in the paper showing BATQuant outperforming prior PTQ methods on the described multimodal benchmarks (specific benchmark names and quantitative margins not provided in the summary).
Ablation analyses show that each BATQuant component (block-wise transforms, orthogonality relaxation, GPK decomposition, block-wise clipping) contributes to robustness and efficiency.
Reported ablation studies isolating components and measuring their individual impact on performance and overhead in the paper's experiments (exact effect sizes and per-component numbers not given in the summary).
Block-wise learnable clipping suppresses residual outliers locally and contributes to robustness under aggressive MXFP4 quantization.
Method description and ablation experiments in the paper showing incremental improvement when adding block-wise learnable clipping layers versus not using them; improvements measured on benchmark metrics post-quantization.
Global and Private Kronecker (GPK) decomposition compresses transform parameters, keeping storage and runtime overhead low compared to dense per-block transforms.
Algorithmic contribution described in the paper with reported comparisons (storage/runtime overhead) versus dense per-block transform parameterizations; supported by experimental/implementation measurements (specific memory/runtime numbers not provided in the summary).
Relaxing orthogonality constraints on transforms (i.e., using non-strictly-orthogonal transforms) improves distribution shaping and better fits activations to the limited MXFP quantization range.
Design rationale and ablation studies reported in the paper showing that removing strict orthogonality yields better quantization fit and improved task metrics versus enforced orthogonal transforms.
Aligning transforms to MXFP block granularity using block-wise affine transformations prevents cross-block outlier propagation and avoids the severe collapse seen with rotation-based integer quantization techniques.
Methodological design plus ablation/empirical results in the paper showing improved activation statistics and preserved model accuracy when using block-wise affine transforms aligned to MXFP blocks versus global rotations.
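The block-granularity ideas above can be illustrated with a toy example. This is a simplification, not BATQuant's method: a signed 4-bit grid with one shared scale per block stands in for the MXFP4 format, and a fixed clip value stands in for the learnable clipping. The point it demonstrates is that clipping a local outlier preserves accuracy for the rest of its block, since the block's shared scale no longer stretches to cover the outlier:

```python
import numpy as np

# Toy block-wise quantization with per-block clipping (illustrative only).
BLOCK = 32  # MXFP-style block size

def quantize_block(x, clip):
    xc = np.clip(x, -clip, clip)                 # per-block clipping
    scale = max(np.abs(xc).max(), 1e-12) / 7     # shared scale, 4-bit signed grid
    return np.round(xc / scale) * scale

def blockwise_quantize(x, clips):
    blocks = x.reshape(-1, BLOCK)
    out = np.stack([quantize_block(b, c) for b, c in zip(blocks, clips)])
    return out.reshape(x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=64)
x[3] = 50.0  # inject an outlier into the first block

err_noclip = np.abs(blockwise_quantize(x, [100.0, 100.0]) - x)
err_clip = np.abs(blockwise_quantize(x, [3.0, 100.0]) - x)

# Compare error on all values except the clipped outlier itself:
mask = np.ones(64, dtype=bool)
mask[3] = False
print(err_clip[mask].mean() < err_noclip[mask].mean())
```

Because blocks are quantized independently, the outlier's influence is also contained to its own block, mirroring the cross-block isolation argument made for block-aligned transforms.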
Standardized runtime governance frameworks could lower per-deployment compliance engineering costs and increase diffusion of agentic systems.
Theoretical argument that standardization reduces transaction/engineering costs; suggested market dynamics; no empirical implementation evidence.
A market will develop for third-party governance tools, auditors, and insurers providing policy evaluators, risk calibration, and certification services.
Economic argument and analogy to existing markets (governance-as-a-service, insurance); no empirical evidence presented.
Benchmarking time-sensitivity (via V-DyKnow) can inform procurement decisions: buyers should assess models on their ability to handle temporally sensitive information, not just static benchmarks.
Paper's recommendations and implications section arguing for procurement practices informed by V-DyKnow evaluations.
The authors provide an operational inventory and conversation-analysis tool (the 28-code instrument) that can be reused for monitoring and mitigation by researchers, firms, and regulators.
Paper includes the codebook and describes its application as a re-usable monitoring/analysis instrument; proposed adoption discussed in implications.
This is the first empirical, message-level study of verified chatbot-related psychological-harm cases (as opposed to speculative discussion).
Authors' positioning in paper; claim of novelty based on review of prior literature and their message-level, verified-case approach.
The authors synthesized complex three-port pixelated output combiners that extend efficiency over back-off using fully symmetrical device implementations.
Design novelty claimed in paper; the resulting three-port pixelated combiner layouts were included in the optimization output and used in prototypes, which employed fully symmetrical device implementations.
The CNN EM surrogate enables orders-of-magnitude faster evaluations than full-wave EM simulation, enabling global search of the discrete pixel design space.
Authors state the surrogate provides orders-of-magnitude speedups compared to full-wave EM, enabling global search; no quantitative speedup numbers or benchmarking details are provided in the provided summary.
A deep convolutional neural network (CNN) trained as an electromagnetic (EM) surrogate can predict S-parameters of pixelated passive networks quickly and with sufficient accuracy to be used inside an optimizer loop.
Paper reports development and use of a CNN surrogate mapping pixelated network layouts to S-parameters; the surrogate was embedded in the optimizer and used to evaluate candidate layouts during global search. (Note: exact training dataset size, architecture, and error metrics are not provided in the summary.)
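The surrogate-in-the-loop search pattern can be sketched as below. The trained CNN is replaced here by a stand-in scoring function so the loop is runnable; all names and the random-search strategy are illustrative assumptions, not the authors' optimizer:

```python
import numpy as np

rng = np.random.default_rng(1)

def surrogate_score(pixels):
    """Stand-in for the CNN surrogate: pixel layout -> scalar figure of merit.

    A trained CNN would return predicted S-parameters in milliseconds,
    replacing a full-wave EM simulation that takes orders of magnitude
    longer per candidate; that speed gap is what makes global search
    over the discrete design space feasible.
    """
    target = np.indices(pixels.shape).sum(axis=0) % 2  # arbitrary fixed pattern
    return np.abs(pixels - target).mean()              # lower is better

def global_search(shape=(8, 8), iters=500):
    best_layout, best_cost = None, np.inf
    for _ in range(iters):
        candidate = rng.integers(0, 2, size=shape)     # discrete pixel layout
        cost = surrogate_score(candidate)              # cheap surrogate call
        if cost < best_cost:
            best_layout, best_cost = candidate, cost
    return best_layout, best_cost

layout, cost = global_search()
print(cost < 0.5)
```

In practice the top surrogate-ranked candidates would still be verified with a small number of full-wave EM simulations before fabrication.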
Empirical evaluation shows the new quasi-Newton and trust-region methods outperform baseline sequential methods and prior parallel Newton variants across speed, memory use, stability, and convergence on the tested tasks.
Reported experiments comparing the proposed algorithms to sequential baselines and prior parallel Newton approaches on representative tasks (RNNs, MCMC); qualitative summary claims faster runtimes, lower memory, and improved stability.
Trust-region methods provide stability and improved convergence reliability across tested tasks.
Empirical comparisons and algorithmic analysis showing trust-region-enabled schemes had fewer divergences and more reliable convergence than prior parallel Newton variants in the evaluated workloads.
Quasi-Newton methods deliver faster runtimes and lower memory use in experiments on RNN inference/training and MCMC chains.
Empirical experiments comparing quasi-Newton implementations to full Newton and sequential baselines on representative tasks (explicit tasks listed: RNN inference/training and MCMC chains); reported qualitative outcomes indicate speed and memory advantages.
Trust-region variants substantially improve stability and robustness, addressing divergence issues of earlier parallel Newton implementations.
Presentation of trust-region schemes adapting step sizes within the parallel Newton framework; theoretical motivation and empirical results showing reduced divergence/failure rates compared to prior parallel Newton variants.
Quasi-Newton variants are more computationally efficient and memory friendly than full Newton.
Complexity and memory analyses in the thesis plus empirical comparisons on representative tasks (RNNs, MCMC) showing lower runtime and memory usage for quasi-Newton implementations versus full Newton.
A Parallel Newton framework, implemented with a parallel associative scan, provides a natural way to parallelize computations across sequence length.
Algorithmic design combining Newton updates with a parallel associative-scan reduction; implementation details and experiments demonstrating the mechanics of the parallel scan across time steps.
Parallel Newton methods can reliably and efficiently parallelize sequential dynamical systems (e.g., RNNs, MCMC) across sequence length when reframed as nonlinear equation solves.
Thesis presents a reformulation of sequence computation as a global nonlinear system, develops parallel Newton-style algorithms, and reports empirical experiments on representative tasks (RNN inference/training and MCMC chains) comparing runtime and convergence against sequential baselines and prior parallel Newton variants.
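The reformulation can be illustrated on a toy scalar recurrence. This sketch is not the thesis's implementation: the bidiagonal Newton solve below is written as a sequential loop for clarity, but it is exactly the kind of linear recurrence a parallel associative scan would evaluate in logarithmic depth:

```python
import numpy as np

# Recurrence x_t = tanh(a*x_{t-1} + b) recast as a global nonlinear
# system F(x) = 0 with F_t = x_t - f(x_{t-1}), solved by Newton steps.
a, b, x0, T = 0.9, 0.1, 0.0, 30

def f(x):
    return np.tanh(a * x + b)

def rollout():
    out, prev = np.empty(T), x0
    for t in range(T):
        prev = f(prev)
        out[t] = prev
    return out

def newton_solve(iters=T):
    x = np.zeros(T)
    for _ in range(iters):
        prev = np.concatenate(([x0], x[:-1]))
        r = x - f(prev)                  # residual F(x)
        fp = a * (1.0 - f(prev) ** 2)    # f'(x_{t-1}); Jacobian is bidiagonal
        d = np.empty(T)                  # solve J d = -r
        d[0] = -r[0]
        for t in range(1, T):            # linear recurrence: scan-parallelizable
            d[t] = -r[t] + fp[t] * d[t - 1]
        x = x + d
    return x

print(np.max(np.abs(newton_solve() - rollout())) < 1e-10)
```

The Jacobian of the stacked system is lower bidiagonal (ones on the diagonal, -f' on the subdiagonal), so each Newton step reduces to a first-order linear recurrence in d, which is associative and hence parallelizable across the sequence length.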
Adopting this approach shifts required skills and organizational roles away from lengthy parametric modeling toward data engineering, controller integration, and monitoring.
Authors' discussion of practical/organizational implications (qualitative); argument based on removal of model-building step and increased emphasis on data infrastructure and online operations.
DeePC outperforms baseline controllers (e.g., fixed-time and standard adaptive schemes) in the simulated experiments.
Comparative simulation experiments reported in the paper where DeePC-controlled signals achieve superior system-level metrics relative to baseline controllers.
The method was validated on a very large, high-fidelity microscopic closed-loop simulator of Zürich; the paper reports this as the largest such closed-loop urban-traffic simulation in the literature.
Authors' description of the experimental environment: city-scale microscopic simulator of Zürich with controller in the loop; explicit statement in the paper claiming it is the largest closed-loop urban-traffic simulation reported in the literature.
Regularization and the use of measured Hankel/data matrices make the method more robust to measurement noise and limited data.
Method description includes regularization terms in the DeePC optimization and use of Hankel matrices built from measured trajectories; simulation experiments show continued performance under noisy / limited-data conditions.
DeePC handles sparse or limited traffic measurements better than many machine-learning methods.
Claims in the paper supported by experiments and methodological notes: use of Hankel structures and regularization in DeePC to operate with limited/sparse sensing; comparative statements versus generic ML methods (qualitative and simulation evidence).
The DeePC-based approach avoids the expensive, time-consuming model-building step required by model-based control methods.
Methodological argument and demonstration that controller uses historical input–output trajectories directly rather than requiring separate parametric model identification; supported by simulation implementation that bypasses model identification.
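A minimal sketch of the data-driven ingredients, on an assumed toy first-order system (the names Up/Yp/Uf/Yf follow common DeePC notation, but this is not the paper's controller): a Hankel matrix is built from measured input-output data, a ridge-regularized least squares picks the DeePC variable g that matches a recent past window, and the implied future outputs are read off as Yf @ g, with no parametric model identified anywhere:

```python
import numpy as np

def hankel(w, L):
    """Stack all length-L windows of signal w as columns."""
    return np.column_stack([w[i:i + L] for i in range(len(w) - L + 1)])

# Toy system y[t+1] = 0.8*y[t] + u[t], "measured" historical data:
rng = np.random.default_rng(2)
u = rng.normal(size=60)
y = np.zeros(60)
for t in range(59):
    y[t + 1] = 0.8 * y[t] + u[t]

Tini, Tf = 4, 3                      # past and future horizons
L = Tini + Tf
Hu, Hy = hankel(u, L), hankel(y, L)
Up, Uf = Hu[:Tini], Hu[Tini:]
Yp, Yf = Hy[:Tini], Hy[Tini:]

u_ini, y_ini = u[-L:-Tf], y[-L:-Tf]  # most recent past window
u_f = u[-Tf:]                        # planned future inputs

A = np.vstack([Up, Yp, Uf])
b = np.concatenate([u_ini, y_ini, u_f])
lam = 1e-8                           # regularization (robustness to noise)
g = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
y_pred = Yf @ g                      # predicted future outputs

print(np.max(np.abs(y_pred - y[-Tf:])) < 1e-2)
```

In a full DeePC controller u_f would be a decision variable optimized against a cost on y_pred; here it is fixed so the sketch just verifies that the Hankel data alone reproduces the system's behavior.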
Legible decision modes and recorded contest pathways improve verifiability and lower information asymmetries, aiding regulators and platforms in monitoring and reducing litigation/reputational risk.
Analytic claim in the implications section; argued conceptually and tied to proposed logging/audit tools; no empirical validation.
The pattern can reduce costly misallocations caused by LLM unpredictability by constraining policy options, improving overall allocation efficiency in expectation.
Theoretical argument in the paper tying constrained policy space to reduced variability and misallocation risk; no empirical testing or quantitative model provided.