The Commonplace

Evidence (7953 claims)

Adoption: 5539 claims
Productivity: 4793 claims
Governance: 4333 claims
Human-AI Collaboration: 3326 claims
Labor Markets: 2657 claims
Innovation: 2510 claims
Org Design: 2469 claims
Skills & Training: 2017 claims
Inequality: 1378 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 402 112 67 480 1076
Governance & Regulation 402 192 122 62 790
Research Productivity 249 98 34 311 697
Organizational Efficiency 395 95 70 40 603
Technology Adoption Rate 321 126 73 39 564
Firm Productivity 306 39 70 12 432
Output Quality 256 66 25 28 375
AI Safety & Ethics 116 177 44 24 363
Market Structure 107 128 85 14 339
Decision Quality 177 76 38 20 315
Fiscal & Macroeconomic 89 58 33 22 209
Employment Level 77 34 80 9 202
Skill Acquisition 92 33 40 9 174
Innovation Output 120 12 23 12 168
Firm Revenue 98 34 22 154
Consumer Welfare 73 31 37 7 148
Task Allocation 84 16 33 7 140
Inequality Measures 25 77 32 5 139
Regulatory Compliance 54 63 13 3 133
Error Rate 44 51 6 101
Task Completion Time 88 5 4 3 100
Training Effectiveness 58 12 12 16 99
Worker Satisfaction 47 32 11 7 97
Wages & Compensation 53 15 20 5 93
Team Performance 47 12 15 7 82
Automation Exposure 24 22 9 6 62
Job Displacement 6 38 13 57
Hiring & Recruitment 41 4 6 3 54
Developer Productivity 34 4 3 1 42
Social Protection 22 10 6 2 40
Creative Output 16 7 5 1 29
Labor Share of Income 12 5 9 26
Skill Obsolescence 3 20 2 25
Worker Turnover 10 12 3 25
LEAFE converts rich environment feedback into actionable corrective supervision rather than optimizing only final success signals, which drives performance gains.
Algorithmic description: LEAFE summarizes error messages/intermediate observations into experience items, backtracks to causal decision points, explores corrective branches, and distills corrected trajectories via supervised fine-tuning. Empirical comparisons show improved Pass@k relative to reward-only/outcome-driven baselines.
medium positive Internalizing Agency from Reflective Experience Pass@k performance; also qualitative measure of learned recovery behavior (impli...
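The Pass@k metric named in this entry is commonly computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn and c of them pass. The summary does not say which estimator the LEAFE authors use, so the sketch below is an assumption about the metric, not their evaluation code; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the widely used combinatorial estimator; the
    summary does not confirm that the paper computes Pass@k this way.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail
```

With 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 0.3 up to floating-point rounding, and the value rises as k grows because larger draws are more likely to include a passing sample.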
Open dataset and code improve reproducibility and lower barriers for follow-up work on applied LLM tools and economic impact studies.
Release of SlideRL dataset (288 rollouts) and code repository; general statement about reproducibility benefits.
medium positive Learning to Present: Inverse Specification Rewards for Agent... Availability of artifacts that can be used to reproduce/extend the work
Parameter-efficient RL fine-tuning (0.5% of params) can yield large quality gains, implying a potentially high ROI for targeted fine-tuning versus full-model scaling.
The tuned 7B model shows an observed gain of +33.1% over its untuned base and 91.2% relative performance versus Claude Opus 4.6; the implication drawn is that tuning a small fraction of parameters is more cost-effective than scaling model size.
medium positive Learning to Present: Inverse Specification Rewards for Agent... Quality gains after parameter-efficient fine-tuning and implied cost-effectivene...
The inverse-specification reward—where an LLM attempts to recover the original brief from generated slides—provides a holistic fidelity signal.
Reward design: inverse-specification component implemented and used as part of composite reward; claimed to measure fidelity via recovery accuracy.
medium positive Learning to Present: Inverse Specification Rewards for Agent... Accuracy of recovering original brief from generated slides (used as fidelity si...
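One way to read "recovery accuracy" is as the share of fields in the original brief that an LLM recovers from the generated slides. The sketch below scores only that last comparison step. Representing a brief as a flat dictionary, the field names, and exact-match scoring are all assumptions made for illustration; the summary does not describe how the paper encodes briefs or scores recovery, and the LLM recovery call itself is not shown.

```python
def recovery_accuracy(original: dict, recovered: dict) -> float:
    """Fraction of original brief fields whose value was recovered exactly.

    Hypothetical scoring rule: `recovered` stands in for whatever an LLM
    reconstructs from the slides; exact matching is a simplification.
    """
    if not original:
        raise ValueError("original brief must have at least one field")
    hits = sum(1 for key, value in original.items() if recovered.get(key) == value)
    return hits / len(original)


# Illustrative brief and a recovery attempt that gets the audience wrong.
brief = {"topic": "Q3 results", "audience": "board", "slides": 5}
guess = {"topic": "Q3 results", "audience": "investors", "slides": 5}
score = recovery_accuracy(brief, guess)  # two of three fields match
```

A score like this could serve as one term of a composite reward, which is how the entry describes the inverse-specification component being used.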
Performance on this agentic slide-generation task is driven more by instruction adherence and tool-use compliance than by raw model parameter count.
Cross-model comparison across six models on the 48-task benchmark, with analyses showing instruction adherence and tool-use compliance better predict agent performance than parameter count.
medium positive Learning to Present: Inverse Specification Rewards for Agent... Predictive strength (correlation/importance) of instruction adherence and tool-u...
Adoption will shift labor demand toward expertise in deterministic capture/replay tooling, trace analytics, and integration automation.
Economic/organizational implication discussed in the summary; no employment-data analysis provided—stated as an expected change in skill demand.
medium positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... change in required engineering skill sets and labor demand
The approach improves utilization and ROI of expensive emulation/simulation resources by enabling reuse of deterministic traces across platforms.
Implication drawn from being able to replay identical traces on both simulator and emulator; no direct financial ROI calculation or utilization metrics provided in the summary.
medium positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... emulation/simulation resource utilization and implied ROI (qualitative)
Using replay-driven validation markedly shortens integration and debug cycles for the demonstrated chiplet subsystem, enabling end-to-end system boot and workload execution within a single quarter.
Reported outcome for the ODIN SoC building block: authors state they were able to reach full system boot and run workloads within one quarter of integration using the methodology. (Single-case timeline reported; no control/comparison group or statistical analysis provided.)
medium positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... integration cycle time (time to end-to-end boot and workload execution, measured...
Replay-driven validation made previously hard-to-reproduce interactions and bugs deterministic and repeatable at system level, enabling more focused and efficient debug.
Authors report that deterministic capture/replay converted non-deterministic protocol interactions and transient bugs into repeatable traces that could be inspected and debugged; examples include complex GPU workloads and protocol sequences reproduced end-to-end. (Qualitative/process-level evidence from the demonstrator; no numerical bug-count reduction provided.)
medium positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... repeatability/determinism of intermittent interactions and bugs; debug focus/eff...
A replay-driven validation methodology using deterministic waveform capture and replay from a single design database enables reliable, repeatable system-level reproduction of complex GPU workloads and protocol sequences for tightly coupled CPU–GPU chiplet subsystems.
Applied to a demonstrator SoC building block (ODIN chiplet architecture) integrating a CPU subsystem, multiple Intel Xe GPU cores, and a configurable NoC; deterministic waveform capture during execution and deterministic replay of those waveforms across targets was performed; same design database used to manage captures, traces, and replay sessions. (No large-sample statistical evaluation reported; demonstration limited to the described system.)
medium positive ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... system-level reproducibility of GPU workloads and inter-chiplet protocol sequenc...
Overall conclusion: forecast-then-execute (anticipatory trajectory reasoning) is an effective principle for building multimodal agents capable of reasoning, planning, and acting in complex environments.
The paper's conclusion, as given in the provided summary, asserts this on the basis of the reported experimental comparisons and the two-stage TraceR1 framework.
medium positive Anticipatory Planning for Multimodal AI Agents agent capability on complex, multi-step multimodal tasks (planning, reasoning, a...
The paper reports improvements in planning stability (consistency of multi-step plans), execution robustness (success under environment/tool variability), and generalization (out-of-distribution tasks and unseen tool/environment states).
Reported outcomes in the summary explicitly list these three improvement categories; the specific metrics and magnitudes are not provided in the summary.
medium positive Anticipatory Planning for Multimodal AI Agents planning stability, execution robustness, generalization
Compared to reactive agents that optimize actions stepwise without trajectory anticipation, TraceR1 yields better multi-step planning and execution.
Baselines & comparisons described in the summary include reactive agents; the paper reports improvements of TraceR1 relative to these baselines across the benchmarks (no numeric values in the provided text).
medium positive Anticipatory Planning for Multimodal AI Agents multi-step planning stability, execution success rate
Explicit anticipatory (trajectory-level) reasoning is a crucial design principle for reliable multi-step task performance in complex real-world environments.
Paper reports comparisons between anticipatory (trajectory-forecasting) agents and reactive / single-stage baselines, concluding the anticipatory design yields better multi-step reliability; exact experimental details and statistics not included in the provided summary.
medium positive Anticipatory Planning for Multimodal AI Agents multi-step task reliability (task success over sequences), plan coherence
TraceR1 materially improves planning coherence, execution robustness, and generalization in multimodal, tool-using agents versus reactive or single-stage baselines.
Reported evaluation across seven benchmarks (online and offline computer-use, multimodal tool-use reasoning) comparing TraceR1 to reactive agents and single-stage RL baselines; the summary states 'substantial gains' but gives no numerical results.
medium positive Anticipatory Planning for Multimodal AI Agents planning coherence (stability), execution robustness (success rate under variabi...
The proposed algorithm's performance is robust to heterogeneous populations in the synthetic experiments (i.e., it continues to find core alternatives under varying degrees of population heterogeneity).
Empirical robustness checks reported in the experiments where population heterogeneity is varied and performance (core-attainment frequency) is evaluated.
medium positive Finding Common Ground in a Sea of Alternatives frequency/proportion of core outcomes as a function of population heterogeneity
The authors compare their sampling algorithm against classical social-choice rules and LLM-based heuristics and report superior core-attainment frequency for their method.
Experimental comparisons described in the paper between the proposed algorithm and baseline methods (classical social-choice rules, LLM-based heuristics) on the synthetic dataset; results summarized in the experiments section.
medium positive Finding Common Ground in a Sea of Alternatives relative frequency/proportion of outputs that lie in the proportional veto core ...
On a synthetic text-preference dataset, the proposed algorithm reliably finds alternatives that lie in the proportional veto core.
Empirical experiments reported in the paper using a synthetic dataset of text preferences; evaluation metric reported as frequency (proportion) of runs where the returned alternative is in the proportional veto core.
medium positive Finding Common Ground in a Sea of Alternatives frequency/proportion of experimental trials producing outcomes in the proportion...
Temporal grounding (restricting models to contemporaneous information) should be adopted as a methodological best practice in economic research using LLMs to avoid leakage and produce more realistic assessments of model forecasting ability.
Study methodology and rationale emphasize temporal grounding; authors recommend it as best practice based on the observed benefits in reducing retrospective contamination.
medium positive When AI Navigates the Fog of War recommended methodological practice adoption (procedural recommendation)
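Temporal grounding, as described in this entry, amounts to filtering the information a model may see by the timestamp of the evaluation node. The sketch below shows that filter in isolation. The `Document` type, the function name, and the example dates are hypothetical; the summary does not describe the study's actual data pipeline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    published: datetime
    text: str


def contemporaneous(docs, as_of):
    """Keep only documents published at or before the node's timestamp."""
    return [d for d in docs if d.published <= as_of]


# Illustrative dates only; they are not taken from the study.
docs = [
    Document(datetime(2026, 1, 5), "early report"),
    Document(datetime(2026, 2, 1), "later analysis"),
]
visible = contemporaneous(docs, as_of=datetime(2026, 1, 10))
```

Here `visible` contains only the early report, so a model queried at the 10 January node cannot draw on the later analysis.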
Because the conflict unfolded after the training cutoffs of contemporary frontier LLMs, the dataset and analyses provide an archival, hindsight-free benchmark for studying model reasoning.
Case selection rationale: the 2026 Middle East conflict was deliberately chosen because it occurred after the training cutoffs of the evaluated frontier models; dataset preserves contemporaneous queries and model outputs.
medium positive When AI Navigates the Fog of War availability of a hindsight-free archival benchmark (dataset existence and timin...
Frontier large language models (LLMs) can reason about an unfolding geopolitical crisis using only contemporaneous public information, often demonstrating strategic realism (inferring underlying structural incentives beyond surface rhetoric).
Evaluation across 11 temporally defined nodes during the early 2026 Middle East conflict using 42 node-specific verifiable questions and 5 exploratory prompts; results assessed via verifiability checks and qualitative coding for strategic reasoning of outputs from contemporary frontier LLMs constrained to contemporaneous information.
medium positive When AI Navigates the Fog of War reasoning quality / frequency of responses exhibiting strategic realism (qualita...
BATQuant establishes new state-of-the-art results across multimodal benchmarks for MXFP4-aware PTQ under aggressive quantization.
Comparative benchmark results reported in the paper showing BATQuant outperforming prior PTQ methods on the described multimodal benchmarks (specific benchmark names and quantitative margins not provided in the summary).
medium positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Benchmark performance (accuracy/quality) on multimodal tasks relative to prior P...
Ablation analyses show that each BATQuant component (block-wise transforms, orthogonality relaxation, GPK decomposition, block-wise clipping) contributes to robustness and efficiency.
Reported ablation studies isolating components and measuring their individual impact on performance and overhead in the paper's experiments (exact effect sizes and per-component numbers not given in the summary).
medium positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Task performance (accuracy/quality) and efficiency metrics (storage/runtime) wit...
Block-wise learnable clipping suppresses residual outliers locally and contributes to robustness under aggressive MXFP4 quantization.
Method description and ablation experiments in the paper showing incremental improvement when adding block-wise learnable clipping layers versus not using them; improvements measured on benchmark metrics post-quantization.
medium positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Residual outlier statistics and downstream task performance after applying learn...
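To see why a single outlier matters at block granularity, the sketch below fake-quantizes one block with a shared power-of-two scale and FP4 (E2M1) element values, following the Microscaling format as commonly described. It is not BATQuant's implementation: the clip threshold here is a fixed number rather than a learned parameter, tie handling is simplified, the shared scale is not clamped to its 8-bit exponent range, and the function name is illustrative.

```python
import math

# Magnitudes representable by an FP4 E2M1 element.
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_EMAX = 2  # exponent of the largest E2M1 binade (4.0 == 2**2)


def quantize_block(block, clip=None):
    """Fake-quantize one block: shared power-of-two scale, FP4 elements.

    `clip` is a fixed symmetric threshold standing in for a learned one.
    """
    if clip is not None:
        block = [max(-clip, min(clip, x)) for x in block]
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared scale chosen so the largest element lands in the top binade.
    scale = 2.0 ** (math.floor(math.log2(amax)) - FP4_EMAX)
    out = []
    for x in block:
        # Nearest representable magnitude; values above 6 saturate at 6.
        mag = min(FP4_MAGNITUDES, key=lambda m: abs(abs(x) / scale - m))
        out.append(math.copysign(mag * scale, x))
    return out


block = [0.1] * 31 + [8.0]               # one outlier among small values
plain = quantize_block(block)            # the outlier sets the shared scale
clipped = quantize_block(block, clip=0.5)
```

Without clipping, the outlier forces a scale of 2 and every 0.1 rounds to zero; with the clip, the small values survive as 0.125 at the cost of truncating the outlier to 0.5. This is the trade-off that a per-block clipping threshold tunes locally.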
Global and Private Kronecker (GPK) decomposition compresses transform parameters, keeping storage and runtime overhead low compared to dense per-block transforms.
Algorithmic contribution described in the paper with reported comparisons (storage/runtime overhead) versus dense per-block transform parameterizations; supported by experimental/implementation measurements (specific memory/runtime numbers not provided in the summary).
medium positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Storage footprint and runtime overhead of transform parameterization (memory and...
Relaxing orthogonality constraints on transforms (i.e., using non-strictly-orthogonal transforms) improves distribution shaping and better fits activations to the limited MXFP quantization range.
Design rationale and ablation studies reported in the paper showing that removing strict orthogonality yields better quantization fit and improved task metrics versus enforced orthogonal transforms.
medium positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Quantization fit (activation distribution shape) and resulting task accuracy/qua...
Aligning transforms to MXFP block granularity using block-wise affine transformations prevents cross-block outlier propagation and avoids the severe collapse seen with rotation-based integer quantization techniques.
Methodological design plus ablation/empirical results in the paper showing improved activation statistics and preserved model accuracy when using block-wise affine transforms aligned to MXFP blocks versus global rotations.
medium positive BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... Activation distribution (outlier propagation) and downstream task performance / ...
Standardized runtime governance frameworks could lower per-deployment compliance engineering costs and increase diffusion of agentic systems.
Theoretical argument that standardization reduces transaction/engineering costs; suggested market dynamics; no empirical implementation evidence.
medium positive Runtime Governance for AI Agents: Policies on Paths per-deployment compliance cost and diffusion rate (adoption)
A market will develop for third-party governance tools, auditors, and insurers providing policy evaluators, risk calibration, and certification services.
Economic argument and analogy to existing markets (governance-as-a-service, insurance); no empirical evidence presented.
medium positive Runtime Governance for AI Agents: Policies on Paths emergence of third-party governance services (market development; presence/size ...
Benchmarking time-sensitivity (via V-DyKnow) can inform procurement decisions: buyers should assess models on their ability to handle temporally sensitive information, not just static benchmarks.
Paper's recommendations and implications section arguing for procurement practices informed by V-DyKnow evaluations.
medium positive V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... usefulness of benchmark for procurement decision criteria (qualitative)
The authors provide an operational inventory and conversation-analysis tool (the 28-code instrument) that can be reused for monitoring and mitigation by researchers, firms, and regulators.
Paper includes the codebook and describes its application as a reusable monitoring/analysis instrument; proposed adoption discussed in implications.
medium positive Characterizing Delusional Spirals through Human-LLM Chat Log... availability and intended reusability of the 28-code inventory and analysis meth...
This is the first empirical, message-level study of verified chatbot-related psychological-harm cases (as opposed to speculative discussion).
Authors' positioning in paper; claim of novelty based on review of prior literature and their message-level, verified-case approach.
medium positive Characterizing Delusional Spirals through Human-LLM Chat Log... novelty / contribution described (message-level empirical analysis of verified h...
The authors synthesized complex three-port pixelated output combiners that extend efficiency over back-off using fully symmetrical device implementations.
Design novelty claimed in paper; resulting three-port pixelated combiner layouts were included in the optimization output and used in prototypes. Prototypes used symmetrical device implementations.
medium positive Deep Learning-Driven Black-Box Doherty Power Amplifier with ... combiner topology/layout complexity and achieved efficiency across back-off
The CNN EM surrogate enables orders-of-magnitude faster evaluations than full-wave EM simulation, enabling global search of the discrete pixel design space.
Authors state the surrogate provides orders-of-magnitude speedups compared to full-wave EM, enabling global search; the provided summary gives no quantitative speedup numbers or benchmarking details.
medium positive Deep Learning-Driven Black-Box Doherty Power Amplifier with ... evaluation time per candidate layout (surrogate inference time vs full-wave EM s...
A deep convolutional neural network (CNN) trained as an electromagnetic (EM) surrogate can predict S-parameters of pixelated passive networks quickly and with sufficient accuracy to be used inside an optimizer loop.
Paper reports development and use of a CNN surrogate mapping pixelated network layouts to S-parameters; the surrogate was embedded in the optimizer and used to evaluate candidate layouts during global search. (Note: exact training dataset size, architecture, and error metrics are not provided in the summary.)
medium positive Deep Learning-Driven Black-Box Doherty Power Amplifier with ... S-parameter prediction accuracy and inference runtime sufficient for optimizer u...
Empirical evaluation shows the new quasi-Newton and trust-region methods outperform baseline sequential methods and prior parallel Newton variants across speed, memory, stability, and convergence on the tested tasks.
Reported experiments comparing the proposed algorithms to sequential baselines and prior parallel Newton approaches on representative tasks (RNNs, MCMC); qualitative summary claims faster runtimes, lower memory, and improved stability.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... multi-metric performance: runtime, memory, stability, convergence on benchmark t...
Trust-region methods provide stability and improved convergence reliability across tested tasks.
Empirical comparisons and algorithmic analysis showing trust-region-enabled schemes had fewer divergences and more reliable convergence than prior parallel Newton variants in the evaluated workloads.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... stability (failure/divergence frequency) and convergence reliability in experime...
Quasi-Newton methods deliver faster runtimes and lower memory use in experiments on RNN inference/training and MCMC chains.
Empirical experiments comparing quasi-Newton implementations to full Newton and sequential baselines on representative tasks (explicit tasks listed: RNN inference/training and MCMC chains); reported qualitative outcomes indicate speed and memory advantages.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... wall-clock runtime and peak memory usage in experimental tasks
Trust-region variants substantially improve stability and robustness, addressing divergence issues of earlier parallel Newton implementations.
Presentation of trust-region schemes adapting step sizes within the parallel Newton framework; theoretical motivation and empirical results showing reduced divergence/failure rates compared to prior parallel Newton variants.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... stability metrics (divergence/failure rate), convergence reliability
Quasi-Newton variants are more computationally efficient and memory friendly than full Newton.
Complexity and memory analyses in the thesis plus empirical comparisons on representative tasks (RNNs, MCMC) showing lower runtime and memory usage for quasi-Newton implementations versus full Newton.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... wall-clock runtime and memory consumption
A Parallel Newton framework, implemented with a parallel associative scan, provides a natural way to parallelize computations across sequence length.
Algorithmic design combining Newton updates with a parallel associative-scan reduction; implementation details and experiments demonstrating the mechanics of the parallel scan across time steps.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... ability to perform Newton-style updates in parallel across time (scalability / r...
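The combination of Newton updates with an associative scan can be illustrated on a toy scalar recurrence. Each Newton iteration linearizes the recurrence around the current guess, which yields an affine recurrence x_t = a_t * x_(t-1) + b_t; composing affine maps is associative, so all prefixes can be obtained by a scan. The sketch below is a minimal illustration under assumptions: the tanh recurrence, the weight `w`, and all function names are hypothetical, and `itertools.accumulate` evaluates the scan left to right rather than in parallel.

```python
import math
from itertools import accumulate


def sequential(x0, inputs, w=0.9):
    """Reference: evaluate x_t = tanh(w * x_{t-1} + u_t) step by step."""
    xs, x = [], x0
    for u in inputs:
        x = math.tanh(w * x + u)
        xs.append(x)
    return xs


def combine(first, second):
    """Compose affine maps x -> a*x + b (apply `first`, then `second`)."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def newton_solve(x0, inputs, w=0.9, tol=1e-12):
    """Solve for the whole trajectory at once by Newton iteration."""
    T = len(inputs)
    guess = [0.0] * T
    for _ in range(T):  # exact after at most T iterations
        prev = [x0] + guess[:-1]
        affine = []
        for xp, u in zip(prev, inputs):
            f = math.tanh(w * xp + u)
            jac = w * (1.0 - f * f)  # derivative of the step at the guess
            affine.append((jac, f - jac * xp))
        # `combine` is associative, so a parallel scan could compute the
        # same prefixes in logarithmic depth; accumulate is sequential.
        prefixes = accumulate(affine, combine)
        new = [a * x0 + b for a, b in prefixes]
        delta = max(abs(n - g) for n, g in zip(new, guess))
        guess = new
        if delta < tol:
            break
    return guess


inputs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.2]
solved = newton_solve(0.0, inputs)
reference = sequential(0.0, inputs)
```

The two trajectories agree to floating-point precision. The appeal of the reformulation is that the per-iteration work (linearization and scan) has no step-to-step dependency that forces sequential execution, provided the number of Newton iterations stays small relative to the sequence length.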
Parallel Newton methods can reliably and efficiently parallelize sequential dynamical systems (e.g., RNNs, MCMC) across sequence length when reframed as nonlinear equation solves.
Thesis presents a reformulation of sequence computation as a global nonlinear system, develops parallel Newton-style algorithms, and reports empirical experiments on representative tasks (RNN inference/training and MCMC chains) comparing runtime and convergence against sequential baselines and prior parallel Newton variants.
medium positive Unifying Optimization and Dynamics to Parallelize Sequential... parallelization speedup / runtime and convergence behavior across sequence lengt...
Adopting this approach shifts required skills and organizational roles away from lengthy parametric modeling toward data engineering, controller integration, and monitoring.
Authors' discussion of practical/organizational implications (qualitative); argument based on removal of model-building step and increased emphasis on data infrastructure and online operations.
medium positive Data-driven generalized perimeter control: Zürich case study changes in required skills/organizational roles (qualitative workforce compositi...
DeePC outperforms baseline controllers (e.g., fixed-time and standard adaptive schemes) in the simulated experiments.
Comparative simulation experiments reported in the paper where DeePC-controlled signals achieve superior system-level metrics relative to baseline controllers.
medium positive Data-driven generalized perimeter control: Zürich case study system-level outcomes (total travel time, CO2 emissions) compared across control...
The method was validated on a very large, high-fidelity microscopic closed-loop simulator of Zürich; the paper reports this as the largest such closed-loop urban-traffic simulation in the literature.
Authors' description of the experimental environment: city-scale microscopic simulator of Zürich with controller in the loop; explicit statement in the paper claiming it is the largest closed-loop urban-traffic simulation reported in the literature.
medium positive Data-driven generalized perimeter control: Zürich case study scale of validation (city-scale microscopic closed-loop simulation)
Regularization and the use of measured Hankel/data matrices make the method more robust to measurement noise and limited data.
Method description includes regularization terms in the DeePC optimization and use of Hankel matrices built from measured trajectories; simulation experiments show continued performance under noisy / limited-data conditions.
medium positive Data-driven generalized perimeter control: Zürich case study robustness to measurement noise and limited data (performance degradation metric...
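The Hankel matrices mentioned in this entry stack sliding windows of a measured trajectory as columns. In the standard DeePC formulation, candidate future trajectories are expressed as linear combinations of those columns, and regularization on the combination weights limits the influence of noise. The sketch below shows only the matrix construction for a scalar signal; the function name and example data are illustrative, and the paper's exact formulation is not given in the summary.

```python
def hankel(signal, depth):
    """Return the depth x (T - depth + 1) Hankel matrix of a trajectory.

    Entry (i, j) is signal[i + j], so column j is the length-`depth`
    window of the signal starting at sample j.
    """
    cols = len(signal) - depth + 1
    if depth < 1 or cols < 1:
        raise ValueError("need 1 <= depth <= len(signal)")
    return [[signal[i + j] for j in range(cols)] for i in range(depth)]


# Illustrative measurements, e.g. flow counts at one sensor.
flows = [1, 2, 3, 4, 5]
H = hankel(flows, depth=3)
```

For this input, `H` is `[[1, 2, 3], [2, 3, 4], [3, 4, 5]]`: three overlapping windows of length three, one per column. Longer recorded trajectories add columns, which is why limited data constrains what the controller can represent.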
DeePC handles sparse or limited traffic measurements better than many machine-learning methods.
Claims in the paper supported by experiments and methodological notes: use of Hankel structures and regularization in DeePC to operate with limited/sparse sensing; comparative statements versus generic ML methods (qualitative and simulation evidence).
medium positive Data-driven generalized perimeter control: Zürich case study controller performance (e.g., travel time, emissions) under sparse sensing / lim...
The DeePC-based approach avoids the expensive, time-consuming model-building step required by model-based control methods.
Methodological argument and demonstration that controller uses historical input–output trajectories directly rather than requiring separate parametric model identification; supported by simulation implementation that bypasses model identification.
medium positive Data-driven generalized perimeter control: Zürich case study need for explicit parametric model identification (development time/effort proxy...
Legible decision modes and recorded contest pathways improve verifiability and lower information asymmetries, aiding regulators and platforms in monitoring and reducing litigation/reputational risk.
Analytic claim in the implications section; argued conceptually and tied to proposed logging/audit tools; no empirical validation.
medium positive Designing for Disagreement: Front-End Guardrails for Assista... verifiability/auditability (availability of logs), regulator/platform monitoring...
The pattern can reduce costly misallocations caused by LLM unpredictability by constraining policy options, improving overall allocation efficiency in expectation.
Theoretical argument in the paper tying constrained policy space to reduced variability and misallocation risk; no empirical testing or quantitative model provided.
medium positive Designing for Disagreement: Front-End Guardrails for Assista... allocation efficiency (time-to-help, correct-priority assignments, resource util...