The Commonplace

Evidence (4560 claims)

Adoption: 5267 claims
Productivity: 4560 claims
Governance: 4137 claims
Human-AI Collaboration: 3103 claims
Labor Markets: 2506 claims
Innovation: 2354 claims
Org Design: 2340 claims
Skills & Training: 1945 claims
Inequality: 1322 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 378 106 59 455 1007
Governance & Regulation 379 176 116 58 739
Research Productivity 240 96 34 294 668
Organizational Efficiency 370 82 63 35 553
Technology Adoption Rate 296 118 66 29 513
Firm Productivity 277 34 68 10 394
AI Safety & Ethics 117 177 44 24 364
Output Quality 244 61 23 26 354
Market Structure 107 123 85 14 334
Decision Quality 168 74 37 19 301
Fiscal & Macroeconomic 75 52 32 21 187
Employment Level 70 32 74 8 186
Skill Acquisition 89 32 39 9 169
Firm Revenue 96 34 22 152
Innovation Output 106 12 21 11 151
Consumer Welfare 70 30 37 7 144
Regulatory Compliance 52 61 13 3 129
Inequality Measures 24 68 31 4 127
Task Allocation 75 11 29 6 121
Training Effectiveness 55 12 12 16 96
Error Rate 42 48 6 96
Worker Satisfaction 45 32 11 6 94
Task Completion Time 78 5 4 2 89
Wages & Compensation 46 13 19 5 83
Team Performance 44 9 15 7 76
Hiring & Recruitment 39 4 6 3 52
Automation Exposure 18 17 9 5 50
Job Displacement 5 31 12 48
Social Protection 21 10 6 2 39
Developer Productivity 29 3 3 1 36
Worker Turnover 10 12 3 25
Skill Obsolescence 3 19 2 24
Creative Output 15 5 3 1 24
Labor Share of Income 10 4 9 23
Active filter: Productivity
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
high · null result · Practicing with Language Models Cultivates Human Empathic Co... · alignment metrics, recipient-rated perceptions, blinded empathy judgments
A data-driven taxonomy was derived mapping common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
high · null result · Practicing with Language Models Cultivates Human Empathic Co... · taxonomy of empathic communication moves (categorical coding scheme)
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
high · null result · Practicing with Language Models Cultivates Human Empathic Co... · corpus size (number of messages, conversations, participants)
LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).
Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.
high · null result · HindSight: Evaluating LLM-Generated Research Ideas via Futur... · LLM-judge evaluation metric (e.g., LLM-assigned quality/novelty scores for gener...
MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic/physically-rich than prior datasets).
Design and motivation section in paper stating dataset construction targets clutter, occlusion, object variety, and complex object relations; dataset includes explicit contact annotations to capture interactions.
high · null result · MessyKitchens: Contact-rich object-level 3D scene reconstruc... · dataset characteristics: levels of occlusion, object variety, and annotated obje...
MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).
Dataset description in paper: collected real-world kitchen scenes and annotated object-level 3D shapes, poses, and contact/interaction labels. (No scene/instance counts provided in the supplied summary.)
high · null result · MessyKitchens: Contact-rich object-level 3D scene reconstruc... · dataset contents: object 3D shapes, object poses, object contact/interaction ann...
The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.
Method section / algorithm description in paper specifying the reflective/backtracking and distillation pipeline as the core of LEAFE.
high · null result · Internalizing Agency from Reflective Experience · N/A (algorithmic procedure description rather than an outcome)
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Human-quality proxy scores and comparative model rankings
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Components of the reward signal used for RL training
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Environment capabilities: OpenEnv compatibility and tool-use support
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Availability of experiment code (GitHub repo)
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Number of rollout trajectories in dataset (288) and coverage across models (6)
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Number of evaluation tasks (48 briefs) and number of models compared (6)
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Source of demonstration prompts (Claude Opus 4.6)
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
high · null result · Learning to Present: Inverse Specification Rewards for Agent... · Proportion of model parameters updated during training (0.5%)
Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.
Summary states measured benefits were qualitative and process metrics; no detailed quantitative throughput/coverage numbers provided. (Meta-claim about the evidence reported.)
high · null result · ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... · absence of detailed quantitative validation metrics in the reported results
Evaluation used seven benchmarks spanning online computer-use, offline computer-use, and multimodal tool-use reasoning tasks.
Benchmarks section in the summary states seven benchmarks covering those categories; no benchmark names or dataset sizes provided in the summary.
high · null result · Anticipatory Planning for Multimodal AI Agents · benchmark task performance (task success, generalization)
Objectives combine trajectory-level rewards (for global consistency) with stepwise grounded rewards derived from execution outcomes.
Method summary explicitly lists these objectives as part of the TraceR1 training procedure.
high · null result · Anticipatory Planning for Multimodal AI Agents · global plan consistency and stepwise execution outcomes
TraceR1 focuses on short-horizon trajectory forecasting to keep predictions tractable while capturing near-term consequences of actions.
Framework description in summary that emphasizes 'short-horizon trajectory forecasting' as a design choice.
high · null result · Anticipatory Planning for Multimodal AI Agents · forecast horizon (short-horizon) / tractability of predictions
During grounded fine-tuning, tools are treated as frozen agents and only the policy is adjusted using execution feedback (tools are not modified).
Explicit statement in Data & Methods section of the summary describing tool handling during grounded fine-tuning.
high · null result · Anticipatory Planning for Multimodal AI Agents · policy adaptation to tool execution feedback / tool-compatibility of executed ac...
Stage 2 of TraceR1 is a grounded fine-tuning phase that refines step-level accuracy and executability using execution feedback from frozen tool agents.
Method description in summary: Stage 2 — Grounded fine-tuning using execution feedback; tools are not retrained (treated as frozen agents) and feedback is used to adjust the policy.
high · null result · Anticipatory Planning for Multimodal AI Agents · step-level execution accuracy and executability
TraceR1 uses a two-stage training procedure: Stage 1 trains trajectory-level RL on predicted short-horizon trajectories with rewards that enforce global consistency.
Method description in summary: Stage 1 — Trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.
high · null result · Anticipatory Planning for Multimodal AI Agents · trajectory-level plan coherence / global consistency
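The two TraceR1 training signals described above, a trajectory-level reward for global consistency and stepwise grounded rewards from execution outcomes, could be blended per rollout roughly as follows. The function name and the mixing coefficient `alpha` are illustrative assumptions, not details from the paper.

```python
def trace_objective(trajectory_reward: float,
                    step_rewards: list[float],
                    alpha: float = 0.5) -> float:
    """Blend one trajectory-level reward (global consistency) with the mean
    of stepwise grounded rewards (execution outcomes).

    `alpha` is a hypothetical mixing coefficient, not a value from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    step_term = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return alpha * trajectory_reward + (1 - alpha) * step_term
```

With `alpha = 1.0` this reduces to Stage 1's purely trajectory-level signal; lowering `alpha` weights the Stage 2 grounded feedback more heavily.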
Measuring the marginal cost of runtime governance, the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
high · null result · Runtime Governance for AI Agents: Policies on Paths · existence of empirical research gaps (identified/not identified)
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
high · null result · Runtime Governance for AI Agents: Policies on Paths · use of empirical data (presence/absence of large-scale empirical evaluation)
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
high · null result · Runtime Governance for AI Agents: Policies on Paths · existence of calibrated thresholds and procedures (presence/absence)
Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.
Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.
high · null result · Deep Learning-Driven Black-Box Doherty Power Amplifier with ... · number of fabricated prototypes and test frequency (2.75 GHz)
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · operational stability, efficiency, and robustness/degradation metrics
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · baseline agent architectures used for comparison
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · number of LLMs evaluated (n = 8)
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · agent architectural modularity (temporal decomposition into strategy vs executio...
RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · environment difficulty gradient (complexity/stochasticity/non-stationarity level...
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
high · null result · RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... · benchmark realism and coverage of non-stationarity for long-horizon decision-mak...
The reinforcement learning objective optimizes a combined utility that trades off task success and resource costs; the reward penalizes delays and failures.
Learning method section describes training the high-level orchestrator with an RL reward that penalizes delays (latency/resource consumption) and failures, and that algorithmic/hyperparameter details are provided.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · training objective: combined utility of task success and resource cost
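The combined utility described above, a task-success term net of resource costs with penalties for delays and failures, might look like the following sketch. The function name, weights, and penalty magnitude are illustrative assumptions rather than values from the paper.

```python
def rarrl_reward(success: bool, latency_s: float, compute_cost: float,
                 w_latency: float = 0.1, w_compute: float = 0.05,
                 failure_penalty: float = 1.0) -> float:
    """Task-success bonus minus resource penalties.

    All weights and the failure penalty are hypothetical placeholders chosen
    for illustration, not values reported in the paper.
    """
    reward = 1.0 if success else -failure_penalty
    reward -= w_latency * latency_s + w_compute * compute_cost
    return reward
```

Under this shape, a policy that reasons constantly pays a steady latency/compute cost, while one that never reasons risks the larger failure penalty, which is exactly the tradeoff the orchestrator is trained to navigate.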
The experiments use empirical LLM latency profiles measured from ALFRED tasks to model realistic inference delays in simulation.
Environment/evaluation description states use of an embodied task suite based on ALFRED and empirical latency profiles to model realistic LLM inference delays.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · latency modeling (empirical latency profiles)
Baselines for comparison include fixed reasoning strategies (always reason, never reason), heuristic triggers for invoking LLMs, and ablations of RARRL components.
Paper lists these baselines explicitly in the Baselines and comparisons section and reports experiments comparing RARRL to them.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · baseline policy types used for comparison
The high-level orchestration policy uses observations that include current sensory observation, execution history, and remaining resources (e.g., remaining time or compute budget).
Key Points and Methods specify the observation space used by the orchestrator, listing sensory inputs, execution history, and resource remaining as inputs.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · policy input features (sensory observation, execution history, remaining resourc...
RARRL trains only a high-level orchestration policy via reinforcement learning and does not retrain the existing low-level control/policy modules end-to-end.
Methods/Model architecture describe a hierarchical approach where low-level controllers are existing modules and are not retrained; RL is applied to the high-level orchestrator.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · level of learning: high-level orchestration policy trained vs. low-level control...
RARRL (Resource-Aware Reasoning via Reinforcement Learning) is a hierarchical orchestration framework that learns a high-level policy to decide when an embodied agent should invoke LLM-based reasoning, which reasoning role to use, and how much compute budget to allocate.
Paper describes a hierarchical design with a learned high-level RL orchestrator that issues discrete decisions about reasoning invocation, reasoning role/mode, and compute budget allocation; architecture and decision space specified in Methods.
high · null result · When Should a Robot Think? Resource-Aware Reasoning via Rein... · decision variables: whether to call an LLM, reasoning role/mode selected, comput...
Pilot randomized or quasi-experimental implementations of reduced workweeks (across firms, industries, or regions) are needed to measure effects on employment, productivity, wages, and consumption.
Research-design recommendation motivated by lack of contemporary causal evidence; not an empirical finding but a stated priority for rigorous testing.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · measured causal effects of reduced workweeks on employment, productivity, wages,...
There is limited direct causal identification separating technology-driven layoffs from incentive-driven layoffs in current firm-level data, creating a need for new firm-panel datasets linking AI adoption, executive pay/ownership, layoff decisions, and local demand outcomes.
Stated limitation of the paper and research-priority recommendation; assessment based on literature gaps noted in the synthesis rather than empirical gap quantification.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · availability/coverage of firm-level panel data capable of separating AI effects ...
Observed layoffs should be treated in empirical research as outcomes of firm governance and incentive structures; econometric studies estimating displacement from AI must control for managerial incentives and financial pressures.
Methodological recommendation based on the conceptual argument and literature linking governance/incentives to firm behavior; no new empirical demonstration provided.
high · null result · A Shorter Workweek as a Policy Response to AI-Driven Labor D... · bias in estimated causal effect of AI on layoffs when not controlling for manage...
Research priorities include empirical testing and simulation of ISB-based control systems, cost–benefit analysis of proactive versus reactive AI governance, and distributional impact assessments.
Explicit research agenda proposed by the author (conceptual recommendation), not empirical results.
high · null result · DIGITAL TRANSFORMATION OF THE RUSSIAN FEDERATION’S SOCIOECON... · n/a (research agenda recommendation rather than an empirical outcome)
Further research is needed—randomized controlled trials, long-term impact measurement (earnings, employment stability, skill accumulation), distributional analysis, and model audits for bias.
Authors' stated research agenda and recommendations; not an empirical finding but a methodological recommendation following the pilot.
high · null result · AI-Driven Skill Mapping and Gig Economy Matching Algorithm f... · long-term earnings, employment stability, skill accumulation, distributional out...
The authors explicitly note limitations: the study focuses on prediction (not causation), results are sensitive to data quality, workforce records may contain biases, and practical constraints like privacy and deployment complexity limit direct operational adoption.
Limitations section described by the authors listing prediction-versus-causation distinction, sensitivity to data quality, potential biases, privacy concerns, and deployment complexity.
high · null result · Adoption of AI-Based HR Analytics and Its Impact on Firm Pro... · Scope and limitations of study conclusions (qualitative)
The study used a reproducible modeling pipeline (data cleaning, feature engineering, model training and tuning, systematic evaluation) applied to several freely available workforce datasets to enable replication.
Methods section describes a reproducible workflow including preprocessing steps, engineered features, hyperparameter tuning for each model class, cross-validation, and use of publicly available datasets.
high · null result · Adoption of AI-Based HR Analytics and Its Impact on Firm Pro... · Reproducibility of predictive modeling workflow (procedural, not an empirical pe...
This work is conceptual/theoretical and reports no original empirical dataset; it explicitly calls for mixed-methods empirical validation (case studies, field experiments, longitudinal studies), measurement development, and multi-level data collection.
Explicit methodological statement in the paper describing its nature as a theoretical synthesis and listing empirical needs; no empirical sample provided.
high · null result · Revolutionizing Human Resource Development: A Theoretical Fr... · presence/absence of original empirical data in the paper (none)
Four autonomous agents were benchmarked on the same fresh CTF challenge set alongside human teams.
Benchmarking experiment described in the study: four autonomous AI agents evaluated on the identical fresh challenge set used in the live onsite CTF.
high · null result · Understanding Human-AI Collaboration in Cybersecurity Compet... · agent performance metrics on the fresh CTF challenge set (success rates, traject...
Data and methods: the study used an online experiment with 861 online-retail employees performing short-duration, virtual, task-focused collaborations; analyses focused on direct effects, moderation (emotion and partner type), mediation (service empathy), and moderated-mediation.
Methods description in the paper specifying design, sample size (n = 861), task context (temporary virtual teamwork), and analytic approach (hypothesis tests including moderation and mediation analyses).
high · null result · Adoption of AI partners in temporary tasks: exploring the ef... · NA (methodological claim about study design and analyses)
Teamwork partner type (human vs AI) has no direct, significant effect on collaboration proficiency for temporary virtual tasks.
Online experiment with employees in the online-retail industry (n = 861). Hypothesis testing showed no significant main effect of partner type on the outcome variable 'collaboration proficiency' in the reported analyses.
high · null result · Adoption of AI partners in temporary tasks: exploring the ef... · collaboration proficiency
The paper recommends an empirical research agenda including field experiments comparing teams with and without AI mediation, structural models of labor supply and wages under reduced language frictions, microdata analysis of adopters, and measurement studies for coordination costs and mediated-action reliability.
Explicit recommendations and research agenda stated in the paper; this is a descriptive claim about the paper's content rather than an empirical finding.
high · null result · AI as a universal collaboration layer: Eliminating language ... · existence of the recommended research agenda items in the paper