The Commonplace

Evidence (7953 claims)

Adoption (5539 claims)
Productivity (4793 claims)
Governance (4333 claims)
Human-AI Collaboration (3326 claims)
Labor Markets (2657 claims)
Innovation (2510 claims)
Org Design (2469 claims)
Skills & Training (2017 claims)
Inequality (1378 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 402 112 67 480 1076
Governance & Regulation 402 192 122 62 790
Research Productivity 249 98 34 311 697
Organizational Efficiency 395 95 70 40 603
Technology Adoption Rate 321 126 73 39 564
Firm Productivity 306 39 70 12 432
Output Quality 256 66 25 28 375
AI Safety & Ethics 116 177 44 24 363
Market Structure 107 128 85 14 339
Decision Quality 177 76 38 20 315
Fiscal & Macroeconomic 89 58 33 22 209
Employment Level 77 34 80 9 202
Skill Acquisition 92 33 40 9 174
Innovation Output 120 12 23 12 168
Firm Revenue 98 34 22 154
Consumer Welfare 73 31 37 7 148
Task Allocation 84 16 33 7 140
Inequality Measures 25 77 32 5 139
Regulatory Compliance 54 63 13 3 133
Error Rate 44 51 6 101
Task Completion Time 88 5 4 3 100
Training Effectiveness 58 12 12 16 99
Worker Satisfaction 47 32 11 7 97
Wages & Compensation 53 15 20 5 93
Team Performance 47 12 15 7 82
Automation Exposure 24 22 9 6 62
Job Displacement 6 38 13 57
Hiring & Recruitment 41 4 6 3 54
Developer Productivity 34 4 3 1 42
Social Protection 22 10 6 2 40
Creative Output 16 7 5 1 29
Labor Share of Income 12 5 9 26
Skill Obsolescence 3 20 2 25
Worker Turnover 10 12 3 25
The paper organizes production failure modes across five dimensions—server contracts, user context, timeouts, errors, and observability—and provides concrete failure vignettes from an enterprise deployment.
Taxonomy and failure vignettes are listed as design artifacts and deliverables in the paper; derived from observational analysis of production logs and incidents.
high null result Bridging Protocol and Production: Design Patterns for Deploy... classification coverage of failure incidents across the five dimensions
The experiment used NYSE TAQ transaction and quote data for SPY covering 2015–2024 and tested six pre-specified hypotheses about market-quality trends.
Data and methods section specifying dataset (NYSE TAQ SPY, 2015–2024), the number of pre-specified hypotheses (six), and experimental protocol with 150 autonomous agents.
high null result Nonstandard Errors in AI Agents dataset and experimental design variables (data coverage, number of hypotheses t...
Agents' methodological choices and resulting effect estimates were systematically recorded and used to quantify dispersion and measure switching across stages.
Study design description: recorded agents' methodological choices (measure selection, estimation procedures), resulting estimates, and tracked switching and dispersion metrics (IQR) across the three-stage protocol applied to SPY TAQ data (2015–2024) with 150 agents.
high null result Nonstandard Errors in AI Agents recorded methodological choices (categorical), effect estimates (continuous), di...
AI peer review (agents exchanging written critiques) produced minimal reduction in dispersion of estimates.
Three-stage protocol: after stage 1 (independent analyses) and stage 2 (AI peer review), measured dispersion (e.g., IQR) across agents showed little change following the peer-review stage across the six hypotheses and agent pool (n=150).
high null result Nonstandard Errors in AI Agents change in dispersion (IQR) of estimates between independent-analysis stage and p...
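A minimal sketch of how the dispersion comparison described above could be computed, assuming dispersion is the interquartile range of agents' effect estimates for one hypothesis. The function names and the sample estimates are hypothetical and are not taken from the paper.

```python
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) using inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def dispersion_change(stage1, stage2):
    """Relative change in IQR between two stages (negative means convergence)."""
    before = iqr(stage1)
    after = iqr(stage2)
    return (after - before) / before


# Hypothetical effect estimates from five agents (illustrative only).
independent = [0.10, 0.20, 0.30, 0.40, 0.50]
after_review = [0.10, 0.21, 0.30, 0.39, 0.50]

change = dispersion_change(independent, after_review)
print(f"relative IQR change: {change:.2%}")
```

A value near zero would correspond to the "minimal reduction" pattern the claim describes; a strongly negative value would indicate convergence after peer review.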
The work is qualitative and exploratory: it presents naturalistic phenomena rather than causal empirical estimates and is intended to be hypothesis-generating rather than definitive.
Methodology explicitly stated: naturalistic, qualitative daily observations over one month across multiple platforms; comparative observational documentation without experimental manipulation or causal identification.
high null result When Openclaw Agents Learn from Each Other: Insights from Em... nature of evidence (qualitative/exploratory vs. causal inference)
Future empirical work should measure calibration (user trust vs. model accuracy), hallucination rate, user comprehension of capability limits, and behavioral dependence on system recommendations.
Explicit methodological recommendations and suggested metrics in the paper; these are proposed future measurements rather than reported findings.
high null result Why We Need to Destroy the Illusion of Speaking to A Human: ... calibration metrics, hallucination rates, user comprehension, behavioral depende...
Conversational AI differs from interpersonal conversation: it has no true beliefs/intentions or accountability and produces probabilistic, sometimes inconsistent outputs with opaque training/data provenance.
Analytical/distinctive claim based on properties of LLMs and machine learning models discussed in the paper; conceptual analysis, no empirical testing.
high null result Why We Need to Destroy the Illusion of Speaking to A Human: ... ontological status of AI outputs (beliefs/intentions/accountability) and propert...
CoMAI is a modular, four-agent interview-assessment framework coordinated by a centralized finite-state machine.
System design and implementation described in the paper: a pipeline of four specialized agents (question generation, security/validation, scoring by rubric, summarization/reporting) with a centralized finite-state machine enforcing workflow and information flow constraints.
high null result CoMAI: A Collaborative Multi-Agent Framework for Robust and ... system architecture (agent decomposition and FSM coordination)
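The coordination pattern described above can be sketched as a small finite-state machine that hands control to one agent at a time. This is an illustration only: the strictly linear stage order, the `Coordinator` class, and the stub agents are assumptions, not the paper's implementation.

```python
from enum import Enum, auto


class Stage(Enum):
    QUESTION = auto()   # question generation
    VALIDATE = auto()   # security/validation
    SCORE = auto()      # rubric-based scoring
    SUMMARIZE = auto()  # summarization/reporting
    DONE = auto()


# Assumed linear workflow; the real system may allow other transitions.
TRANSITIONS = {
    Stage.QUESTION: Stage.VALIDATE,
    Stage.VALIDATE: Stage.SCORE,
    Stage.SCORE: Stage.SUMMARIZE,
    Stage.SUMMARIZE: Stage.DONE,
}


class Coordinator:
    """Central FSM: only the agent for the current stage may run."""

    def __init__(self, agents):
        self.agents = agents
        self.stage = Stage.QUESTION

    def step(self, payload):
        if self.stage is Stage.DONE:
            raise RuntimeError("workflow already finished")
        result = self.agents[self.stage](payload)
        self.stage = TRANSITIONS[self.stage]
        return result

    def run(self, payload):
        while self.stage is not Stage.DONE:
            payload = self.step(payload)
        return payload


def make_stub(label):
    """Hypothetical agent that only records that it ran."""
    def agent(state):
        return state + [label]
    return agent


agents = {stage: make_stub(stage.name.lower()) for stage in TRANSITIONS}
coordinator = Coordinator(agents)
result = coordinator.run([])
print(result)
```

Routing every hand-off through one transition table is what lets a central coordinator enforce workflow and information-flow constraints: no agent can run out of turn.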
Field experiments (A/B testing) and willingness-to-pay experiments are necessary to quantify monetary benefits, adoption curves, and optimal pricing for alignment capabilities.
Paper explicitly recommends these empirical approaches in the recommendations for economists and product teams; this is a methodological recommendation rather than an empirical finding.
high null result A Context Alignment Pre-processor for Enhancing the Coherenc... adoption rates, willingness-to-pay, retention, task completion differences acros...
Recommended evaluation directions include automatic metrics (embedding similarity, task success, turn counts), human evaluation (satisfaction, perceived collaboration), and A/B testing in deployed settings (latency, compute, retention).
Paper's explicit evaluation proposals and recommended metrics listed in the Data & Methods and Evaluation Directions sections; these are prescriptive recommendations rather than executed experiments.
high null result A Context Alignment Pre-processor for Enhancing the Coherenc... specified evaluation metrics (task success rate, turn counts, retention, latency...
The paper focuses on architecture and conceptual arguments rather than reporting large-scale empirical datasets or results.
Data & Methods section and overall document framing emphasize architecture description and proposed evaluations; explicitly notes absence of large-scale empirical results in the provided summary.
high null result A Context Alignment Pre-processor for Enhancing the Coherenc... presence/absence of large-scale empirical evaluation
Alignment verification can be implemented using semantic embeddings (cosine similarity) or learned classifiers with threshold-based decision branching.
Paper describes these as recommended implementation approaches for the alignment verification component; no empirical benchmark comparing methods is reported.
high null result A Context Alignment Pre-processor for Enhancing the Coherenc... similarity scores, classifier accuracy, false positive/negative rates for drift ...
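A minimal sketch of the embedding-based option, assuming the check compares two embedding vectors and branches on a fixed cutoff. The function names, the `0.8` threshold, and the `"accept"`/`"revise"` labels are hypothetical; the paper reports no benchmarked values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unaligned
    return dot / (norm_a * norm_b)


def verify_alignment(context_vec, candidate_vec, threshold=0.8):
    """Threshold-based branch: accept if similar enough, otherwise revise."""
    score = cosine_similarity(context_vec, candidate_vec)
    decision = "accept" if score >= threshold else "revise"
    return decision, score


# Toy 2-D vectors stand in for real embeddings.
print(verify_alignment([1.0, 0.0], [1.0, 0.0]))
print(verify_alignment([1.0, 0.0], [0.0, 1.0]))
```

A learned classifier would replace `cosine_similarity` with a model score while keeping the same threshold branch.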
Temporal decay in the retrieval component can be modeled with functions such as exponential decay, with a tunable half-life parameter, applied to dialogue-turn embeddings.
Methodological description in the paper specifying temporal decay modeling options (exponential decay example) and tunable parameters; descriptive claim about intended implementation (no empirical comparison of decay functions provided).
high null result A Context Alignment Pre-processor for Enhancing the Coherenc... decay parameter values / impact of decay function on retrieval weighting
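One way to read the half-life parameterization is that a turn's retrieval weight halves every `half_life` turns. The sketch below assumes that form and multiplies it into a similarity score; the function names, the default half-life of 4 turns, and the sample scores are hypothetical.

```python
def decay_weight(age_in_turns, half_life):
    """Exponential decay: the weight halves every `half_life` turns."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (age_in_turns / half_life)


def rank_turns(similarities, current_turn, half_life=4.0):
    """Rank (turn_index, similarity) pairs by decay-weighted similarity."""
    scored = [
        (turn, sim * decay_weight(current_turn - turn, half_life))
        for turn, sim in similarities
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# An older but more similar turn can lose to a recent, less similar one.
ranked = rank_turns([(0, 0.9), (8, 0.6)], current_turn=8, half_life=4.0)
print(ranked)
```

A shorter half-life favors recent turns more aggressively; a longer one approaches plain similarity ranking.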
Research agenda items for economists include: quantifying willingness-to-pay for verifiable reasoning, studying labor-market impacts for validators, designing contracts/mechanisms to incentivize truthful argument provision, and evaluating regulatory interventions.
Paper's stated research and policy agenda; prescriptive rather than empirical.
high null result Argumentative Human-AI Decision-Making: Toward AI Agents Tha... existence and prioritization of empirical research on WTP, labor impacts, mechan...
Evaluation currently lacks metrics and benchmarks for argument quality, fidelity, contestability, and human trust; developing these is necessary.
Paper notes the gap and proposes evaluation metrics and experimental designs; no new benchmarks introduced.
high null result Argumentative Human-AI Decision-Making: Toward AI Agents Tha... availability and maturity of evaluation metrics and benchmarks
Methodology is primarily conceptual and normative: the paper synthesizes policy texts, safety standards, and crisis-management literature and relies on illustrative mappings and thought experiments rather than new empirical field data.
Authors' methodological description in the Data & Methods section (explicit statement about sources and use of thought experiments).
high null result Resilience Meets Autonomy: Governing Embodied AI in Critical... methodological characterization (use of conceptual synthesis vs. empirical data ...
The paper defines and specifies four oversight modes (spanning near-full autonomy to strict human control) and provides criteria for selecting modes based on task complexity, risk level, and consequence severity.
Conceptual taxonomy developed in the paper; mapping exercises and triage framework (risk–complexity–consequence) presented as illustrative mappings (no empirical testing).
high null result Resilience Meets Autonomy: Governing Embodied AI in Critical... existence and specification of four oversight modes and their mapping criteria (...
Sample sizes reported: human–AI experiment n = 126; human–human benchmark n = 108.
Study's Data & Methods section reporting sample sizes for the human–AI experiment (n = 126) and citing the human–human benchmark (Dvorak & Fehrler 2024, n = 108).
Experimental design: subjects played an indefinitely repeated Prisoner’s Dilemma in supergames with two between-subjects treatments varying chat timing (chat only before first round of each supergame vs chat before every round); the AI partner was GPT-5.2.
Methods description of the lab experiment reported in the paper: indefinitely repeated PD in supergames, two chat-frequency between-subjects treatments, AI implemented as GPT-5.2; human–AI sample n = 126.
high null result Playing Against the Machine: Cooperation, Communication, and... experimental treatment specification (chat-frequency manipulation; AI identity)
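In an indefinitely repeated game, each round is followed by another with some fixed continuation probability, so players never know which round is the last. The sketch below shows that structure and how the two chat treatments differ. The continuation probability of 0.75 and all function names are hypothetical; the summary does not report the value used in the study.

```python
import random


def supergame_length(continuation_prob, rng):
    """Draw the number of rounds in one supergame under random termination."""
    if not 0 <= continuation_prob < 1:
        raise ValueError("continuation_prob must be in [0, 1)")
    rounds = 1
    while rng.random() < continuation_prob:
        rounds += 1
    return rounds


def chat_rounds(length, chat_every_round):
    """Rounds (1-indexed) that are preceded by a chat stage."""
    return list(range(1, length + 1)) if chat_every_round else [1]


rng = random.Random(0)  # seeded for reproducibility
length = supergame_length(0.75, rng)  # 0.75 is an illustrative value
print("rounds:", length)
print("chat before first round only:", chat_rounds(length, False))
print("chat before every round:", chat_rounds(length, True))
```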
Allowing repeated pre-play communication (chat before every round) has no detectable effect on cooperation rates when the partner is an AI.
Between-subjects manipulation within the human–AI experiment comparing chat-before-first-round vs chat-before-every-round treatments (human–AI n = 126 total); statistical comparison of cooperation rates across the two chat-frequency treatments showed no detectable difference.
high null result Playing Against the Machine: Cooperation, Communication, and... effect of chat frequency on cooperation rate (difference in cooperation between ...
Initial cooperation rates against the AI (GPT-5.2) are high and comparable to initial cooperation in human–human pairs.
Laboratory experiment with human subjects playing an indefinitely repeated Prisoner’s Dilemma against an AI chatbot (GPT-5.2); human–AI sample n = 126; human–human benchmark taken from Dvorak & Fehrler (2024) with n = 108; comparison of initial-round / early-round cooperation rates across conditions.
high null result Playing Against the Machine: Cooperation, Communication, and... initial cooperation rate (cooperation in early rounds / first round of supergame...
Evaluation metrics for the architecture should include sample efficiency, generalization across tasks, robustness to distribution shift, autonomy (fraction of learning decisions made internally), transfer speed, lifelong retention, and safety/constraint adherence.
Explicit recommendations for evaluation metrics in the paper.
high null result Why AI systems don't learn and what to do about it: Lessons ... listed evaluation metrics (sample efficiency; generalization; robustness; autono...
This paper is a conceptual/theoretical architecture proposal rather than an empirical study; empirical validation should follow via suggested experiments.
Explicit statement in the paper about nature of contribution.
high null result Why AI systems don't learn and what to do about it: Lessons ... N/A (no empirical outcomes reported)
Results are from role-play contexts and short-term interventions; economic estimates of benefit require validation in field settings, across diverse populations, and with different LLMs.
Authors' caveats and limitations stated in the paper noting external validity concerns and the experimental context (role-play, short-term follow-up).
high null result Practicing with Language Models Cultivates Human Empathic Co... generalizability/external validity (not directly measured)
Outcome measures included alignment to the normative taxonomy (coding/automated), recipient-rated perceptions of being heard/validated, and blinded empathy judgments.
Methods section description listing primary and secondary outcomes used in the trial and evaluations.
high null result Practicing with Language Models Cultivates Human Empathic Co... alignment metrics, recipient-rated perceptions, blinded empathy judgments
A data-driven taxonomy was derived mapping common idiomatic empathic moves (e.g., validation, perspective-taking, emotional labeling, offers of support) used in naturalistic support conversations.
Textual analysis of the collected corpus (33,938 messages) produced an operational taxonomy of idiomatic empathic expressions used in the role-play dialogues.
high null result Practicing with Language Models Cultivates Human Empathic Co... taxonomy of empathic communication moves (categorical coding scheme)
The Lend an Ear platform collected a large conversational corpus: 33,938 messages across 2,904 conversations with 968 participants.
Dataset description reported in the paper specifying counts of participants, conversations, and messages used to build and analyze communication patterns.
high null result Practicing with Language Models Cultivates Human Empathic Co... corpus size (number of messages, conversations, participants)
Suggested empirical research directions for AI economists include: comparing LLM performance and economic outcomes on rule‑encodable vs tacit tasks; quantifying performance decline when forcing LLMs into interpretable rule representations; studying contracting/pricing where buyers cannot verify internal rules; and measuring returns to scale attributable to tacit capabilities.
Explicitly enumerated recommended research agenda items in the paper; these are proposed studies rather than executed work.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... proposed empirical research topics and corresponding outcomes to measure
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... proposed metrics for assessing tacit LLM capabilities
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... types of empirical tests recommended for validating the thesis
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... extent of empirical/quantitative evidence presented
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... presence/absence of empirical experiments in the paper
LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).
Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.
high null result HindSight: Evaluating LLM-Generated Research Ideas via Futur... LLM-judge evaluation metric (e.g., LLM-assigned quality/novelty scores for gener...
MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic and physically rich than prior datasets).
Design and motivation section in paper stating dataset construction targets clutter, occlusion, object variety, and complex object relations; dataset includes explicit contact annotations to capture interactions.
high null result MessyKitchens: Contact-rich object-level 3D scene reconstruc... dataset characteristics: levels of occlusion, object variety, and annotated obje...
MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).
Dataset description in paper: collected real-world kitchen scenes and annotated object-level 3D shapes, poses, and contact/interaction labels. (No scene/instance counts provided in the supplied summary.)
high null result MessyKitchens: Contact-rich object-level 3D scene reconstruc... dataset contents: object 3D shapes, object poses, object contact/interaction ann...
The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.
Method section / algorithm description in paper specifying the reflective/backtracking and distillation pipeline as the core of LEAFE.
high null result Internalizing Agency from Reflective Experience N/A (algorithmic procedure description rather than an outcome)
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
high null result Learning to Present: Inverse Specification Rewards for Agent... Human-quality proxy scores and comparative model rankings
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
high null result Learning to Present: Inverse Specification Rewards for Agent... Components of the reward signal used for RL training
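A composite reward of this kind is often implemented as a weighted average of per-component scores. The sketch below assumes that form, with every component scored in [0, 1]. The component keys, the weights, and the aggregation rule are assumptions for illustration; the summary does not state how the paper combines its components.

```python
# Component names paraphrase the list above; the keys are hypothetical.
COMPONENTS = ("structure", "render", "aesthetic", "content", "inverse_spec")


def composite_reward(scores, weights):
    """Weighted average of component scores, each assumed to lie in [0, 1]."""
    missing = [c for c in COMPONENTS if c not in scores or c not in weights]
    if missing:
        raise KeyError(f"missing components: {missing}")
    total_weight = sum(weights[c] for c in COMPONENTS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[c] * scores[c] for c in COMPONENTS)
    return weighted / total_weight


# Illustrative values only.
weights = {c: 1.0 for c in COMPONENTS}
scores = {
    "structure": 1.0,
    "render": 0.5,
    "aesthetic": 0.5,
    "content": 0.5,
    "inverse_spec": 0.0,
}
reward = composite_reward(scores, weights)
print(reward)
```

Normalizing by the total weight keeps the reward in [0, 1] however the weights are tuned.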
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
high null result Learning to Present: Inverse Specification Rewards for Agent... Environment capabilities: OpenEnv compatibility and tool-use support
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
high null result Learning to Present: Inverse Specification Rewards for Agent... Availability of experiment code (GitHub repo)
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
high null result Learning to Present: Inverse Specification Rewards for Agent... Number of rollout trajectories in dataset (288) and coverage across models (6)
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
high null result Learning to Present: Inverse Specification Rewards for Agent... Number of evaluation tasks (48 briefs) and number of models compared (6)
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
high null result Learning to Present: Inverse Specification Rewards for Agent... Source of demonstration prompts (Claude Opus 4.6)
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
high null result Learning to Present: Inverse Specification Rewards for Agent... Proportion of model parameters updated during training (0.5%)
Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.
Summary states measured benefits were qualitative and process metrics; no detailed quantitative throughput/coverage numbers provided. (Meta-claim about the evidence reported.)
high null result ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... absence of detailed quantitative validation metrics in the reported results
Evaluation used seven benchmarks spanning online computer-use, offline computer-use, and multimodal tool-use reasoning tasks.
Benchmarks section in the summary states seven benchmarks covering those categories; no benchmark names or dataset sizes provided in the summary.
high null result Anticipatory Planning for Multimodal AI Agents benchmark task performance (task success, generalization)
Objectives combine trajectory-level rewards (for global consistency) with stepwise grounded rewards derived from execution outcomes.
Method summary explicitly lists these objectives as part of the TraceR1 training procedure.
high null result Anticipatory Planning for Multimodal AI Agents global plan consistency and stepwise execution outcomes
TraceR1 focuses on short-horizon trajectory forecasting to keep predictions tractable while capturing near-term consequences of actions.
Framework description in summary that emphasizes 'short-horizon trajectory forecasting' as a design choice.
high null result Anticipatory Planning for Multimodal AI Agents forecast horizon (short-horizon) / tractability of predictions
During grounded fine-tuning, tools are treated as frozen agents and only the policy is adjusted using execution feedback (tools are not modified).
Explicit statement in Data & Methods section of the summary describing tool handling during grounded fine-tuning.
high null result Anticipatory Planning for Multimodal AI Agents policy adaptation to tool execution feedback / tool-compatibility of executed ac...
Stage 2 of TraceR1 is a grounded fine-tuning phase that refines step-level accuracy and executability using execution feedback from frozen tool agents.
Method description in summary: Stage 2 — Grounded fine-tuning using execution feedback; tools are not retrained (treated as frozen agents) and feedback is used to adjust the policy.
high null result Anticipatory Planning for Multimodal AI Agents step-level execution accuracy and executability