The Commonplace

Evidence (13870 claims)

Adoption: 8467 claims
Productivity: 7558 claims
Governance: 6805 claims
Human-AI Collaboration: 6363 claims
Org Design: 4132 claims
Innovation: 4065 claims
Labor Markets: 3526 claims
Skills & Training: 2945 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 749 196 98 892 1984
Governance & Regulation 817 394 188 121 1544
Organizational Efficiency 771 189 124 83 1177
Technology Adoption Rate 627 233 123 96 1088
Research Productivity 411 123 56 332 933
Output Quality 467 178 59 47 751
Decision Quality 320 174 75 42 618
Firm Productivity 435 55 88 20 604
AI Safety & Ethics 214 276 65 33 593
Market Structure 178 167 122 24 496
Task Allocation 207 64 71 32 379
Skill Acquisition 165 59 60 17 301
Innovation Output 203 27 43 18 292
Employment Level 105 52 107 13 279
Fiscal & Macroeconomic 131 69 43 26 276
Consumer Welfare 116 63 42 11 232
Firm Revenue 150 48 26 3 227
Inequality Measures 44 122 49 6 221
Task Completion Time 169 29 8 12 219
Worker Satisfaction 89 63 20 12 184
Error Rate 69 92 10 2 173
Regulatory Compliance 76 68 14 5 163
Training Effectiveness 93 21 13 19 148
Wages & Compensation 77 36 25 6 144
Automation Exposure 51 54 22 12 142
Team Performance 86 17 27 9 140
Developer Productivity 94 17 14 6 132
Job Displacement 12 80 20 1 113
Hiring & Recruitment 51 7 8 3 69
Creative Output 31 17 7 3 59
Skill Obsolescence 5 46 6 1 58
Social Protection 27 16 8 2 53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
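The rows above can be split into an outcome label and its counts with a short script. This is a minimal sketch in Python, assuming each complete row ends with four direction counts followed by a total; `parse_row` and `positive_share` are illustrative names, not part of the site. Rows with fewer than five trailing numbers (the last three) are returned unaligned, because the flattened text does not show which column is empty. The Total value is used as reported rather than recomputed, since it does not always equal the sum of the four direction columns (the Governance & Regulation counts sum to 1520 against a reported 1544).

```python
from typing import NamedTuple, Optional

COLUMNS = ("positive", "negative", "mixed", "null", "total")


class MatrixRow(NamedTuple):
    outcome: str
    counts: Optional[dict]  # None when the row cannot be aligned to COLUMNS
    raw: tuple              # trailing integers exactly as they appear


def parse_row(line: str) -> MatrixRow:
    """Split a flattened matrix row into its label and trailing integers."""
    tokens = line.split()
    numbers = []
    # Peel integers off the right-hand side; what remains is the label.
    while tokens and tokens[-1].isdigit():
        numbers.append(int(tokens.pop()))
    numbers.reverse()
    outcome = " ".join(tokens)
    counts = dict(zip(COLUMNS, numbers)) if len(numbers) == len(COLUMNS) else None
    return MatrixRow(outcome, counts, tuple(numbers))


def positive_share(row: MatrixRow) -> Optional[float]:
    """Share of a row's reported total that is tagged positive."""
    if row.counts is None or row.counts["total"] == 0:
        return None
    return row.counts["positive"] / row.counts["total"]


firm_revenue = parse_row("Firm Revenue 150 48 26 3 227")
turnover = parse_row("Worker Turnover 11 12 3 26")
```

Here `positive_share(firm_revenue)` is 150/227, while `positive_share(turnover)` is `None` because that row carries only four numbers.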
New metrics are needed to value tacit capabilities — e.g., measures of transfer, generalization under distribution shifts, ease of integrating with human workflows, and irreducibility to compressed rule representations.
Methodological recommendation in the paper listing specific metric categories for future empirical work.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... proposed metrics for assessing tacit LLM capabilities
Suggested empirical validations (not performed) include benchmarking LLMs versus rule systems on allegedly rule‑encodable tasks, attempting rule extraction and measuring fidelity loss, and compression/distillation studies to quantify irreducible task performance.
Recommendations and proposed experimental directions listed in the paper; these are proposals, not executed studies.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... types of empirical tests recommended for validating the thesis
The paper contains mostly qualitative and historically grounded empirical content and reports no primary datasets or large‑scale experimental results in support of the formal thesis.
Explicit declaration in the Data & Methods section that empirical content is qualitative/historical and no new datasets were collected.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... extent of empirical/quantitative evidence presented
The paper's core methodological approach is conceptual and theoretical argumentation (formal/logical proof, historical examples, and philosophical framing), not empirical experimentation.
Stated Data & Methods description indicating reliance on formal logic, historical case analysis, and philosophical argument; absence of primary datasets.
high null result Why the Valuable Capabilities of LLMs Are Precisely the Unex... presence/absence of empirical experiments in the paper
LLM-as-Judge finds no significant difference between the retrieval-augmented and vanilla generators (p = 0.584).
Comparative evaluation using standard LLM-as-Judge metrics reported in the paper on the same experimental setup; reported p-value = 0.584.
high null result HindSight: Evaluating LLM-Generated Research Ideas via Futur... LLM-judge evaluation metric (e.g., LLM-assigned quality/novelty scores for gener...
MessyKitchens is designed to stress occlusion, object variety, and complex inter-object relations (i.e., it is more realistic/physically-rich than prior datasets).
Design and motivation section in paper stating dataset construction targets clutter, occlusion, object variety, and complex object relations; dataset includes explicit contact annotations to capture interactions.
high null result MessyKitchens: Contact-rich object-level 3D scene reconstruc... dataset characteristics: levels of occlusion, object variety, and annotated obje...
MessyKitchens is a high-fidelity real-world dataset of cluttered indoor kitchen scenes with object-level 3D ground truth (object shapes, object poses, and explicit contact information between objects).
Dataset description in paper: collected real-world kitchen scenes and annotated object-level 3D shapes, poses, and contact/interaction labels. (No scene/instance counts provided in the supplied summary.)
high null result MessyKitchens: Contact-rich object-level 3D scene reconstruc... dataset contents: object 3D shapes, object poses, object contact/interaction ann...
The LEAFE algorithmic procedure: summarize environment feedback into compact experience items; backtrack to earlier decision points causally linked to failures and re-explore corrective action branches; distill corrected trajectories into the policy via supervised fine-tuning.
Method section / algorithm description in paper specifying the reflective/backtracking and distillation pipeline as the core of LEAFE.
high null result Internalizing Agency from Reflective Experience N/A (algorithmic procedure description rather than an outcome)
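The three steps in the LEAFE claim above (summarize feedback, backtrack and re-explore, distill) can be illustrated on a toy problem. This is a sketch under strong assumptions, not the paper's implementation: the environment, the fixed action set, and every function name here are hypothetical, the toy environment reports the failing step directly, and the distillation step only builds supervised (prefix, action) pairs rather than fine-tuning a model.

```python
from dataclasses import dataclass

ACTIONS = ("a", "b", "c")  # hypothetical action set


@dataclass(frozen=True)
class Experience:
    failed_step: int
    note: str


def run_env(actions, target):
    """Toy environment: (success, index of the first wrong step or None)."""
    for i, (got, want) in enumerate(zip(actions, target)):
        if got != want:
            return False, i
    return True, None


def summarize(step, actions):
    """Compress raw feedback into a compact experience item."""
    return Experience(step, f"action {actions[step]!r} at step {step} led to failure")


def backtrack_and_repair(actions, target, max_rounds=10):
    """Return to the failing decision point and try alternative branches."""
    actions = list(actions)
    experiences = []
    for _ in range(max_rounds):
        ok, step = run_env(actions, target)
        if ok:
            return actions, experiences
        experiences.append(summarize(step, actions))
        for alt in ACTIONS:
            if alt == actions[step]:
                continue
            candidate = actions[:step] + [alt] + actions[step + 1:]
            ok_alt, step_alt = run_env(candidate, target)
            # Keep the branch if it succeeds or fails later than before.
            if ok_alt or step_alt > step:
                actions = candidate
                break
        else:
            return None, experiences  # no alternative helped
    return None, experiences


def to_sft_examples(trajectory):
    """Turn a corrected trajectory into (prefix, next action) training pairs."""
    return [(tuple(trajectory[:i]), trajectory[i]) for i in range(len(trajectory))]


fixed, notes = backtrack_and_repair(["a", "c", "c"], ["a", "b", "c"])
sft_pairs = to_sft_examples(fixed)
```

In this run the failure at step 1 is recorded once, the branch `"b"` repairs it, and the corrected trajectory yields three supervised pairs.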
Human-quality proxies were used for evaluation and comparisons were made against Claude Opus 4.6 and other baselines.
Evaluation description: use of human-quality proxy metrics and direct comparisons across models on the 48-brief benchmark.
high null result Learning to Present: Inverse Specification Rewards for Agent... Human-quality proxy scores and comparative model rankings
The reward function is a composite multi-component signal combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics (factuality, coverage, coherence), and an inverse-specification reward.
Reward design section enumerating each component and how they contribute to the composite reward used in RL training.
high null result Learning to Present: Inverse Specification Rewards for Agent... Components of the reward signal used for RL training
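One simple way to combine the listed components is a normalized weighted sum. The sketch below assumes that form, assumes every component is scored in [0, 1], and uses equal weights; none of these choices is stated in the summary above, and the names `WEIGHTS`, `content_quality`, and `composite_reward` are hypothetical.

```python
from statistics import mean

# Hypothetical equal weights; the source does not report the actual weighting.
WEIGHTS = {
    "structural_validation": 0.2,
    "render_quality": 0.2,
    "aesthetic_score": 0.2,
    "content_quality": 0.2,
    "inverse_specification": 0.2,
}


def content_quality(factuality: float, coverage: float, coherence: float) -> float:
    """Average the three content sub-metrics named in the claim."""
    return mean([factuality, coverage, coherence])


def composite_reward(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of component scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    total_weight = sum(weights.values())
    return sum(weights[n] * scores[n] for n in weights) / total_weight


example_scores = {
    "structural_validation": 1.0,
    "render_quality": 0.5,
    "aesthetic_score": 0.5,
    "content_quality": content_quality(1.0, 0.5, 0.0),
    "inverse_specification": 0.0,
}
example_reward = composite_reward(example_scores)
```

Dividing by the total weight keeps the result in [0, 1] even if the weights are later changed and no longer sum to one.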
The RL environment is OpenEnv-compatible and enables agent tool use for web/knowledge access, planning, and a rendering pipeline.
Methods description: OpenEnv-compatible RL environment with tool interfaces (web/knowledge access and rendering) used during multi-turn planning and execution.
high null result Learning to Present: Inverse Specification Rewards for Agent... Environment capabilities: OpenEnv compatibility and tool-use support
Code for the environment and experiments is released at the specified GitHub repository.
Artifacts: code release reported at https://github.com/pushing-the-frontier/slide-forge-llm.
high null result Learning to Present: Inverse Specification Rewards for Agent... Availability of experiment code (GitHub repo)
The SlideRL dataset of 288 multi-turn rollout trajectories across six models is released for reproducibility.
Artifacts released: SlideRL dataset reported as 288 multi-turn rollouts, hosted at provided Hugging Face URL.
high null result Learning to Present: Inverse Specification Rewards for Agent... Number of rollout trajectories in dataset (288) and coverage across models (6)
Evaluation was conducted on 48 diverse business briefs across six models.
Data & Methods: evaluation suite comprised 48 business briefs selected for diversity; six models compared.
high null result Learning to Present: Inverse Specification Rewards for Agent... Number of evaluation tasks (48 briefs) and number of models compared (6)
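The reported counts are mutually consistent: 48 briefs times six models gives 288, which matches the size of the SlideRL release. The check below is a small sketch; the reading that each model produced exactly one rollout per brief is an inference from the arithmetic, not something the summary states.

```python
def rollouts_per_pair(n_rollouts: int, n_briefs: int, n_models: int) -> float:
    """Average number of rollouts per (brief, model) pair."""
    return n_rollouts / (n_briefs * n_models)


# Figures as reported in the claims above.
ratio = rollouts_per_pair(n_rollouts=288, n_briefs=48, n_models=6)
```

A ratio of exactly 1.0 is what one rollout per brief per model would produce.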
Training prompts were derived from expert demonstrations collected using Claude Opus 4.6 to bootstrap training data.
Methods: expert demonstration prompts collected from Claude Opus 4.6 used as seed/bootstrapping data for training.
high null result Learning to Present: Inverse Specification Rewards for Agent... Source of demonstration prompts (Claude Opus 4.6)
Fine-tuning was done parameter-efficiently: only 0.5% of the Qwen2.5-Coder-7B parameters were trained using GRPO.
Methods section: GRPO-based reinforcement learning fine-tuning, with parameter-efficient update covering 0.5% of model parameters.
high null result Learning to Present: Inverse Specification Rewards for Agent... Proportion of model parameters updated during training (0.5%)
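For a sense of scale, the trained fraction can be converted into a rough parameter count. The sketch assumes the nominal 7 billion parameters implied by the model name; the exact parameter count is not given in the summary, so the result is an order-of-magnitude estimate only.

```python
from fractions import Fraction

NOMINAL_PARAMS = 7_000_000_000        # assumption: nominal size from the model name
TRAINED_FRACTION = Fraction("0.005")  # 0.5% as reported


def trainable_parameters(total: int, fraction: Fraction) -> int:
    """Exact count of parameters covered by the trained fraction."""
    return int(total * fraction)


approx_trainable = trainable_parameters(NOMINAL_PARAMS, TRAINED_FRACTION)
```

Under that assumption, roughly 35 million parameters were updated.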
Detailed quantitative coverage, throughput, or other numeric validation metrics were not reported beyond the timeline (quarter-level) claim.
Summary states measured benefits were qualitative and process metrics; no detailed quantitative throughput/coverage numbers provided. (Meta-claim about the evidence reported.)
high null result ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... absence of detailed quantitative validation metrics in the reported results
Evaluation used seven benchmarks spanning online computer-use, offline computer-use, and multimodal tool-use reasoning tasks.
Benchmarks section in the summary states seven benchmarks covering those categories; no benchmark names or dataset sizes provided in the summary.
high null result Anticipatory Planning for Multimodal AI Agents benchmark task performance (task success, generalization)
Objectives combine trajectory-level rewards (for global consistency) with stepwise grounded rewards derived from execution outcomes.
Method summary explicitly lists these objectives as part of the TraceR1 training procedure.
high null result Anticipatory Planning for Multimodal AI Agents global plan consistency and stepwise execution outcomes
TraceR1 focuses on short-horizon trajectory forecasting to keep predictions tractable while capturing near-term consequences of actions.
Framework description in summary that emphasizes 'short-horizon trajectory forecasting' as a design choice.
high null result Anticipatory Planning for Multimodal AI Agents forecast horizon (short-horizon) / tractability of predictions
During grounded fine-tuning, tools are treated as frozen agents and only the policy is adjusted using execution feedback (tools are not modified).
Explicit statement in Data & Methods section of the summary describing tool handling during grounded fine-tuning.
high null result Anticipatory Planning for Multimodal AI Agents policy adaptation to tool execution feedback / tool-compatibility of executed ac...
Stage 2 of TraceR1 is a grounded fine-tuning phase that refines step-level accuracy and executability using execution feedback from frozen tool agents.
Method description in summary: Stage 2 — Grounded fine-tuning using execution feedback; tools are not retrained (treated as frozen agents) and feedback is used to adjust the policy.
high null result Anticipatory Planning for Multimodal AI Agents step-level execution accuracy and executability
TraceR1 uses a two-stage training procedure. Stage 1 applies trajectory-level RL to predicted short-horizon trajectories, with rewards that enforce global consistency.
Method description in summary: Stage 1 — Trajectory-level RL with trajectory-level rewards to encourage global consistency across predicted action-state sequences.
high null result Anticipatory Planning for Multimodal AI Agents trajectory-level plan coherence / global consistency
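The two reward signals described for TraceR1 can be sketched as two small functions: one scoring a whole predicted trajectory, one scoring each step by whether a fixed tool executes it. This is an illustrative toy under stated assumptions. The match-against-reference reward, the boolean tool interface, and all names are hypothetical and stand in for whatever the paper actually uses.

```python
from typing import Callable, Sequence


def trajectory_reward(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Stage 1 stand-in: fraction of positions where the predicted plan
    agrees with a reference plan (a crude proxy for global consistency)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / max(len(predicted), len(reference))


def grounded_step_rewards(
    actions: Sequence[str], tool: Callable[[str], bool]
) -> list:
    """Stage 2 stand-in: 1.0 for each action the tool executes, else 0.0.
    The tool is only called, never modified, mirroring the frozen-tool setup."""
    return [1.0 if tool(action) else 0.0 for action in actions]


def frozen_tool(action: str) -> bool:
    """Hypothetical tool that can execute a fixed set of actions."""
    return action in {"open", "click", "type"}


stage1 = trajectory_reward(["open", "click", "type"], ["open", "click", "save"])
stage2 = grounded_step_rewards(["open", "scroll"], frozen_tool)
```

Only the policy that proposes actions would be updated from these signals; `frozen_tool` stays unchanged throughout.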
Evaluation combined verifiability checks (fact/claim accuracy where possible), qualitative coding of strategic reasoning, and longitudinal comparison across nodes.
Methods description detailing a mixed evaluation approach: verifiability checks for factual items, qualitative coding for strategic narratives, and longitudinal comparisons over the 11 nodes.
high null result When AI Navigates the Fog of War evaluation components (verifiability checks, qualitative coding, longitudinal an...
The evaluation was conducted at 11 discrete temporal nodes during the crisis to capture changing public information and uncertainty.
Methods specification: definition and use of 11 temporal nodes as the backbone of the temporally grounded evaluation.
high null result When AI Navigates the Fog of War number of temporal nodes
The study used 42 node-specific verifiable questions plus 5 broader exploratory prompts to probe factual inferences and higher-level strategic reasoning.
Methods specification: explicit count of 42 verifiable, node-specific questions and 5 exploratory prompts designed by the study authors.
high null result When AI Navigates the Fog of War number and type of questions/prompts used
Measuring the marginal cost of runtime governance, the tradeoff curve between task completion and compliance risk, and calibrating violation probabilities are open empirical research questions identified by the paper.
Explicit list of open problems and proposed empirical research agenda in the Implications/Measurement sections of the paper.
high null result Runtime Governance for AI Agents: Policies on Paths existence of empirical research gaps (identified/not identified)
No large empirical dataset or large-scale field experiments were used; the work is primarily theoretical/formal with simulations and worked examples rather than empirical validation.
Paper's Methods/Data section explicitly states the work is theoretical/formal and lists reference implementation and simulations instead of large empirical studies.
high null result Runtime Governance for AI Agents: Policies on Paths use of empirical data (presence/absence of large-scale empirical evaluation)
Risk calibration—mapping violation probabilities to enforcement actions and thresholds—is a key unsolved operational problem for runtime governance.
Paper highlights open problems including risk calibration; argued via conceptual analysis and operational concerns (false positives/negatives, costs of blocking actions).
high null result Runtime Governance for AI Agents: Policies on Paths existence of calibrated thresholds and procedures (presence/absence)
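The calibration problem described above amounts to choosing cut points that map an estimated violation probability to an enforcement action. A minimal sketch of such a mapping follows; the thresholds and action names are placeholders chosen for illustration, and choosing them well is exactly the open problem the paper identifies.

```python
import bisect

# Hypothetical cut points and actions; neither comes from the paper.
THRESHOLDS = (0.05, 0.30, 0.70)
ENFORCEMENT = ("allow", "log", "require_approval", "block")


def enforcement_action(violation_probability: float) -> str:
    """Map an estimated violation probability to an enforcement action."""
    if not 0.0 <= violation_probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    # bisect_right counts how many thresholds are <= the probability.
    return ENFORCEMENT[bisect.bisect_right(THRESHOLDS, violation_probability)]


decisions = [enforcement_action(p) for p in (0.0, 0.05, 0.5, 0.9)]
```

Lowering a threshold trades more blocked or escalated actions (false positives) for fewer missed violations, which is the cost tradeoff the claims above describe.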
Because the sample is non-representative (support-group recruitment and media cases) and small (19 users), the authors note that generalizability is limited and the sample is biased toward more severe cases.
Limitations section stating recruitment sources, small N, and bias toward severe cases.
high null result Characterizing Delusional Spirals through Human-LLM Chat Log... representativeness and generalizability of the sample
The study analyzed conversation logs from 19 users who reported psychological harm associated with chatbot use, comprising a total corpus of 391,562 messages (user + chatbot).
Dataset described in paper: 19 users' conversation logs aggregated; total message count reported as 391,562 messages across user and chatbot messages.
high null result Characterizing Delusional Spirals through Human-LLM Chat Log... size of dataset (number of users and total messages)
Two Doherty power amplifier prototypes with GaN HEMT transistors and three-port pixelated combiners were fabricated and tested at 2.75 GHz.
Paper reports fabrication of two prototypes built with GaN HEMT transistors and the optimized three-port pixelated combiners; RF characterization performed at 2.75 GHz.
high null result Deep Learning-Driven Black-Box Doherty Power Amplifier with ... number of fabricated prototypes and test frequency (2.75 GHz)
Key measurable metrics for future evaluation include contest frequency and outcomes, time-to-help for different groups, user satisfaction, perceived fairness, incidence of automation bias, and usability/access disparities.
List of proposed metrics in the paper's evaluation agenda.
high null result Designing for Disagreement: Front-End Guardrails for Assista... the specified metrics (contest frequency/outcomes, time-to-help, satisfaction, p...
The paper does not report empirical data; instead it provides a vignette and a proposed evaluation agenda (user studies, field pilots, A/B tests, logs, surveys).
Explicit methodological statement in the Data & Methods section summarised by the authors; factual description of the paper's empirical status.
high null result Designing for Disagreement: Front-End Guardrails for Assista... presence/absence of empirical data in the paper (binary)
The pattern provides an outcome-specific, easy-to-use contest channel allowing users to contest particular decisions without renegotiating global rules.
Design element described in the paper and exemplified in the vignette; proposed contest metrics and evaluation agenda but no empirical data.
high null result Designing for Disagreement: Front-End Guardrails for Assista... availability and specificity of contest channels (system functionality)
The pattern requires legibility at the contact point so the robot clearly communicates which active mode is in use and why when deferring or prioritizing.
Design specification and rationale in the paper; supported by the public-concourse vignette; no empirical measurement.
high null result Designing for Disagreement: Front-End Guardrails for Assista... legibility of active mode (user understanding at time of deferral)
The pattern constrains prioritization to a governance-approved menu of admissible modes, limiting the policy space to vetted options.
Design specification in the paper (architectural requirement); illustrated in the vignette; no empirical testing.
high null result Designing for Disagreement: Front-End Guardrails for Assista... existence of governance-approved admissible modes (system design property)
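The three design elements in this group (an approved menu of modes, legibility at the contact point, and an outcome-specific contest channel) can be sketched as a small data model. The mode names, the approved set, and the function names below are all hypothetical; the paper describes the pattern, not this code.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FIRST_COME = "first come, first served"
    URGENCY_FIRST = "urgent requests first"
    HIGHEST_BIDDER = "highest bidder first"  # defined but not approved


# Hypothetical governance-approved menu.
APPROVED_MODES = frozenset({Mode.FIRST_COME, Mode.URGENCY_FIRST})


def select_mode(requested: Mode) -> Mode:
    """Allow only modes that are on the approved menu."""
    if requested not in APPROVED_MODES:
        raise PermissionError(f"{requested.name} is not an approved mode")
    return requested


def explain(mode: Mode, reason: str) -> str:
    """Legibility: state the active mode and why it applies."""
    return f"Active mode: {mode.value}. Reason: {reason}"


@dataclass(frozen=True)
class Contest:
    """A challenge to one specific decision, not to the global rules."""
    decision_id: str
    mode: Mode
    complaint: str


active = select_mode(Mode.URGENCY_FIRST)
message = explain(active, "a medical request is waiting")
contest = Contest("decision-001", active, "I was waiting longer")
```

Because a `Contest` records a single decision and the mode in force, it can feed the contest-frequency and outcome metrics proposed in the evaluation agenda.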
Metrics used to evaluate agents include operational stability (e.g., variance or frequency of catastrophic failures), efficiency (e.g., cost/profit/fulfillment), and degradation across increasing task complexity.
Methods and experimental sections specifying the metrics applied to compare ESE and baselines on RetailBench environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... operational stability, efficiency, and robustness/degradation metrics
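The metric families named above can be computed from per-episode outcomes in a few lines. The definitions below are one plausible reading, assumed for illustration: stability as profit variance plus the rate of episodes below a catastrophe threshold, and degradation as the relative drop from the easiest to the hardest environment. The paper's exact definitions are not given in the summary.

```python
from statistics import mean, pvariance


def stability(profits, catastrophe_threshold):
    """Variance of profit and share of episodes below the threshold."""
    catastrophic = sum(p < catastrophe_threshold for p in profits)
    return {
        "variance": pvariance(profits),
        "catastrophic_rate": catastrophic / len(profits),
    }


def efficiency(profits):
    """Mean profit per episode."""
    return mean(profits)


def degradation(mean_by_level):
    """Relative drop in mean performance from the first to the last level."""
    first, last = mean_by_level[0], mean_by_level[-1]
    if first == 0:
        raise ValueError("first-level mean must be non-zero")
    return (first - last) / abs(first)


episode_profits = [10, 12, -50, 8]  # made-up numbers for illustration
stats = stability(episode_profits, catastrophe_threshold=0)
drop = degradation([100, 80, 50])
```

With these made-up inputs, one episode in four is catastrophic and performance halves between the easiest and hardest levels.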
Baselines used in comparisons include monolithic LLM agents and other existing agent architectures that do not implement explicit strategy/execution separation.
Experimental design: baseline descriptions in the methods section specifying monolithic LLM agents and additional architectures lacking explicit temporal decomposition.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... baseline agent architectures used for comparison
Eight state-of-the-art LLMs were evaluated in the study.
Experimental setup description listing eight contemporary LLMs tested across RetailBench environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... number of LLMs evaluated (n = 8)
The paper proposes Evolving Strategy & Execution (ESE), a two-tier architecture that separates high-level strategic reasoning (updated at a slower temporal scale) from low-level execution (short-term action selection).
Architectural design described in the methods: explicit decomposition into strategy and execution modules with differing update cadences and stated interpretability/adaptation mechanisms.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... agent architectural modularity (temporal decomposition into strategy vs executio...
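The core idea of the two-tier design, a strategy revised on a slow clock and an executor that acts every step, can be shown with a small class. This is a sketch of the general pattern only: the update period, the toy inventory rules, and the names are hypothetical, and ESE itself uses LLM-based modules rather than these functions.

```python
class TwoTierAgent:
    """Strategy is refreshed every `strategy_period` steps; execution runs every step."""

    def __init__(self, strategy_fn, execute_fn, strategy_period):
        if strategy_period < 1:
            raise ValueError("strategy_period must be at least 1")
        self.strategy_fn = strategy_fn
        self.execute_fn = execute_fn
        self.strategy_period = strategy_period
        self.strategy = None
        self.step = 0
        self.strategy_updates = 0

    def act(self, observation):
        # Slow tier: revise the strategy only on its own cadence.
        if self.step % self.strategy_period == 0:
            self.strategy = self.strategy_fn(observation, self.strategy)
            self.strategy_updates += 1
        # Fast tier: choose an action under the current strategy.
        action = self.execute_fn(observation, self.strategy)
        self.step += 1
        return action


def target_stock(observation, previous_strategy):
    """Toy strategy: hold twice the observed demand."""
    return observation["demand"] * 2


def order_quantity(observation, strategy):
    """Toy execution: order up to the target stock level."""
    return max(0, strategy - observation["stock"])


agent = TwoTierAgent(target_stock, order_quantity, strategy_period=3)
orders = [agent.act({"demand": 5, "stock": s}) for s in range(7)]
```

Over seven steps the strategy is revised three times (steps 0, 3, and 6) while an order is placed at every step.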
RetailBench environments are progressively challenging to stress-test adaptation and planning capabilities (i.e., environments increase in complexity, stochasticity, and non-stationarity).
Benchmark construction described in the paper: multiple environment difficulty levels used to evaluate degradation under increasing challenge; experiments run across these progressive environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... environment difficulty gradient (complexity/stochasticity/non-stationarity level...
The paper introduces RetailBench, a high-fidelity long-horizon benchmark for realistic commercial decision-making under stochastic demand and evolving external conditions (non-stationarity).
Design and presentation of the benchmark in the paper: simulated commercial operations with stochastic demand processes and shifting external factors; emphasis on long-horizon evaluation and progressively challenging environments.
high null result RetailBench: Evaluating Long-Horizon Autonomous Decision-Mak... benchmark realism and coverage of non-stationarity for long-horizon decision-mak...
Roughly 25% of the training corpus is Italian-language data.
Corpus composition reported by the authors: Italian-language share ≈25% of total training tokens. The summary cites this proportion but does not list the datasets or language-detection methodology.
high null result EngGPT2: Sovereign, Efficient and Open Intelligence percentage share of Italian-language tokens in the training corpus
The model was trained on approximately 2.5 trillion tokens of data.
Training-data size reported in the paper (aggregate token count ≈2.5T). The summary provides this number; no per-dataset breakdown or provenance details are included in the summary.
high null result EngGPT2: Sovereign, Efficient and Open Intelligence total number of training tokens
Approximately 3 billion parameters (~3B) are active per forward pass, owing to sparse activation.
Paper reports sparse MoE design with ≈3B active parameters per forward pass. Evidence comes from model design description (active set / routing), not from independent runtime FLOP logs in the summary.
high null result EngGPT2: Sovereign, Efficient and Open Intelligence active parameters used per inference
EngGPT2-16B-A3B is a Mixture-of-Experts (MoE) model trained from scratch with a total of 16 billion parameters.
Model specification reported in the paper: architecture described as MoE and total parameter count listed as 16B. No contrary empirical test needed; claim is a declarative model spec.
high null result EngGPT2: Sovereign, Efficient and Open Intelligence model architecture and total parameter count
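Two derived quantities follow directly from the reported EngGPT2 figures: the approximate number of Italian-language tokens and the share of parameters active at runtime. The arithmetic below uses the rounded values from the claims above, so the results are approximations rather than reported numbers.

```python
TOTAL_TOKENS = 2.5e12   # ~2.5 trillion training tokens, as reported
ITALIAN_SHARE = 0.25    # ~25% Italian-language data, as reported
TOTAL_PARAMS = 16e9     # 16B total parameters
ACTIVE_PARAMS = 3e9     # ~3B active parameters

italian_tokens = TOTAL_TOKENS * ITALIAN_SHARE
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
```

This gives roughly 625 billion Italian tokens and about 18.75% of parameters active at runtime.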
The project developed domain- and specialty-focused models: Fanar-Sadiq (Islamic content multi-agent architecture), Fanar-Diwan (classical Arabic poetry), and FanarShaheen (bilingual translation).
Paper enumerates these domain/specialty models and their stated focuses as part of the product stack.
high null result Fanar 2.0: Arabic Generative AI Stack existence and intended domain of specialized models
FanarGuard is a 4B bilingual moderation model focused on Arabic safety and cultural alignment.
Paper lists FanarGuard in the expanded product stack and specifies model size (4B) and bilingual moderation purpose emphasizing Arabic safety/cultural alignment.
high null result Fanar 2.0: Arabic Generative AI Stack model existence, size (4B), and intended function (bilingual moderation)
Fanar-27B was produced by continual pre-training from a Gemma-3-27B backbone.
Paper describes model development: continual pre-training of Fanar-27B from the Gemma-3-27B backbone.
high null result Fanar 2.0: Arabic Generative AI Stack model lineage/architecture (Fanar-27B ← Gemma-3-27B)