The Commonplace

Evidence (13870 claims)

Adoption (8467 claims)
Productivity (7558 claims)
Governance (6805 claims)
Human-AI Collaboration (6363 claims)
Org Design (4132 claims)
Innovation (4065 claims)
Labor Markets (3526 claims)
Skills & Training (2945 claims)
Inequality (2066 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome                     Positive Negative    Mixed     Null    Total
Other                            749      196       98      892     1984
Governance & Regulation          817      394      188      121     1544
Organizational Efficiency        771      189      124       83     1177
Technology Adoption Rate         627      233      123       96     1088
Research Productivity            411      123       56      332      933
Output Quality                   467      178       59       47      751
Decision Quality                 320      174       75       42      618
Firm Productivity                435       55       88       20      604
AI Safety & Ethics               214      276       65       33      593
Market Structure                 178      167      122       24      496
Task Allocation                  207       64       71       32      379
Skill Acquisition                165       59       60       17      301
Innovation Output                203       27       43       18      292
Employment Level                 105       52      107       13      279
Fiscal & Macroeconomic           131       69       43       26      276
Consumer Welfare                 116       63       42       11      232
Firm Revenue                     150       48       26        3      227
Inequality Measures               44      122       49        6      221
Task Completion Time             169       29        8       12      219
Worker Satisfaction               89       63       20       12      184
Error Rate                        69       92       10        2      173
Regulatory Compliance             76       68       14        5      163
Training Effectiveness            93       21       13       19      148
Wages & Compensation              77       36       25        6      144
Automation Exposure               51       54       22       12      142
Team Performance                  86       17       27        9      140
Developer Productivity            94       17       14        6      132
Job Displacement                  12       80       20        1      113
Hiring & Recruitment              51        7        8        3       69
Creative Output                   31       17        7        3       59
Skill Obsolescence                 5       46        6        1       58
Social Protection                 27       16        8        2       53
Labor Share of Income 17 17 17 51
Worker Turnover 11 12 3 26
Industry 1 1
Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.
Paper describes and releases an open-source harness for orchestrating LLMs/agents and provides standardized scenarios and evaluation tools meant for reproducibility.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | availability of open-source orchestration code and standardized evaluation scena...
Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.
Paper reports organization/validation via a NeurIPS 2025 competition, states participation of 100+ teams, and includes documentation/analyses of top submissions.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | number of competing teams (100+), availability of competition analyses/winning s...
The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.
Paper/documentation notes a live leaderboard for Battling and provides self-contained evaluation pipelines/orchestration for Speedrunning intended to support reproducible runs.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | presence of live leaderboard and self-contained evaluation pipelines
Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.
Paper presents baseline implementations and experiments spanning heuristic, RL, and LLM-based agents and describes training procedures and architectures used for each baseline category.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | presence and types of baseline agents (heuristic, RL, LLM)
The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).
Paper structure and dataset descriptions specify two tracks, their scopes, and the inclusion of a multi-agent orchestration system for the Speedrunning Track.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | benchmark partitioning (presence of Battling and Speedrunning tracks)
The Battling Track dataset contains more than 20 million recorded battle trajectories.
Paper reports a Battling Track dataset of >20M recorded battle trajectories collected from simulated/match play; size reported explicitly in dataset and methods section.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | number of recorded battle trajectories (>20,000,000)
PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.
Paper describes the design and motivation of the benchmark, detailing two tracks (Battling and Speedrunning) intended to capture partial observability, adversarial/game-theoretic interactions, and long-horizon sequential planning; the benchmark implementation is built on a Pokemon simulator, and the task specifications are described in the paper.
high | positive | The PokeAgent Challenge: Competitive and Long-Context Learni... | benchmark task characteristics (partial observability, game-theoretic complexity...
iDaVIE's modular architecture supports extensibility (planned features include subcube loading, advanced render modes, video scripting, and collaborative VR sessions).
Paper describes modular architecture and lists planned/possible future features; this is a software design claim rather than an empirical result.
high | positive | iDaVIE v1.0: A virtual reality tool for interactive analysis... | software extensibility and planned feature set
Because iDaVIE is open-source and extensible, software licensing costs are low and marginal adoption costs fall over time.
Paper states iDaVIE is open-source and designed for community-driven enhancements; economic claim based on general properties of open-source software rather than empirical cost accounting.
high | positive | iDaVIE v1.0: A virtual reality tool for interactive analysis... | licensing cost implication and marginal adoption costs
iDaVIE includes interaction features such as selection, cropping/subcube tools, catalogue overlays, and export back to existing pipelines.
Feature list in paper describing selection, cropping, overlays, in-VR metrics and export functionality; demonstrated integration to export edited masks/subcubes.
high | positive | iDaVIE v1.0: A virtual reality tool for interactive analysis... | availability and functionality of in-VR interaction and export tools
Streaming and downsampling pipelines implemented as Unity plug-ins make large volumes interactively viewable in VR while preserving needed detail for inspection.
Technical description of custom Unity plug-ins for streaming/downsampling and on-the-fly statistics; tested on HI cubes (telescopes listed) per the paper.
high | positive | iDaVIE v1.0: A virtual reality tool for interactive analysis... | interactive rendering performance and retention of inspection-relevant detail
iDaVIE (v1.0) is a working VR software suite that lets astronomers import, render, inspect, and interactively edit very large 3D data cubes in real time.
Described implementation of iDaVIE v1.0 built on Unity/SteamVR with custom plug-ins for parsing/downsampling and real-time rendering; tested on large 3D spectral (HI) cubes from radio telescopes (MeerKAT, ASKAP, APERTIF) as reported in the paper.
high | positive | iDaVIE v1.0: A virtual reality tool for interactive analysis... | ability to import/render/inspect/edit large 3D data cubes in real time (interact...
Personalized LLM coaching produced a statistically significant increase in alignment with the normative empathic taxonomy relative to both the video-based non-personalized feedback and control arms.
Pre-registered randomized experiment with three arms; pre-registered analysis reported statistically significant differences favoring personalized coaching on the primary alignment outcome.
high | positive | Practicing with Language Models Cultivates Human Empathic Co... | statistical difference in alignment to normative empathic patterns (primary outc...
A brief, personalized coaching intervention delivered by a large language model significantly improves participants' alignment with normative, idiomatic empathic communication patterns.
Pre-registered randomized controlled trial with three arms (personalized LLM coaching, video-based non-personalized feedback, control). Outcome measured as alignment to a data-driven normative taxonomy via coding/automated measures. Overall corpus and sample context: 968 participants, 2,904 conversations, 33,938 messages used in the study.
high | positive | Practicing with Language Models Cultivates Human Empathic Co... | alignment with normative empathic patterns (coding/automated alignment metrics)
HindSight reveals a large, real difference between systems that is missed by LLM-based judging (i.e., HindSight detects the retrieval-augmentation advantage while LLM-judged metrics do not).
Combined empirical results: HindSight shows a 2.5× advantage (p < 0.001) for retrieval augmentation while LLM-as-Judge reports no significant difference (p = 0.584).
high | positive | HindSight: Evaluating LLM-Generated Research Ideas via Futur... | Detection of performance difference between retrieval-augmented and vanilla gene...
Experiments in the paper cover 10 AI/ML research topics and use a 30-month forward evaluation window.
Experimental setup reported in the paper: scope explicitly stated as 10 AI/ML topics and a 30-month forward window after cutoff T.
high | positive | HindSight: Evaluating LLM-Generated Research Ideas via Futur... | Scope parameters (number of topics = 10; forward window length = 30 months)
Generated ideas can be algorithmically compared to future publications and matched items can be assigned scores reflecting downstream impact (citation counts and venue acceptance).
Method section: description of algorithmic matching procedure and scoring rules that use citation counts and venue acceptance as impact proxies.
high | positive | HindSight: Evaluating LLM-Generated Research Ideas via Futur... | Match indicators and downstream-impact scores (citations, venue acceptance) for ...
A retrieval-augmented idea generator produces 2.5× higher-scoring ideas than a vanilla generator according to HindSight (p < 0.001).
Empirical comparison reported in the paper across the specified experiments (10 AI/ML topics, time-split at T, 30-month forward window); statistical test reporting a 2.5× difference with p < 0.001.
high | positive | HindSight: Evaluating LLM-Generated Research Ideas via Futur... | HindSight score (downstream-impact-based score for generated ideas)
HindSight is a time-split, retrospective evaluation that (1) restricts idea generation to pre-cutoff literature (time T), (2) compares generated ideas to papers published in the following 30 months, and (3) scores matches by downstream impact (citation counts and venue acceptance).
Method described in paper: time-split protocol with a temporal cutoff T, a 30-month forward window, algorithmic matching of generated ideas to later publications, and scoring based on downstream impact metrics (citations and venue acceptance).
high | positive | HindSight: Evaluating LLM-Generated Research Ideas via Futur... | HindSight match score computed from matches to later publications weighted by ci...
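The three-step protocol described in this claim can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Paper` record, the token-overlap `similarity` matcher, the 0.5 match threshold, and the `impact` weighting (log-scaled citations plus a venue bonus) are all hypothetical stand-ins for the matching and scoring rules the paper defines.

```python
from dataclasses import dataclass
from datetime import date
import math


@dataclass
class Paper:
    title: str
    published: date
    citations: int
    accepted_at_venue: bool


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months (day clamped to 28 for simplicity)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def similarity(idea: str, title: str) -> float:
    """Hypothetical matcher: Jaccard overlap of lowercase tokens."""
    a, b = set(idea.lower().split()), set(title.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def impact(p: Paper) -> float:
    """Hypothetical impact proxy: log-scaled citations plus a venue bonus."""
    return math.log1p(p.citations) + (1.0 if p.accepted_at_venue else 0.0)


def hindsight_score(ideas, papers, cutoff, window_months=30, threshold=0.5):
    """Average, over ideas, of the best impact among matched post-cutoff papers."""
    end = add_months(cutoff, window_months)
    # Step 2: only papers published inside the forward window are eligible.
    future = [p for p in papers if cutoff < p.published <= end]
    total = 0.0
    for idea in ideas:
        matches = [p for p in future if similarity(idea, p.title) >= threshold]
        # Step 3: a matched idea earns the impact of its best match; others earn 0.
        if matches:
            total += max(impact(p) for p in matches)
    return total / len(ideas) if ideas else 0.0
```

Step 1 (restricting generation to pre-cutoff literature) happens upstream of this function; the sketch only covers matching and scoring.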
The paper introduces a Multi-Object Decoder (MOD) that extends SAM 3D to jointly reconstruct multiple objects from a single image, targeting physically plausible, non-penetrating object configurations and realistic contacts.
Method section: MOD is described as an extension of the single-object SAM 3D architecture to jointly decode multiple object shapes and poses from a monocular image; the method explicitly aims to reduce inter-object penetration and model contacts.
high | positive | MessyKitchens: Contact-rich object-level 3D scene reconstruc... | methodological capability: joint multi-object monocular 3D reconstruction, objec...
LEAFE achieves up to a 14% absolute improvement on Pass@128 versus the strongest baselines.
Empirical result explicitly reported in the paper: maximum observed improvement 'up to +14% Pass@128' in comparisons to baselines on the experimental tasks.
high | positive | Internalizing Agency from Reflective Experience | Pass@128 (absolute percentage point improvement)
Compared with outcome-driven methods (e.g., GRPO) and experience-based baselines (e.g., Early Experience), LEAFE yields consistent gains in Pass@1 and Pass@k under fixed interaction budgets.
Head-to-head experimental comparisons reported between LEAFE and baselines GRPO and Early Experience on the task suite; fixed interaction-budget experimental regime; Pass@1 and Pass@k used as evaluation metrics.
high | positive | Internalizing Agency from Reflective Experience | Pass@1 and Pass@k (fraction of problems solved among k candidate runs)
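For readers unfamiliar with the metric, Pass@k is often computed with the unbiased estimator popularized in the code-generation literature: given n sampled runs of which c succeed, pass@k = 1 - C(n-c, k) / C(n, k). Whether the LEAFE paper uses exactly this estimator is an assumption; the sketch below only illustrates the common definition.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs drawn from n (c correct) succeeds."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failing runs exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

An "absolute improvement" in this metric, as in the Pass@128 claim above, is a difference in percentage points between two such averages.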
LEAFE substantially improves long-horizon agentic performance by internalizing recovery behavior learned from environment feedback.
Reported experiments on a suite of long-horizon interactive tasks (multi-step coding and agentic tasks) comparing LEAFE to baselines; evaluation using Pass@k metrics under fixed interaction budgets; qualitative description that LEAFE internalizes recovery behavior from environment feedback.
high | positive | Internalizing Agency from Reflective Experience | Long-horizon agentic performance measured by Pass@k (Pass@1, Pass@k, Pass@128)
The RL fine-tuned Qwen2.5-Coder-7B improves by 33.1% over the same base 7B model without RL fine-tuning.
Head-to-head comparison between the tuned model and its untuned base across the 48 evaluation briefs; reported improvement of +33.1%.
high | positive | Learning to Present: Inverse Specification Rewards for Agent... | Absolute or relative quality improvement (%) of tuned vs. untuned Qwen2.5-Coder-...
Fine-tuning a parameter-efficient 7B model (Qwen2.5-Coder-7B) via reinforcement learning in an OpenEnv-compatible environment yields near-state-of-the-art automated slide-generation: the tuned 7B model reaches 91.2% of Claude Opus 4.6’s quality.
Empirical evaluation on 48 diverse business briefs comparing six models; reported relative quality score of tuned Qwen2.5-Coder-7B = 91.2% of Claude Opus 4.6.
high | positive | Learning to Present: Inverse Specification Rewards for Agent... | Relative slide-generation quality (percent of Claude Opus 4.6 quality) across 48...
Managing captures, traces, and replay sessions from a unified single design database ensures consistency across replay targets and sessions.
Method description emphasizes a single design database coordinating captures and replays across simulation and emulation for the demonstrator system. (Operational claim demonstrated in the implementation; no metrics on error reduction provided.)
high | positive | ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... | consistency of trace/replay data and configuration across targets
The captured traces can be deterministically replayed across different execution targets (software/hardware simulation and hardware emulation), reducing cross-platform setup complexity and discrepancies.
The same captured waveforms/traces were replayed on both simulation and emulation environments for the ODIN demonstrator; cross-target replay was part of the described method. (Demonstrated on the single reported system; no broad cross-toolchain study provided.)
high | positive | ODIN-Based CPU-GPU Architecture with Replay-Driven Simulatio... | consistency of reproduced behavior across simulator and emulator targets
Using the proportional veto core provides formal protection for minority blocs by giving them proportional blocking power, thus encoding a proportional fairness guarantee compared to simple majoritarian rules.
Definition and properties of the proportional veto core presented in the paper; conceptual discussion comparing veto/proportionality guarantees to majoritarian outcomes.
high | positive | Finding Common Ground in a Sea of Alternatives | existence of proportional blocking power / protection for minority groups as for...
The paper characterizes the information cost of aggregating preferences when AI can generate essentially unlimited candidate alternatives by providing sample-complexity upper bounds together with matching lower bounds.
The combination of sampling-model formalization, sample-complexity upper bounds, and matching lower bounds constitutes a formal characterization of the information (sample) requirements.
high | positive | Finding Common Ground in a Sea of Alternatives | sample/query complexity as the measure of information cost
The authors prove an upper bound on the number of samples/queries required by their algorithm as a function of accuracy, confidence, and problem parameters.
Theoretical analysis in the paper deriving explicit sample-complexity upper bounds (stated as functions of accuracy/confidence and relevant parameters).
high | positive | Finding Common Ground in a Sea of Alternatives | sample/query complexity required for the algorithm to achieve specified accuracy...
Under only query (sampling) access to the unknown joint distribution of voters and alternatives, there is an efficient sampling-based algorithm that, with high probability, returns an alternative in the approximate proportional veto core.
Constructive algorithm and correctness proof in the paper showing the algorithm returns an approximate core alternative with high probability under the sampling access model.
high | positive | Finding Common Ground in a Sea of Alternatives | probability that the algorithm's output lies in the approximate proportional vet...
The paper formalizes the proportional veto core for settings with an infinite alternative space and voters whose preferences are drawn from an unknown distribution.
Formal model and definitions presented in the paper: extension of the proportional veto core to an infinite alternative space and definitions for sampling-appropriate approximate proportional veto core.
high | positive | Finding Common Ground in a Sea of Alternatives | formal definition / existence of an appropriate approximate proportional veto-co...
Temporally grounding model inputs (constraining models to contemporaneous public information at each node) substantially reduces the risk of training-data leakage and hindsight bias.
Study design enforced node-specific contemporaneous evidence constraints for each of the 11 nodes; methodological rationale and comparison to unconstrained settings described as reducing retrospective information contamination.
high | positive | When AI Navigates the Fog of War | presence/absence or reduction of training-data leakage/hindsight bias (procedura...
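The contemporaneous-evidence constraint amounts to a date filter applied before anything reaches the model. A minimal sketch, assuming each piece of evidence carries a publication date; the `Evidence` record and `contemporaneous` helper are hypothetical names, not taken from the study.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Evidence:
    text: str
    published: date


def contemporaneous(evidence, node_date: date):
    """Keep only items that were publicly available on or before the node's date."""
    return [e for e in evidence if e.published <= node_date]


def inputs_per_node(evidence, node_dates):
    """Build one filtered evidence list per decision node, in node order."""
    return [contemporaneous(evidence, d) for d in node_dates]
```

Note that a filter like this only constrains the inputs; it cannot remove knowledge the model already absorbed during training, which is why the claim speaks of reducing rather than eliminating leakage.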
BATQuant significantly outperforms prior post-training quantization (PTQ) methods on MXFP4 microscaling floating-point formats under aggressive quantization.
Comparative experiments against rotation-based PTQ techniques and other existing PTQ baselines on the described multimodal and language tasks; improvements shown in benchmark metrics and recovery percentages in the paper's experimental section.
high | positive | BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... | Task-specific accuracy/quality metrics and percent recovery relative to full-pre...
BATQuant recovers up to 96.43% of full-precision performance under aggressive W4A4KV16 quantization on MLLMs and LLMs.
Empirical evaluation reported in the paper: experiments on multiple multimodal large language models (MLLMs) and standard LLMs using an aggressive W4A4KV16 quantization setup; performance reported as percentage of full-precision performance recovered (specific models, benchmark names, and exact sample sizes not enumerated in the summary).
high | positive | BATQuant: Outlier-resilient MXFP4 Quantization via Learnable... | Percentage of full-precision performance recovered (model quality/accuracy on mu...
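To see why outliers matter for MXFP4, it helps to look at the format itself. In the OCP Microscaling formats, a block of elements (32 in the specification) shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float whose magnitudes are limited to {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The sketch below is a plain round-to-nearest baseline written for illustration; it is not BATQuant, and its tie-breaking is simplified.

```python
import math

# Representable magnitudes of a 4-bit E2M1 float.
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
E2M1_EMAX = 2  # largest power-of-two exponent in E2M1 (4 = 2**2)


def mxfp4_quantize_block(block):
    """Quantize one block with a shared power-of-two scale; return (scale, dequantized)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # The shared scale is set by the largest magnitude in the block.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_E2M1_GRID[-1])  # saturate at 6
        q = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))  # nearest grid point
        out.append(math.copysign(q * scale, x))
    return scale, out


def percent_recovered(quantized_score: float, full_precision_score: float) -> float:
    """Recovery metric of the kind the claim reports: quantized / full precision, in %."""
    return 100.0 * quantized_score / full_precision_score
```

With a block of 31 values of 0.1 and a single outlier of 8.0, the shared scale becomes 2 and every small value rounds to zero; without the outlier the same values survive as 0.09375. That loss is the kind of damage outlier-resilient methods aim to avoid.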
The paper provides concrete, regulation-inspired policy examples (e.g., content prohibition, sensitive data exfiltration) showing how they map into the Policy function.
Worked, illustrative examples included in the paper mapping regulatory constraints to the Policy(agent_id, partial_path, proposed_action, org_state) formalism.
high | positive | Runtime Governance for AI Agents: Policies on Paths | representability of regulation-inspired policies in the formalism (yes/no; examp...
Runtime policy evaluation can intercept, score, log, allow/modify/block actions, and update organizational state as part of an agent's execution loop (reference implementation architecture).
Reference implementation design described in the paper (runtime policy evaluator hooks, logging, enforcement actions); architectural reasoning and pseudo-workflows provided; no production deployment data.
high | positive | Runtime Governance for AI Agents: Policies on Paths | feasibility of integrating runtime policy evaluator into agent loops (architectu...
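The intercept, score, log, enforce, and update cycle described in this claim might look roughly like the following. It is a sketch under assumptions rather than the paper's reference implementation: the class name, the two thresholds, the string-based actions, and the two toy policies are all hypothetical.

```python
from dataclasses import dataclass, field


def no_external_send(agent_id, partial_path, proposed_action, org_state):
    """Toy policy: sending anything externally is always a violation."""
    return 1.0 if proposed_action.startswith("send_external") else 0.0


def caution_after_sensitive_read(agent_id, partial_path, proposed_action, org_state):
    """Toy path-dependent policy: posting after a customer-DB read is suspicious."""
    risky = "read_customer_db" in partial_path and proposed_action.startswith("post")
    return 0.5 if risky else 0.0


@dataclass
class RuntimeEvaluator:
    policies: list
    block_at: float = 0.8   # hypothetical threshold for blocking
    modify_at: float = 0.4  # hypothetical threshold for modifying
    log: list = field(default_factory=list)

    def check(self, agent_id, partial_path, proposed_action, org_state):
        """Score the proposed action, log the decision, and update org state."""
        score = max(
            (p(agent_id, partial_path, proposed_action, org_state) for p in self.policies),
            default=0.0,
        )
        if score >= self.block_at:
            decision = "block"
        elif score >= self.modify_at:
            decision = "modify"
        else:
            decision = "allow"
        self.log.append((agent_id, proposed_action, score, decision))
        org_state["decisions"] = org_state.get("decisions", 0) + 1
        if decision == "block":
            org_state["blocked"] = org_state.get("blocked", 0) + 1
        return decision


def run_agent(agent_id, proposed_actions, evaluator, org_state):
    """Agent loop: every proposed action passes through the evaluator first."""
    path = []
    for action in proposed_actions:
        decision = evaluator.check(agent_id, path, action, org_state)
        if decision == "block":
            continue  # blocked actions never enter the execution path
        if decision == "modify":
            action = f"redacted({action})"
        path.append(action)
    return path
```

Taking the maximum score across policies is one simple aggregation choice; a real deployment could weight or combine policies differently.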
Policies can be formalized as deterministic functions p_violation = Policy(agent_id, partial_path, proposed_action, org_state) that return a probability or score of violation for a proposed next action.
Formal definition and mapping in the paper; worked examples showing how regulatory-style constraints map into this function; no large-scale empirical validation.
high | positive | Runtime Governance for AI Agents: Policies on Paths | expressiveness of policy formalism (ability to represent targeted constraints)
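As an illustration of the signature p_violation = Policy(agent_id, partial_path, proposed_action, org_state), here is a hypothetical sensitive-data-exfiltration policy. The `Action` record, the `org_state` keys, and the specific scores are assumptions made for this sketch; only the four-argument signature and the idea of returning a violation score come from the claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str    # e.g. "read" or "send"
    target: str  # resource read, or destination sent to


def exfiltration_policy(agent_id, partial_path, proposed_action, org_state) -> float:
    """Deterministic violation score in [0, 1] for the proposed next action."""
    sensitive = org_state.get("sensitive_resources", set())
    trusted = org_state.get("trusted_destinations", set())
    # Path dependence: has this execution already read something sensitive?
    touched = any(a.kind == "read" and a.target in sensitive for a in partial_path)
    outbound = proposed_action.kind == "send" and proposed_action.target not in trusted
    if touched and outbound:
        return 1.0  # sensitive read followed by an untrusted send
    if outbound:
        return 0.2  # untrusted send with no sensitive history: mild concern
    return 0.0
```

The same proposed action scores differently depending on the partial path, which is the property that a check on the action alone would miss.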
Effective governance for agentic LLM systems requires treating the execution path as the central object and performing runtime evaluation of proposed next actions given the partial path.
Theoretical argument and formal proposal of runtime policy evaluator that takes (agent_id, partial_path, proposed_action, org_state) and returns a violation probability; reference architecture described; illustrative examples.
high | positive | Runtime Governance for AI Agents: Policies on Paths | governance effectiveness for path-dependent policies (qualitative/coverage)
Multiple off-the-shelf vision-language models (closed-source and open-source) representative of current state-of-the-art architectures were benchmarked.
Paper reports experiments across a mix of closed-source and open-source VLMs; exact model names provided in the released materials.
high | positive | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | models evaluated (variety and representativeness)
Evaluation targets include correctness, consistency, and update efficacy, operationalized via quantitative metrics (accuracy, consistency rates, update success rate).
Methods section describing evaluation metrics and how correctness, consistency, and update efficacy are measured across experiments.
high | positive | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | metrics used: accuracy, consistency rate, update success rate
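The three metrics named here could be operationalized along the following lines. The exact definitions are not given in this summary, so everything below is an assumption: accuracy as normalized exact match, consistency as agreement of a model's answers across variants of the same item (for example text versus image prompts), and update success as accuracy against the updated fact after an update is applied.

```python
def _norm(s: str) -> str:
    """Normalize an answer for comparison: lowercase, collapse whitespace."""
    return " ".join(s.lower().split())


def accuracy(predictions, gold) -> float:
    """Fraction of predictions that exactly match the gold answer after normalizing."""
    if not gold:
        return 0.0
    return sum(_norm(p) == _norm(g) for p, g in zip(predictions, gold)) / len(gold)


def consistency_rate(answers_per_item) -> float:
    """Fraction of items whose answers agree across all variants of that item."""
    if not answers_per_item:
        return 0.0
    consistent = sum(
        len({_norm(a) for a in answers}) == 1 for answers in answers_per_item
    )
    return consistent / len(answers_per_item)


def update_success_rate(post_update_answers, updated_facts) -> float:
    """Fraction of updated items for which the model now returns the new fact."""
    return accuracy(post_update_answers, updated_facts)
```

Consistency is deliberately independent of correctness here: a model can be consistently wrong, which is why the two are reported separately.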
A curated set of time-sensitive factual items (e.g., officeholders, company statuses, recent awards/results) was used to construct the benchmark.
Benchmark composition description listing categories of time-sensitive facts and methodology for curation of items used in experiments.
high | positive | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | composition of benchmark item set
The authors release the V-DyKnow benchmark, code, and evaluation data for community use.
Statement in paper and accompanying release materials indicating benchmark, code, and evaluation data are publicly available.
high | positive | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | availability of benchmark, code, and data
V-DyKnow is a benchmark specifically designed to evaluate time-sensitive factual knowledge in vision-language models across both text and image modalities.
Release and description of the benchmark in the paper: curated set of time-sensitive factual items, paired multimodal stimuli (text + images), input perturbations, and evaluation scripts. Methodological description of benchmark composition and tasks.
high | positive | V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge i... | benchmark existence / capability to evaluate time-sensitive multimodal factual k...
Ethical handling: the study involved sensitive material (self-harm, trauma) and authors applied validation and careful handling consistent with research ethics.
Ethics section and methods describing sensitivity of material and precautions taken in data handling and validation.
high | positive | Characterizing Delusional Spirals through Human-LLM Chat Log... | ethical procedures applied to sensitive data
Selected coded items (for example, suicidal messages) were validated by the authors to increase reliability of certain critical annotations.
Methods section describing validation procedures applied to selected items such as suicidal ideation.
high | positive | Characterizing Delusional Spirals through Human-LLM Chat Log... | validation status of coded items (e.g., number of validated suicidal messages)
The authors developed and applied a manual codebook of 28 behavioral/phenomenological codes (e.g., delusional thinking, suicidal ideation, chatbot sentience claims, romantic interest) across the full corpus.
Method section describing construction of a 28-code inventory and manual coding applied to entire dataset.
high | positive | Characterizing Delusional Spirals through Human-LLM Chat Log... | existence and application of a 28-code annotation scheme
The surrogate-driven inverse-design pipeline transfers to physical hardware: designs produced by the CNN+GA pipeline were realized and validated experimentally.
Two fabricated prototypes implemented the optimized pixelated combiners and GaN HEMT Doherty PAs; measured performance metrics correspond to the designs, demonstrating transfer from surrogate-driven design to hardware.
high | positive | Deep Learning-Driven Black-Box Doherty Power Amplifier with ... | consistency between surrogate-driven design outputs and measured prototype perfo...
Under a 20 MHz 5G-NR-like waveform (9 dB PAPR) with digital predistortion (DPD), each prototype reached average PAE greater than 51% while meeting ACLR ≤ −60.8 dBc.
Realistic waveform testing described: a 20 MHz 5G‑NR-like signal with 9 dB PAPR was applied to the prototypes, DPD was used, and measurements reported average PAE > 51% and ACLR ≤ −60.8 dBc for each prototype.
high | positive | Deep Learning-Driven Black-Box Doherty Power Amplifier with ... | average power-added efficiency (PAE %) and adjacent channel leakage ratio (ACLR,...
Each prototype demonstrated drain efficiency greater than 52% at 9 dB back-off.
Back-off efficiency measurements reported for the fabricated prototypes showing drain efficiency > 52% at 9 dB back-off.
high | positive | Deep Learning-Driven Black-Box Doherty Power Amplifier with ... | drain efficiency at 9 dB back-off (%)
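The efficiency figures in the Doherty amplifier claims rest on standard definitions: drain efficiency is RF output power divided by DC supply power, power-added efficiency (PAE) subtracts the RF input power first, and a back-off of B dB scales output power by 10^(-B/10). The helpers below illustrate these definitions; the function names are hypothetical and the numbers in the usage note are made up for illustration, not measurements from the paper.

```python
def dbm_to_watts(p_dbm: float) -> float:
    """Convert power from dBm to watts (30 dBm = 1 W)."""
    return 10 ** ((p_dbm - 30.0) / 10.0)


def drain_efficiency(p_out_w: float, p_dc_w: float) -> float:
    """Drain efficiency in percent: RF output power over DC supply power."""
    return 100.0 * p_out_w / p_dc_w


def power_added_efficiency(p_out_w: float, p_in_w: float, p_dc_w: float) -> float:
    """PAE in percent: (RF output - RF input) over DC supply power."""
    return 100.0 * (p_out_w - p_in_w) / p_dc_w


def backed_off_power(p_sat_w: float, backoff_db: float) -> float:
    """Output power at a given back-off (in dB) below saturated power."""
    return p_sat_w * 10 ** (-backoff_db / 10.0)
```

For example, an amplifier drawing 10 W of DC power while delivering 5.2 W from a 0.1 W drive would show 52% drain efficiency and 51% PAE. At 9 dB back-off, output power is about 12.6% of its saturated value, which is why holding efficiency high at that operating point is the hard part of Doherty design.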