The Commonplace

Evidence (6574 claims)

Adoption: 8625 claims
Productivity: 7686 claims
Governance: 6917 claims
Human-AI Collaboration: 6574 claims
Org Design: 4189 claims
Innovation: 4131 claims
Labor Markets: 3588 claims
Skills & Training: 2985 claims
Inequality: 2066 claims

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome Positive Negative Mixed Null Total
Other 761 200 101 904 2020
Governance & Regulation 829 400 191 122 1566
Organizational Efficiency 784 193 125 84 1197
Technology Adoption Rate 637 236 124 97 1103
Research Productivity 431 131 58 340 972
Output Quality 481 183 59 47 770
Decision Quality 332 177 82 49 647
Firm Productivity 439 57 88 20 610
AI Safety & Ethics 218 279 66 33 602
Market Structure 181 170 123 24 503
Task Allocation 214 64 72 33 388
Skill Acquisition 174 62 62 17 315
Innovation Output 204 27 45 18 295
Employment Level 105 54 108 13 282
Fiscal & Macroeconomic 132 69 43 26 277
Consumer Welfare 117 63 42 11 233
Firm Revenue 154 48 26 3 231
Task Completion Time 173 31 8 12 225
Inequality Measures 44 123 50 6 223
Worker Satisfaction 89 65 22 12 188
Error Rate 71 92 10 2 175
Regulatory Compliance 77 69 14 5 165
Automation Exposure 58 56 26 13 156
Training Effectiveness 96 21 14 19 152
Wages & Compensation 77 37 25 6 145
Team Performance 86 17 27 10 141
Developer Productivity 95 17 14 6 133
Job Displacement 12 81 21 1 115
Hiring & Recruitment 52 7 8 3 70
Creative Output 32 20 8 3 64
Skill Obsolescence 5 47 6 1 59
Social Protection 28 16 8 2 54
Labor Share of Income 17 19 17 53
Worker Turnover 11 12 3 26
Industry 1 1
Filter: Human-AI Collaboration
Given the mixed outcomes (some improvements, some new lint and security issues), AI-driven development workflows warrant stronger tool-in-the-loop quality and security gating.
Interpretation/recommendation based on observed mix of improvements and introduced issues from the empirical results (PyQu, Pylint, Bandit analyses) and high merge rates.
high positive Quality and Security Signals in AI-Generated Python Refactor... policy/process recommendation (quality/security gating)
73.5% of the analyzed PRs are merged (developer acceptance is high).
Empirical measurement of PR outcomes (merged vs. not merged) in the AIDev dataset of Python refactoring PRs.
high positive Quality and Security Signals in AI-Generated Python Refactor... PR merge rate (acceptance)
Usability is the quality attribute that improves most frequently, improving in 36.5% of the studied changes.
PyQu-based before-and-after analysis of quality attributes on Python refactoring PRs from the AIDev dataset; reported frequency for the 'usability' attribute.
high positive Quality and Security Signals in AI-Generated Python Refactor... usability (one of PyQu's quality attributes)
Agentic commits improve a quality attribute in 22.5% of the studied changes.
Empirical analysis of Python refactoring pull requests from the AIDev dataset using PyQu (an ML-based Python quality assessment tool) to compare quality attributes before and after each change.
high positive Quality and Security Signals in AI-Generated Python Refactor... improvement in any measured code quality attribute (per change)
The proposed taxonomy advances understanding and provides a structured framework for studying emerging human–algorithmic supervisory arrangements in organizations.
Authors' asserted contribution based on literature synthesis and their taxonomy derived from analysis of 14 real-world settings; intended to guide future research.
high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation
We demonstrate the taxonomy’s applicability through three ACoS examples.
Authors state they applied the taxonomy to three examples (case applications) to show applicability; abstract reports N=3 examples.
high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation
We identify two meta-dimensions, control collaboration and control enactment, and six dimensions that enable researchers to categorize and compare ACoS across organizations.
Taxonomy derived from the authors' analysis (14 real-world settings) and literature synthesis; specific dimensions enumerated in paper (as summarized in abstract).
high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation
Building on prior literature and an analysis of 14 real-world ACoS settings, we propose a taxonomy that conceptualizes the phenomenon.
Method stated in abstract: literature review plus qualitative/empirical analysis of 14 real-world ACoS settings; taxonomy presented as an output.
high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation
Organizations increasingly weave algorithmic systems into control processes.
Statement supported by prior literature review and the paper's motivating statements (no specific empirical trend data reported in abstract).
high positive A Taxonomy Of Algorithmic Co-Supervision adoption_rate
Current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution.
Qualitative and/or quantitative evaluation results in paper indicating strengths in spatial grounding, multimodal alignment, and coordinated action execution.
high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... spatial grounding, multimodal alignment, coordinated action execution
We develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding.
Methodological contribution described in paper: parser implementation that converts recordings and logs into structured GUI action trajectories.
high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... ability to produce structured, grounded GUI action trajectories from recordings/...
The tasks involve dense multimodal interfaces and tightly coupled interaction sequences.
Task descriptions and dataset characteristics in paper stating tasks are complex, long-horizon, multimodal, and tightly coupled.
high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... interface complexity and interaction coupling in tasks
We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows.
Dataset construction reported in paper: curated expert demonstrations spanning 7 applications and 186 tasks (numbers provided in text).
high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... size and scope of demonstration dataset (number of applications and tasks)
We introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments.
Paper describes the creation of the CutVerse benchmark as a central contribution (design and implementation described in methods).
high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... existence and design of a benchmark for GUI agents in media post-production
GUI agents have made significant progress in web navigation and basic operating system tasks.
Background claim stated in paper referencing prior work on GUI agents applied to web navigation and OS tasks (no specific experiments in this paper to support it).
high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... capability progress on web navigation and OS tasks
The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), enabling sustained operation beyond full-context limits.
Reported stress test / capability demonstration in paper: profile size stated as 14,000+ facts and 125k tokens stored and managed by the system.
high positive Episodic-Semantic Memory Architecture for Long-Horizon Scien... number of scientific facts and token footprint the system can manage (profile ca...
The Dual Process system maintains 70-85% accuracy with 1-2 second latency while using 62% fewer tokens (45,434 vs 120,000+ limit) compared to full-context approaches.
Reported empirical results from the large-scale evaluation (1,440 queries / 15,000 messages) comparing Dual Process to full-context models; exact accuracy, latency, and token-count figures provided in the paper.
high positive Episodic-Semantic Memory Architecture for Long-Horizon Scien... accuracy; latency (seconds); token usage
The Dual Process Memory Architecture decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message).
System design description and measured consolidation growth rate reported in the paper; empirical observation of growth rate stated.
high positive Episodic-Semantic Memory Architecture for Long-Horizon Scien... episodic window size; long-term memory growth rate (tokens/message)
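As a rough illustration of the decoupling described in this entry, the sketch below models a fixed-size episodic window alongside a consolidated store whose size grows linearly with the number of messages. The class name `DualProcessMemorySketch`, its methods, and the strictly linear growth model are assumptions made for illustration and are not the paper's implementation; only the 10-message window and the rate of approximately 3 tokens/message come from the entry.

```python
from collections import deque


class DualProcessMemorySketch:
    """Hypothetical model: constant episodic window plus a linearly growing store."""

    def __init__(self, window_size: int = 10, tokens_per_message: float = 3.0):
        # deque(maxlen=...) drops the oldest item once full,
        # so the episodic window never exceeds window_size messages.
        self.episodic = deque(maxlen=window_size)
        self.tokens_per_message = tokens_per_message
        self.messages_seen = 0

    def add(self, message: str) -> None:
        """Record a message in the episodic window and count it for consolidation."""
        self.episodic.append(message)
        self.messages_seen += 1

    def consolidated_tokens(self) -> float:
        # Assumed linear growth; the source only reports an approximate rate.
        return self.messages_seen * self.tokens_per_message


memory = DualProcessMemorySketch()
for i in range(15_000):
    memory.add(f"message {i}")

window_len = len(memory.episodic)
store_tokens = memory.consolidated_tokens()
print(window_len, store_tokens)
```

Under these assumptions, 15,000 messages leave 10 messages in the window and roughly 45,000 tokens in the consolidated store, which is in the same range as the 45,434-token figure listed for this paper's evaluation. Treating the two numbers as the same quantity is an inference, not something the entries state.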
Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.
Empirical experiments reported in the paper (on unspecified real-world and synthetic tabular datasets) comparing SPN to PFN-style tabular foundation models and classical tabular methods; the abstract claims consistent improvements but does not report sample sizes, dataset names, or quantitative effect sizes.
high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... robustness and predictive performance under strategic manipulation
SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution.
Description of SPN's mechanism in the paper (methodological detail). Presented as the approach used to approximate strategic post-manipulation inputs and align predictions; no quantitative details or sample sizes in the abstract.
high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... alignment of PFN predictions with induced strategic distribution
We propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining.
Methodological contribution described in the paper: SPN is introduced as an inference-time framework that modifies behavior without retraining. This is a description of the proposed method rather than quantified empirical evidence; no sample sizes reported in the abstract.
high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... ability to adapt PFN-style models to strategic environments at inference time (n...
Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers.
Statement in the paper situating PFN-style tabular foundation models as having strong generalization in prior work and noting their design assumption of non-strategic, classifier-independent data distributions; no dataset/sample sizes provided in the abstract.
high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... generalization performance of PFN-style tabular foundation models on non-strateg...
Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.
Conclusion drawn from experimental findings that cleanliness materially influenced agent operational metrics (tokens and revisits) even when pass rates were unchanged.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... factors materially affecting agent behaviour (operational footprint/navigation)
Traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents.
Interpretation based on experimental results showing token and navigational efficiency gains on cleaner code (7–8% fewer tokens, 34% fewer revisitations) despite unchanged pass rates.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... relevance of maintainability principles to agent computational cost and navigati...
Agents working on cleaner code reduce file revisitations by 34%.
Empirical measurement across the same experimental trials comparing agent file-revisitation counts between clean and messy repo variants; reported 34% reduction in file revisitations on cleaner code.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... file revisitations (number of times agents revisit files)
Agents working on cleaner code use 7 to 8% fewer tokens.
Empirical measurement across trials (660 trials with Claude Code) comparing token consumption between clean and messy repository variants; reported decrease of 7-8% in tokens when working on cleaner code.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... token usage (number of tokens consumed by agent pipelines)
We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface.
Reported experimental design: 33 authored tasks spanning six repository pairs; evaluation used hidden tests executed at the application's public surface.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... number of tasks and pairs used in evaluation
The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one.
Method description: authors constructed pairs bidirectionally using agent pipelines that modify repositories to create matched clean/messy variants.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... directional construction of repository pairs (degrade or clean)
We introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity.
Methodological description in paper: construction of paired repositories controlling for architecture, dependencies, and external behaviour while varying static-analysis violations and cognitive complexity.
high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... evaluation protocol (minimal-pair control of repository cleanliness)
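A minimal-pair comparison of this kind can be summarized by computing, for each matched pair, how much lower a metric is on the clean variant than on the messy one. The sketch below is one way to do that, assuming a per-pair relative reduction averaged across pairs; the pair names and all numbers are placeholders chosen for easy arithmetic, not data from the paper, and the paper may aggregate differently.

```python
from statistics import mean

# Placeholder measurements for illustration only (not results from the paper).
# Each record holds one metric (e.g. tokens used) for both variants of a pair.
pairs = [
    {"pair": "pair_a", "clean": 80.0, "messy": 100.0},
    {"pair": "pair_b", "clean": 150.0, "messy": 200.0},
]


def relative_reduction(clean: float, messy: float) -> float:
    """Fraction by which the clean variant's metric is below the messy variant's."""
    if messy == 0:
        raise ValueError("messy baseline must be non-zero")
    return (messy - clean) / messy


# Compare within each pair first, then average, so that large repositories
# do not dominate the summary.
reductions = {p["pair"]: relative_reduction(p["clean"], p["messy"]) for p in pairs}
mean_reduction = mean(reductions.values())

for name, value in reductions.items():
    print(f"{name}: {value:.1%} lower on the clean variant")
print(f"mean reduction across pairs: {mean_reduction:.1%}")
```

Computing the reduction within each pair keeps the comparison controlled: because the two variants share architecture, dependencies, and external behaviour, the remaining difference can be attributed to cleanliness rather than to differences between repositories.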
A simple prompt checklist can improve LLM responses while reducing unnecessary interaction.
Authors' interpretation/conclusion drawn from the experimental comparisons and rubric scores reported in the paper's results.
high positive Less Back-and-Forth: A Comparative Study of Structured Promp... output_quality and user_interaction
Checklist prompts produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts.
Reported comparative statement in the results that checklist prompts used fewer average tokens and produced a better quality-effort tradeoff (no token counts, sample size, or statistical tests reported in the abstract).
high positive Less Back-and-Forth: A Comparative Study of Structured Promp... average_tokens_used (user effort) and output_quality
Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts.
Reported mean rubric scores for each prompt condition in the paper's results (no sample sizes or significance tests provided in the abstract).
high positive Less Back-and-Forth: A Comparative Study of Structured Promp... rubric_score (task completion / correctness / compliance / clarity)
The authors open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
Explicit statement and provided GitHub URL in the paper excerpt.
high positive optimize_anything: A Universal API for Optimizing any Text P... availability of open-source code / tooling
Multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks.
Reported experiments comparing multi-task search versus independent per-problem optimization under equal per-problem budget; observed cross-task transfer benefits and that benefits increase with more related tasks.
high positive optimize_anything: A Universal API for Optimizing any Text P... optimization performance (e.g., score) under multi-task vs independent optimizat...
Ablations across three domains reveal that actionable side information yields substantially higher final scores than score-only feedback.
Ablation studies across three domains comparing optimization with actionable side information versus score-only feedback; the paper reports higher final optimization scores when side information is used.
Ablations across three domains reveal that actionable side information yields faster convergence than score-only feedback.
Paper reports ablation studies in three domains comparing optimization with actionable side information versus score-only feedback and finds faster convergence with side information.
high positive optimize_anything: A Universal API for Optimizing any Text P... convergence speed (time or iterations to converge)
The system outperforms AlphaEvolve's reported circle packing solution (n=26).
Direct comparison reported against AlphaEvolve's circle packing solution. The notation n=26 most likely denotes the number of circles in the packing instance rather than a sample size, so it should not be read as 26 instances or trials.
high positive optimize_anything: A Universal API for Optimizing any Text P... circle packing solution quality (optimization objective)
The system generates CUDA kernels where 87% match or beat PyTorch.
Reported evaluation of generated CUDA kernels against PyTorch implementations; paper states 87% of generated kernels match or outperform PyTorch.
high positive optimize_anything: A Universal API for Optimizing any Text P... proportion of generated CUDA kernels that match or beat PyTorch performance
The system finds scheduling algorithms that cut cloud costs by 40%.
Paper reports that its discovered scheduling algorithms reduce cloud costs by 40%; presumably measured by evaluating cost of scheduled workloads before/after optimization.
high positive optimize_anything: A Universal API for Optimizing any Text P... cloud cost (monetary cost) reduction
The system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%).
Reported comparison to Gemini Flash on the ARC-AGI benchmark with explicit accuracy numbers (32.5% baseline to 89.5% after optimization). Method: discovered agent architectures via LLM-based search; benchmark evaluation on ARC-AGI.
A single AI-based optimization system achieves state-of-the-art results across six diverse tasks.
Paper reports experiments applying a single LLM-based optimization system to six diverse tasks and claims SOTA results across them; no further per-task details provided in the excerpt.
high positive optimize_anything: A Universal API for Optimizing any Text P... task performance / state-of-the-art accuracy across six tasks
The framework extends platform capitalism theory to professional service contexts.
Theoretical contribution claimed in the paper, integrating platform capitalism literature with sociology of professions and critical information science.
high positive Operating the franchise: vendor consolidation, algorithmic m... theoretical extension / conceptual contribution
Resistance requires collective organising, alternative infrastructure development, and recognition that current AI implementations conflict with core professional values.
Normative conclusion drawn from the paper's critical qualitative analysis and theoretical framing; prescriptive recommendations rather than empirical measurement.
high positive Operating the franchise: vendor consolidation, algorithmic m... policy and collective action recommendations for professional resistance and alt...
Vendor monopolies reached an 84% market share among ARL member institutions at peak concentration.
Market concentration data synthesized in the paper (reported peak share among ARL member institutions).
high positive Operating the franchise: vendor consolidation, algorithmic m... market share of vendor(s) among ARL member institutions
The intervention significantly improved AI advice by reducing the direct mirroring of incorrect user rankings.
In the same controlled experiment (n=60) with pre/post prompting training, authors report a statistically significant improvement in AI advice after training, characterized by reduced direct mirroring of participants' incorrect rankings.
high positive The Hidden Cost of Contextual Sycophancy: an AI Literacy Int... degree of mirroring in AI advice / AI advice quality
We introduce the concept [of twin agents], distinguish it from digital twins, and outline the research questions this new class of agent demands.
Stated contribution of the paper (conceptual development and research agenda); content claim about what the paper contains rather than an empirical finding.
high positive From Role to Person: Trust Calibration Challenges in Twin Ag... conceptual_contribution_and_research_agenda
Cognitive forcing functions and related frameworks address overreliance effectively in contexts where there is a clear boundary between the AI and the human decision-maker.
Claim based on literature and frameworks cited or discussed by the authors (asserted effectiveness in boundary-defined contexts); the abstract does not provide empirical evaluation details or sample sizes.
The next role on that list is more personal: you — digital twins of each individual (twin agents) representing their knowledge, perspective, and communicative style to colleagues when they are unavailable.
Proposed argument supported by the authors' early design work in an ongoing project; conceptual proposal rather than reported empirical validation in the abstract.
high positive From Role to Person: Trust Calibration Challenges in Twin Ag... representation_of_individual_knowledge_perspective_style
Agentic AI has taken on the role of assistant, collaborator, and decision-support tool.
Asserted in the paper's framing/introduction; based on synthesis of prior work and the authors' characterization of current agentic-AI deployments (no empirical sample or quantitative data reported in the abstract).
high positive From Role to Person: Trust Calibration Challenges in Twin Ag... adoption_of_agentic_roles
The evaluation harness records full trajectories and computes auditable partial-credit rewards.
System description in the paper specifying that the evaluation harness captures full action trajectories and implements an auditable partial-credit reward computation.
high positive OpenComputer: Verifiable Software Worlds for Computer-Use Ag... availability of full trajectories and partial-credit reward computation (qualita...
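To make the idea of an auditable partial-credit reward concrete, the sketch below records a trajectory of steps and scores the final state against a set of named checks, keeping the per-check results as an audit record. Everything here is hypothetical: the `Trajectory` class, the `partial_credit` function, the example checks, and the equal-weight scoring rule are assumptions for illustration, since the entry does not describe how the paper's harness computes rewards.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Trajectory:
    """Hypothetical container for the full sequence of agent steps."""

    steps: List[Dict[str, str]] = field(default_factory=list)

    def record(self, action: str, observation: str) -> None:
        self.steps.append({"action": action, "observation": observation})


def partial_credit(
    final_state: Dict[str, object],
    checks: Dict[str, Callable[[Dict[str, object]], bool]],
) -> Tuple[float, Dict[str, bool]]:
    """Return (reward, audit), where reward is the fraction of checks passed.

    Equal weighting is an assumption; a real harness may weight checks.
    """
    audit = {name: bool(check(final_state)) for name, check in checks.items()}
    reward = sum(audit.values()) / len(audit) if audit else 0.0
    return reward, audit


trajectory = Trajectory()
trajectory.record("open_document", "document opened")
trajectory.record("save_document", "document saved")

final_state = {"file_saved": True, "title": "Draft"}
checks = {
    "saved": lambda s: s.get("file_saved") is True,
    "title_set": lambda s: s.get("title") == "Report",
}

reward, audit = partial_credit(final_state, checks)
print(len(trajectory.steps), reward, audit)
```

Returning the per-check dictionary alongside the scalar reward is what makes the score auditable in this sketch: a reviewer can see which conditions passed or failed instead of only the final number.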