Evidence (6574 claims)
Adoption
8625 claims
Productivity
7686 claims
Governance
6917 claims
Human-AI Collaboration
6574 claims
Org Design
4189 claims
Innovation
4131 claims
Labor Markets
3588 claims
Skills & Training
2985 claims
Inequality
2066 claims
Evidence Matrix
Claim counts by outcome category and direction of finding.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 761 | 200 | 101 | 904 | 2020 |
| Governance & Regulation | 829 | 400 | 191 | 122 | 1566 |
| Organizational Efficiency | 784 | 193 | 125 | 84 | 1197 |
| Technology Adoption Rate | 637 | 236 | 124 | 97 | 1103 |
| Research Productivity | 431 | 131 | 58 | 340 | 972 |
| Output Quality | 481 | 183 | 59 | 47 | 770 |
| Decision Quality | 332 | 177 | 82 | 49 | 647 |
| Firm Productivity | 439 | 57 | 88 | 20 | 610 |
| AI Safety & Ethics | 218 | 279 | 66 | 33 | 602 |
| Market Structure | 181 | 170 | 123 | 24 | 503 |
| Task Allocation | 214 | 64 | 72 | 33 | 388 |
| Skill Acquisition | 174 | 62 | 62 | 17 | 315 |
| Innovation Output | 204 | 27 | 45 | 18 | 295 |
| Employment Level | 105 | 54 | 108 | 13 | 282 |
| Fiscal & Macroeconomic | 132 | 69 | 43 | 26 | 277 |
| Consumer Welfare | 117 | 63 | 42 | 11 | 233 |
| Firm Revenue | 154 | 48 | 26 | 3 | 231 |
| Task Completion Time | 173 | 31 | 8 | 12 | 225 |
| Inequality Measures | 44 | 123 | 50 | 6 | 223 |
| Worker Satisfaction | 89 | 65 | 22 | 12 | 188 |
| Error Rate | 71 | 92 | 10 | 2 | 175 |
| Regulatory Compliance | 77 | 69 | 14 | 5 | 165 |
| Automation Exposure | 58 | 56 | 26 | 13 | 156 |
| Training Effectiveness | 96 | 21 | 14 | 19 | 152 |
| Wages & Compensation | 77 | 37 | 25 | 6 | 145 |
| Team Performance | 86 | 17 | 27 | 10 | 141 |
| Developer Productivity | 95 | 17 | 14 | 6 | 133 |
| Job Displacement | 12 | 81 | 21 | 1 | 115 |
| Hiring & Recruitment | 52 | 7 | 8 | 3 | 70 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 47 | 6 | 1 | 59 |
| Social Protection | 28 | 16 | 8 | 2 | 54 |
| Labor Share of Income | 17 | 19 | 17 | — | 53 |
| Worker Turnover | 11 | 12 | — | 3 | 26 |
| Industry | — | — | — | 1 | 1 |
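The matrix can be read more easily as shares than as raw counts. The sketch below is illustrative only: it copies three rows from the table above and divides each direction by the row's Total. The function name `direction_shares` is an assumption, not part of the dashboard. In several rows the four direction columns do not sum to Total (for example, Firm Productivity sums to 604 against a Total of 610), so shares computed this way may not add up to 100%.

```python
# Three rows copied from the Evidence Matrix: (positive, negative, mixed, null, total)
ROWS = {
    "Firm Productivity": (439, 57, 88, 20, 610),
    "AI Safety & Ethics": (218, 279, 66, 33, 602),
    "Job Displacement": (12, 81, 21, 1, 115),
}


def direction_shares(positive, negative, mixed, null, total):
    """Return each direction's share of the row total, in percent (1 decimal)."""
    counts = {
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
    }
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


for outcome, row in ROWS.items():
    print(outcome, direction_shares(*row))
```

Read this way, Job Displacement is dominated by negative findings, while Firm Productivity is dominated by positive ones.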
Human-AI Collaboration
Given the mixed outcomes (some quality improvements alongside newly introduced lint and security issues), AI-driven development workflows warrant stronger tool-in-the-loop quality and security gating.
Interpretation/recommendation based on observed mix of improvements and introduced issues from the empirical results (PyQu, Pylint, Bandit analyses) and high merge rates.
73.5% of the analyzed PRs are merged (developer acceptance is high).
Empirical measurement of PR outcomes (merged vs. not merged) in the AIDev dataset of Python refactoring PRs.
Usability is the quality attribute that improves most frequently, improving in 36.5% of the studied changes.
PyQu-based before-and-after analysis of quality attributes on Python refactoring PRs from the AIDev dataset; reported frequency for the 'usability' attribute.
Agentic commits improve a quality attribute in 22.5% of the studied changes.
Empirical analysis of Python refactoring pull requests from the AIDev dataset using PyQu (an ML-based Python quality assessment tool) to compare quality attributes before and after each change.
The proposed taxonomy advances understanding and provides a structured framework for studying emerging human–algorithmic supervisory arrangements in organizations.
Authors' asserted contribution based on literature synthesis and their taxonomy derived from analysis of 14 real-world settings; intended to guide future research.
We demonstrate the taxonomy’s applicability through three ACoS examples.
Authors state they applied the taxonomy to three examples (case applications) to show applicability; abstract reports N=3 examples.
We identify two meta-dimensions, control collaboration and control enactment, and six dimensions that enable researchers to categorize and compare ACoS across organizations.
Taxonomy derived from the authors' analysis (14 real-world settings) and literature synthesis; specific dimensions enumerated in paper (as summarized in abstract).
Building on prior literature and an analysis of 14 real-world ACoS settings, we propose a taxonomy that conceptualizes the phenomenon.
Method stated in abstract: literature review plus qualitative/empirical analysis of 14 real-world ACoS settings; taxonomy presented as an output.
Organizations increasingly weave algorithmic systems into control processes.
Statement supported by prior literature review and the paper's motivating statements (no specific empirical trend data reported in abstract).
Current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution.
Qualitative and/or quantitative evaluation results in paper indicating strengths in spatial grounding, multimodal alignment, and coordinated action execution.
We develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding.
Methodological contribution described in paper: parser implementation that converts recordings and logs into structured GUI action trajectories.
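As a rough illustration of what turning low-level interaction logs into structured actions can look like, the sketch below collapses raw mouse and keyboard events into click, drag, and type actions. It is not the paper's parser: the event schema (`type`, `pos`, `key`), the function name `parse_events`, and the grouping rules are all assumptions made for this example, and it ignores the screen-recording side entirely.

```python
def parse_events(events):
    """Collapse low-level events into higher-level GUI actions.

    events: list of dicts with a 'type' key, plus 'pos' for mouse events
    or 'key' for keyboard events (a hypothetical schema).
    """
    actions = []
    typed = []  # buffer of consecutive key presses

    def flush_typed():
        # Emit buffered key presses as a single "type" action.
        if typed:
            actions.append({"action": "type", "text": "".join(typed)})
            typed.clear()

    i = 0
    while i < len(events):
        ev = events[i]
        if ev["type"] == "key":
            typed.append(ev["key"])
            i += 1
            continue
        flush_typed()
        nxt = events[i + 1] if i + 1 < len(events) else None
        if ev["type"] == "mouse_down" and nxt and nxt["type"] == "mouse_up":
            # Same position means a click; a different position means a drag.
            if nxt["pos"] == ev["pos"]:
                actions.append({"action": "click", "start": ev["pos"]})
            else:
                actions.append(
                    {"action": "drag", "start": ev["pos"], "end": nxt["pos"]}
                )
            i += 2
        else:
            i += 1  # unpaired or unknown event: skipped in this sketch
    flush_typed()
    return actions
```

A real parser would also need timestamps and a way to ground each action in the recorded frames, which this sketch leaves out.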
The tasks involve dense multimodal interfaces and tightly coupled interaction sequences.
Task descriptions and dataset characteristics in paper stating tasks are complex, long-horizon, multimodal, and tightly coupled.
We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows.
Dataset construction reported in paper: curated expert demonstrations spanning 7 applications and 186 tasks (numbers provided in text).
We introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments.
Paper describes the creation of the Cutverse benchmark as a central contribution (design and implementation described in methods).
GUI agents have made significant progress in web navigation and basic operating system tasks.
Background claim stated in paper referencing prior work on GUI agents applied to web navigation and OS tasks (no specific experiments in this paper to support it).
The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), enabling sustained operation beyond full-context limits.
Reported stress test / capability demonstration in paper: profile size stated as 14,000+ facts and 125k tokens stored and managed by the system.
The Dual Process system maintains 70-85% accuracy with 1-2 second latency while using 62% fewer tokens (45,434 vs 120,000+ limit) compared to full-context approaches.
Reported empirical results from the large-scale evaluation (1,440 queries / 15,000 messages) comparing Dual Process to full-context models; exact accuracy, latency, and token-count figures provided in the paper.
The Dual Process Memory Architecture decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message).
System design description and measured consolidation growth rate reported in the paper; empirical observation of growth rate stated.
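The decoupling described above can be pictured with a toy model: a fixed-size window for recent messages and a separate store that only grows. This is a minimal sketch under stated assumptions, not the paper's implementation. The class `DualProcessMemory`, the `extract_facts` callback, and `token_reduction` are hypothetical names. The arithmetic at the end uses only figures quoted in the claims: 45,434 tokens against a 120,000-token limit works out to about 62% fewer tokens, and 15,000 messages at roughly 3 tokens per message gives about 45,000 tokens. Whether the 45,434 figure corresponds to that 15,000-message run is not stated in the summary.

```python
from collections import deque


class DualProcessMemory:
    """Toy model: a constant episodic window plus a growing consolidated store."""

    def __init__(self, window_size=10):
        # deque with maxlen drops the oldest message automatically
        self.episodic = deque(maxlen=window_size)
        self.consolidated = []  # long-term facts; grows over time

    def add_message(self, message, extract_facts):
        # extract_facts is a hypothetical callback returning a list of facts
        self.episodic.append(message)
        self.consolidated.extend(extract_facts(message))

    def context(self):
        return {"recent": list(self.episodic), "facts": list(self.consolidated)}


def token_reduction(used, baseline):
    """Fraction of tokens saved relative to a baseline budget."""
    return 1 - used / baseline


print(round(100 * token_reduction(45_434, 120_000), 1))  # 62.1
print(3 * 15_000)  # 45000 tokens at ~3 tokens/message over 15,000 messages
```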
Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.
Empirical experiments reported in the paper (on unspecified real-world and synthetic tabular datasets) comparing SPN to PFN-style tabular foundation models and classical tabular methods; the abstract claims consistent improvements but does not report sample sizes, dataset names, or quantitative effect sizes.
SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution.
Description of SPN's mechanism in the paper (methodological detail). Presented as the approach used to approximate strategic post-manipulation inputs and align predictions; no quantitative details or sample sizes in the abstract.
We propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining.
Methodological contribution described in the paper: SPN is introduced as an inference-time framework that modifies behavior without retraining. This is a description of the proposed method rather than quantified empirical evidence; no sample sizes reported in the abstract.
Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers.
Statement in the paper situating PFN-style tabular foundation models as having strong generalization in prior work and noting their design assumption of non-strategic, classifier-independent data distributions; no dataset/sample sizes provided in the abstract.
Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.
Conclusion drawn from experimental findings that cleanliness materially influenced agent operational metrics (tokens and revisits) even when pass rates were unchanged.
Traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents.
Interpretation based on experimental results showing token and navigational efficiency gains on cleaner code (7–8% fewer tokens, 34% fewer revisitations) despite unchanged pass rates.
Agents working on cleaner code reduce file revisitations by 34%.
Empirical measurement across the same experimental trials comparing agent file-revisitation counts between clean and messy repo variants; reported 34% reduction in file revisitations on cleaner code.
Agents working on cleaner code use 7 to 8% fewer tokens.
Empirical measurement across trials (660 trials with Claude Code) comparing token consumption between clean and messy repository variants; reported decrease of 7-8% in tokens when working on cleaner code.
We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface.
Reported experimental design: 33 authored tasks spanning six repository pairs; evaluation used hidden tests executed at the application's public surface.
The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one.
Method description: authors constructed pairs bidirectionally using agent pipelines that modify repositories to create matched clean/messy variants.
We introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity.
Methodological description in paper: construction of paired repositories controlling for architecture, dependencies, and external behaviour while varying static-analysis violations and cognitive complexity.
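A minimal-pair design lends itself to a paired comparison: the same task is run on the messy and the clean variant, and the cost metric is compared within each pair. The sketch below shows one way to compute a relative reduction. The function names are assumptions, averaging per-pair reductions is an assumed aggregation (the summary does not say how the paper aggregates), and the numbers are placeholders chosen for illustration, not data from the study.

```python
from statistics import mean


def relative_reduction(messy, clean):
    """Fractional reduction of a cost metric from the messy to the clean variant."""
    return (messy - clean) / messy


def paired_summary(trials):
    """Mean per-pair reduction.

    trials: list of (messy_value, clean_value) tuples for matched runs.
    """
    return mean(relative_reduction(m, c) for m, c in trials)


# Placeholder values only, NOT measurements from the paper.
placeholder_trials = [(100, 66), (50, 33), (200, 132)]
print(f"{paired_summary(placeholder_trials):.0%}")  # 34%
```

The same function applies to token counts or file revisitations, since both are costs where lower is better.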
A simple prompt checklist can improve LLM responses while reducing unnecessary interaction.
Authors' interpretation/conclusion drawn from the experimental comparisons and rubric scores reported in the paper's results.
Checklist prompts produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts.
Reported comparative statement in the results that checklist prompts used fewer average tokens and produced a better quality-effort tradeoff (no token counts, sample size, or statistical tests reported in the abstract).
Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts.
Reported mean rubric scores for each prompt condition in the paper's results (no sample sizes or significance tests provided in the abstract).
The authors open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.
Explicit statement and provided GitHub URL in the paper excerpt.
Multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks.
Reported experiments comparing multi-task search versus independent per-problem optimization under equal per-problem budget; observed cross-task transfer benefits and that benefits increase with more related tasks.
Ablations across three domains reveal that actionable side information yields substantially higher final scores than score-only feedback.
Ablation studies across three domains (the same ones that support the faster-convergence claim); the paper reports higher final optimization scores with actionable side information than with score-only feedback.
Ablations across three domains reveal that actionable side information yields faster convergence than score-only feedback.
Paper reports ablation studies in three domains comparing optimization with actionable side information versus score-only feedback and finds faster convergence with side information.
The system outperforms AlphaEvolve's reported circle packing solution (n=26).
Direct comparison with AlphaEvolve's reported circle packing solution. In this problem n=26 denotes the number of circles to be packed, not a sample size, so the notation does not imply evaluation over 26 instances or trials.
The system generates CUDA kernels where 87% match or beat PyTorch.
Reported evaluation of generated CUDA kernels against PyTorch implementations; paper states 87% of generated kernels match or outperform PyTorch.
The system finds scheduling algorithms that cut cloud costs by 40%.
Paper reports that its discovered scheduling algorithms reduce cloud costs by 40%; presumably measured by evaluating cost of scheduled workloads before/after optimization.
The system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%).
Reported comparison to Gemini Flash on the ARC-AGI benchmark with explicit accuracy numbers (32.5% baseline to 89.5% after optimization). Method: discovered agent architectures via LLM-based search; benchmark evaluation on ARC-AGI.
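The "nearly triple" wording can be checked directly from the two reported accuracies. The helper below is a small worked calculation, with `improvement` as a hypothetical name; it uses only the 32.5% and 89.5% figures quoted above.

```python
def improvement(baseline, optimized):
    """Return (ratio to baseline, absolute gain in percentage points)."""
    return optimized / baseline, optimized - baseline


ratio, gain = improvement(32.5, 89.5)
print(f"{ratio:.2f}x baseline, +{gain:.1f} percentage points")
# 2.75x baseline, +57.0 percentage points
```

A ratio of about 2.75 is consistent with "nearly triple", and the absolute gain is 57 percentage points.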
A single AI-based optimization system achieves state-of-the-art results across six diverse tasks.
Paper reports experiments applying a single LLM-based optimization system to six diverse tasks and claims SOTA results across them; no further per-task details provided in the excerpt.
The framework extends platform capitalism theory to professional service contexts.
Theoretical contribution claimed in the paper, integrating platform capitalism literature with sociology of professions and critical information science.
Resistance requires collective organising, alternative infrastructure development, and recognition that current AI implementations conflict with core professional values.
Normative conclusion drawn from the paper's critical qualitative analysis and theoretical framing; prescriptive recommendations rather than empirical measurement.
The paper points to vendor monopolies, citing an 84% market share among ARL member institutions at peak concentration.
Market concentration data synthesized in the paper (reported peak share among ARL member institutions).
The intervention significantly improved AI advice by reducing the direct mirroring of incorrect user rankings.
In a controlled experiment (n=60) with pre/post prompting training, the authors report a statistically significant improvement in AI advice after training, characterized by reduced direct mirroring of participants' incorrect rankings.
We introduce the concept [of twin agents], distinguish it from digital twins, and outline the research questions this new class of agent demands.
Stated contribution of the paper (conceptual development and research agenda); content claim about what the paper contains rather than an empirical finding.
Cognitive forcing functions and related frameworks address overreliance effectively in contexts where there is a clear boundary between the AI and the human decision-maker.
Claim based on literature and frameworks cited or discussed by the authors (asserted effectiveness in boundary-defined contexts); the abstract does not provide empirical evaluation details or sample sizes.
The next role on that list is more personal: you — digital twins of each individual (twin agents) representing their knowledge, perspective, and communicative style to colleagues when they are unavailable.
Proposed argument supported by the authors' early design work in an ongoing project; conceptual proposal rather than reported empirical validation in the abstract.
Agentic AI has taken on the role of assistant, collaborator, and decision-support tool.
Asserted in the paper's framing/introduction; based on synthesis of prior work and the authors' characterization of current agentic-AI deployments (no empirical sample or quantitative data reported in the abstract).
The evaluation harness records full trajectories and computes auditable partial-credit rewards.
System description in the paper specifying that the evaluation harness captures full action trajectories and implements an auditable partial-credit reward computation.
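One way to picture an auditable partial-credit reward is as a weighted fraction of passed checkpoints, returned together with the per-checkpoint record that produced it. The sketch below is an assumption about how such a harness could be structured, not a description of the paper's harness: the `Checkpoint` and `Episode` classes and the weighting scheme are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    name: str
    weight: float
    passed: bool


@dataclass
class Episode:
    actions: list = field(default_factory=list)      # full trajectory, in order
    checkpoints: list = field(default_factory=list)  # graded sub-goals

    def record(self, action):
        """Append one action to the trajectory log."""
        self.actions.append(action)

    def reward(self):
        """Return (score in [0, 1], audit trail of per-checkpoint outcomes)."""
        total = sum(c.weight for c in self.checkpoints)
        earned = sum(c.weight for c in self.checkpoints if c.passed)
        audit = [(c.name, c.weight, c.passed) for c in self.checkpoints]
        score = earned / total if total else 0.0
        return score, audit
```

Returning the audit list alongside the score means a reviewer can recompute the reward from the checkpoint outcomes without rerunning the episode.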