Evidence (6574 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	761	200	101	904	2020
Governance & Regulation	829	400	191	122	1566
Organizational Efficiency	784	193	125	84	1197
Technology Adoption Rate	637	236	124	97	1103
Research Productivity	431	131	58	340	972
Output Quality	481	183	59	47	770
Decision Quality	332	177	82	49	647
Firm Productivity	439	57	88	20	610
AI Safety & Ethics	218	279	66	33	602
Market Structure	181	170	123	24	503
Task Allocation	214	64	72	33	388
Skill Acquisition	174	62	62	17	315
Innovation Output	204	27	45	18	295
Employment Level	105	54	108	13	282
Fiscal & Macroeconomic	132	69	43	26	277
Consumer Welfare	117	63	42	11	233
Firm Revenue	154	48	26	3	231
Task Completion Time	173	31	8	12	225
Inequality Measures	44	123	50	6	223
Worker Satisfaction	89	65	22	12	188
Error Rate	71	92	10	2	175
Regulatory Compliance	77	69	14	5	165
Automation Exposure	58	56	26	13	156
Training Effectiveness	96	21	14	19	152
Wages & Compensation	77	37	25	6	145
Team Performance	86	17	27	10	141
Developer Productivity	95	17	14	6	133
Job Displacement	12	81	21	1	115
Hiring & Recruitment	52	7	8	3	70
Creative Output	32	20	8	3	64
Skill Obsolescence	5	47	6	1	59
Social Protection	28	16	8	2	54
Labor Share of Income	17	19	17	—	53
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1
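
The matrix can be read programmatically. The following is a minimal sketch, not part of the source page: it copies three rows from the table, treats the dash placeholder as zero claims (an assumption), and reports the share of positive claims for each outcome. It also reports how many claims in the Total column are not covered by the four direction columns. For "Other" the four columns sum to 1,966 against a Total of 2,020; the page does not explain the gap, so the sketch only surfaces it. The names `parse_count` and `summarize` are hypothetical.

```python
# Three rows copied from the matrix above (tab-separated); \u2014 is the dash placeholder.
ROWS = """\
Other\t761\t200\t101\t904\t2020
Job Displacement\t12\t81\t21\t1\t115
Labor Share of Income\t17\t19\t17\t\u2014\t53
"""


def parse_count(cell: str) -> int:
    """Convert a table cell to an int; a dash or empty cell counts as zero (assumption)."""
    cell = cell.strip()
    return 0 if cell in {"\u2014", "-", ""} else int(cell)


def summarize(line: str) -> dict:
    """Return the positive share and the count not covered by the four direction columns."""
    name, *cells = line.split("\t")
    pos, neg, mixed, null, total = (parse_count(c) for c in cells)
    directional = pos + neg + mixed + null
    return {
        "outcome": name,
        "positive_share": pos / total,
        "unlisted": total - directional,
    }


summaries = [summarize(line) for line in ROWS.splitlines() if line]
for s in summaries:
    print(f"{s['outcome']}: {s['positive_share']:.1%} positive, "
          f"{s['unlisted']} claims outside the four columns")
```

The same function applies unchanged to the remaining rows if they are pasted into `ROWS`.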

Filter: Human-AI Collab

Given the mixed outcomes (some improvements, some new lint/security issues), stronger tool-in-the-loop quality and security gating is motivated for AI-driven development workflows.

Interpretation/recommendation based on observed mix of improvements and introduced issues from the empirical results (PyQu, Pylint, Bandit analyses) and high merge rates.

high positive Quality and Security Signals in AI-Generated Python Refactor... policy/process recommendation (quality/security gating)

73.5% of the analyzed PRs are merged (developer acceptance is high).

Empirical measurement of PR outcomes (merged vs. not merged) in the AIDev dataset of Python refactoring PRs.

high positive Quality and Security Signals in AI-Generated Python Refactor... PR merge rate (acceptance)

Usability is the quality attribute that improves most frequently, improving in 36.5% of the studied changes.

PyQu-based before-and-after analysis of quality attributes on Python refactoring PRs from the AIDev dataset; reported frequency for the 'usability' attribute.

high positive Quality and Security Signals in AI-Generated Python Refactor... usability (one of PyQu's quality attributes)

Agentic commits improve a quality attribute in 22.5% of the studied changes.

Empirical analysis of Python refactoring pull requests from the AIDev dataset using PyQu (an ML-based Python quality assessment tool) to compare quality attributes before and after each change.

high positive Quality and Security Signals in AI-Generated Python Refactor... improvement in any measured code quality attribute (per change)

The proposed taxonomy advances understanding and provides a structured framework for studying emerging human–algorithmic supervisory arrangements in organizations.

Authors' asserted contribution based on literature synthesis and their taxonomy derived from analysis of 14 real-world settings; intended to guide future research.

high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation

We demonstrate the taxonomy’s applicability through three ACoS examples.

Authors state they applied the taxonomy to three examples (case applications) to show applicability; abstract reports N=3 examples.

high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation

We identify two meta-dimensions, control collaboration and control enactment, and six dimensions that enable researchers to categorize and compare ACoS across organizations.

Taxonomy derived from the authors' analysis (14 real-world settings) and literature synthesis; specific dimensions enumerated in paper (as summarized in abstract).

high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation

Building on prior literature and an analysis of 14 real-world ACoS settings, we propose a taxonomy that conceptualizes the phenomenon.

Method stated in abstract: literature review plus qualitative/empirical analysis of 14 real-world ACoS settings; taxonomy presented as an output.

high positive A Taxonomy Of Algorithmic Co-Supervision governance_and_regulation

Organizations increasingly weave algorithmic systems into control processes.

Statement supported by prior literature review and the paper's motivating statements (no specific empirical trend data reported in abstract).

high positive A Taxonomy Of Algorithmic Co-Supervision adoption_rate

Current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution.

Qualitative and/or quantitative evaluation results in paper indicating strengths in spatial grounding, multimodal alignment, and coordinated action execution.

high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... spatial grounding, multimodal alignment, coordinated action execution

We develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding.

Methodological contribution described in paper: parser implementation that converts recordings and logs into structured GUI action trajectories.

high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... ability to produce structured, grounded GUI action trajectories from recordings/...

The tasks involve dense multimodal interfaces and tightly coupled interaction sequences.

Task descriptions and dataset characteristics in paper stating tasks are complex, long-horizon, multimodal, and tightly coupled.

high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... interface complexity and interaction coupling in tasks

We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows.

Dataset construction reported in paper: curated expert demonstrations spanning 7 applications and 186 tasks (numbers provided in text).

high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... size and scope of demonstration dataset (number of applications and tasks)

We introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments.

Paper describes the creation of the CutVerse benchmark as a central contribution (design and implementation described in methods).

high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... existence and design of a benchmark for GUI agents in media post-production

GUI agents have made significant progress in web navigation and basic operating system tasks.

Background claim stated in paper referencing prior work on GUI agents applied to web navigation and OS tasks (no specific experiments in this paper to support it).

high positive CutVerse: A Compositional GUI Agents Benchmark for Media Pos... capability progress on web navigation and OS tasks

The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), enabling sustained operation beyond full-context limits.

Reported stress test / capability demonstration in paper: profile size stated as 14,000+ facts and 125k tokens stored and managed by the system.

high positive Episodic-Semantic Memory Architecture for Long-Horizon Scien... number of scientific facts and token footprint the system can manage (profile ca...

The Dual Process system maintains 70-85% accuracy with 1-2 second latency while using 62% fewer tokens (45,434 vs 120,000+ limit) compared to full-context approaches.

Reported empirical results from the large-scale evaluation (1,440 queries / 15,000 messages) comparing Dual Process to full-context models; exact accuracy, latency, and token-count figures provided in the paper.

high positive Episodic-Semantic Memory Architecture for Long-Horizon Scien... accuracy; latency (seconds); token usage
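
The 62% figure can be checked from the two token counts quoted in the claim. This is only an arithmetic sketch: it treats the "120,000+" limit as exactly 120,000, which is an assumption, so the result is a lower bound on the reduction.

```python
used_tokens = 45_434          # Dual Process token usage, as quoted in the claim
full_context_limit = 120_000  # the claim says "120,000+", so this is a floor (assumption)

# Fraction of tokens saved relative to the full-context limit
reduction = 1 - used_tokens / full_context_limit
print(f"{reduction:.1%}")  # prints 62.1%
```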

The Dual Process Memory Architecture decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message).

System design description and measured consolidation growth rate reported in the paper; empirical observation of growth rate stated.

high positive Episodic-Semantic Memory Architecture for Long-Horizon Scien... episodic window size; long-term memory growth rate (tokens/message)
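
The decoupling described in this claim can be illustrated with a toy model. This is a sketch under stated assumptions, not the paper's implementation: the class name `DualProcessMemory`, the `FACT:` prefix convention, and the consolidation rule are all hypothetical, and the sketch does not try to reproduce the reported growth rate of about 3 tokens per message. It only shows a fixed 10-message episodic window sitting alongside a long-term store that grows slowly.

```python
from collections import deque
from typing import List, Optional


class DualProcessMemory:
    """Toy model: constant-size episodic window plus a growing semantic store."""

    def __init__(self, window_size: int = 10) -> None:
        # deque with maxlen drops the oldest message automatically once full
        self.episodic = deque(maxlen=window_size)
        # consolidated long-term knowledge; grows only when a fact is extracted
        self.semantic: List[str] = []

    @staticmethod
    def consolidate(message: str) -> Optional[str]:
        """Placeholder rule (assumption): keep only messages flagged with 'FACT:'."""
        prefix = "FACT:"
        if message.startswith(prefix):
            return message[len(prefix):].strip()
        return None

    def add_message(self, message: str) -> None:
        self.episodic.append(message)
        fact = self.consolidate(message)
        if fact is not None:
            self.semantic.append(fact)

    def context(self) -> List[str]:
        """Context handed to a model: long-term facts first, then recent messages."""
        return self.semantic + list(self.episodic)


memory = DualProcessMemory()
for i in range(25):
    memory.add_message(f"FACT: sample {i} melted" if i % 5 == 0 else f"chat {i}")

print(len(memory.episodic), len(memory.semantic))  # prints 10 5
```

After 25 messages the episodic window still holds 10 entries, while the semantic store has grown only by the five flagged facts.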

Experiments on real-world and synthetic tabular datasets show that SPN consistently improves robustness and predictive performance under strategic manipulation compared with both tabular foundation models and classical tabular methods.

Empirical experiments reported in the paper (on unspecified real-world and synthetic tabular datasets) comparing SPN to PFN-style tabular foundation models and classical tabular methods; the abstract claims consistent improvements but does not report sample sizes, dataset names, or quantitative effect sizes.

high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... robustness and predictive performance under strategic manipulation

SPN constructs strategic in-context examples to approximate post-manipulation inputs and aligns PFN predictions with the induced strategic distribution.

Description of SPN's mechanism in the paper (methodological detail). Presented as the approach used to approximate strategic post-manipulation inputs and align predictions; no quantitative details or sample sizes in the abstract.

high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... alignment of PFN predictions with induced strategic distribution

We propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining.

Methodological contribution described in the paper: SPN is introduced as an inference-time framework that modifies behavior without retraining. This is a description of the proposed method rather than quantified empirical evidence; no sample sizes reported in the abstract.

high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... ability to adapt PFN-style models to strategic environments at inference time (n...

Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers.

Statement in the paper situating PFN-style tabular foundation models as having strong generalization in prior work and noting their design assumption of non-strategic, classifier-independent data distributions; no dataset/sample sizes provided in the abstract.

high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... generalization performance of PFN-style tabular foundation models on non-strateg...

Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

Conclusion drawn from experimental findings that cleanliness materially influenced agent operational metrics (tokens and revisits) even when pass rates were unchanged.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... factors materially affecting agent behaviour (operational footprint/navigation)

Traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents.

Interpretation based on experimental results showing token and navigational efficiency gains on cleaner code (7–8% fewer tokens, 34% fewer revisitations) despite unchanged pass rates.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... relevance of maintainability principles to agent computational cost and navigati...

Agents working on cleaner code reduce file revisitations by 34%.

Empirical measurement across the experimental trials comparing agent file-revisitation counts between clean and messy repository variants; reported 34% reduction in file revisitations on cleaner code.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... file revisitations (number of times agents revisit files)

Agents working on cleaner code use 7 to 8% fewer tokens.

Empirical measurement across trials (660 trials with Claude Code) comparing token consumption between clean and messy repository variants; reported decrease of 7-8% in tokens when working on cleaner code.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... token usage (number of tokens consumed by agent pipelines)

We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface.

Reported experimental design: 33 authored tasks spanning six repository pairs; evaluation used hidden tests executed at the application's public surface.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... number of tasks and pairs used in evaluation

The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one.

Method description: authors constructed pairs bidirectionally using agent pipelines that modify repositories to create matched clean/messy variants.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... directional construction of repository pairs (degrade or clean)

We introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity.

Methodological description in paper: construction of paired repositories controlling for architecture, dependencies, and external behaviour while varying static-analysis violations and cognitive complexity.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... evaluation protocol (minimal-pair control of repository cleanliness)

A simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

Authors' interpretation/conclusion drawn from the experimental comparisons and rubric scores reported in the paper's results.

high positive Less Back-and-Forth: A Comparative Study of Structured Promp... output_quality and user_interaction

Checklist prompts produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts.

Reported comparative statement in the results that checklist prompts used fewer average tokens and produced a better quality-effort tradeoff (no token counts, sample size, or statistical tests reported in the abstract).

high positive Less Back-and-Forth: A Comparative Study of Structured Promp... average_tokens_used (user effort) and output_quality

Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts.

Reported mean rubric scores for each prompt condition in the paper's results (no sample sizes or significance tests provided in the abstract).

high positive Less Back-and-Forth: A Comparative Study of Structured Promp... rubric_score (task completion / correctness / compliance / clarity)

The authors open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.

Explicit statement and provided GitHub URL in the paper excerpt.

high positive optimize_anything: A Universal API for Optimizing any Text P... availability of open-source code / tooling

Multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks.

Reported experiments comparing multi-task search versus independent per-problem optimization under equal per-problem budget; observed cross-task transfer benefits and that benefits increase with more related tasks.

high positive optimize_anything: A Universal API for Optimizing any Text P... optimization performance (e.g., score) under multi-task vs independent optimizat...

Ablations across three domains reveal that actionable side information yields substantially higher final scores than score-only feedback.

Ablation studies across three domains comparing actionable side information with score-only feedback; reported higher final optimization scores with actionable side information.

high positive optimize_anything: A Universal API for Optimizing any Text P... final optimization score

Ablations across three domains reveal that actionable side information yields faster convergence than score-only feedback.

Paper reports ablation studies in three domains comparing optimization with actionable side information versus score-only feedback and finds faster convergence with side information.

high positive optimize_anything: A Universal API for Optimizing any Text P... convergence speed (time or iterations to converge)

The system outperforms AlphaEvolve's reported circle packing solution (n=26).

Direct comparison to AlphaEvolve's reported circle packing solution. In that problem n=26 is the number of circles to be packed, not a sample size, so the comparison concerns a single problem instance.

high positive optimize_anything: A Universal API for Optimizing any Text P... circle packing solution quality (optimization objective)

The system generates CUDA kernels where 87% match or beat PyTorch.

Reported evaluation of generated CUDA kernels against PyTorch implementations; paper states 87% of generated kernels match or outperform PyTorch.

high positive optimize_anything: A Universal API for Optimizing any Text P... proportion of generated CUDA kernels that match or beat PyTorch performance

The system finds scheduling algorithms that cut cloud costs by 40%.

Paper reports that its discovered scheduling algorithms reduce cloud costs by 40%; presumably measured by evaluating cost of scheduled workloads before/after optimization.

high positive optimize_anything: A Universal API for Optimizing any Text P... cloud cost (monetary cost) reduction

The system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%).

Reported comparison to Gemini Flash on the ARC-AGI benchmark with explicit accuracy numbers (32.5% baseline to 89.5% after optimization). Method: discovered agent architectures via LLM-based search; benchmark evaluation on ARC-AGI.

high positive optimize_anything: A Universal API for Optimizing any Text P... ARC-AGI accuracy
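
The "nearly triple" wording can be checked against the two accuracy figures quoted in the claim. This is a small arithmetic sketch using only those numbers.

```python
baseline_accuracy = 32.5   # Gemini Flash on ARC-AGI, percent, as quoted
optimized_accuracy = 89.5  # with the discovered agent architecture, percent, as quoted

ratio = optimized_accuracy / baseline_accuracy
gain = optimized_accuracy - baseline_accuracy
print(f"{ratio:.2f}x, +{gain:.1f} points")  # prints 2.75x, +57.0 points
```

A 2.75x ratio is somewhat short of a full tripling, which would require 97.5%.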

A single AI-based optimization system achieves state-of-the-art results across six diverse tasks.

Paper reports experiments applying a single LLM-based optimization system to six diverse tasks and claims SOTA results across them; no further per-task details provided in the excerpt.

high positive optimize_anything: A Universal API for Optimizing any Text P... task performance / state-of-the-art accuracy across six tasks

The framework extends platform capitalism theory to professional service contexts.

Theoretical contribution claimed in the paper, integrating platform capitalism literature with sociology of professions and critical information science.

high positive Operating the franchise: vendor consolidation, algorithmic m... theoretical extension / conceptual contribution

Resistance requires collective organising, alternative infrastructure development, and recognition that current AI implementations conflict with core professional values.

Normative conclusion drawn from the paper's critical qualitative analysis and theoretical framing; prescriptive recommendations rather than empirical measurement.

high positive Operating the franchise: vendor consolidation, algorithmic m... policy and collective action recommendations for professional resistance and alt...

Vendor monopolies held an 84% market share among ARL member institutions at peak concentration.

Market concentration data synthesized in the paper (reported peak share among ARL member institutions).

high positive Operating the franchise: vendor consolidation, algorithmic m... market share of vendor(s) among ARL member institutions

The intervention significantly improved AI advice by reducing the direct mirroring of incorrect user rankings.

In a controlled experiment (n=60) with pre/post prompting training, authors report a statistically significant improvement in AI advice after training, characterized by reduced direct mirroring of participants' incorrect rankings.

high positive The Hidden Cost of Contextual Sycophancy: an AI Literacy Int... degree of mirroring in AI advice / AI advice quality

We introduce the concept [of twin agents], distinguish it from digital twins, and outline the research questions this new class of agent demands.

Stated contribution of the paper (conceptual development and research agenda); content claim about what the paper contains rather than an empirical finding.

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... conceptual_contribution_and_research_agenda

Cognitive forcing functions and related frameworks address overreliance effectively in contexts where there is a clear boundary between the AI and the human decision-maker.

Claim based on literature and frameworks cited or discussed by the authors (asserted effectiveness in boundary-defined contexts); the abstract does not provide empirical evaluation details or sample sizes.

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... overreliance_reduction

The next role on that list is more personal: you — digital twins of each individual (twin agents) representing their knowledge, perspective, and communicative style to colleagues when they are unavailable.

Proposed argument supported by the authors' early design work in an ongoing project; conceptual proposal rather than reported empirical validation in the abstract.

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... representation_of_individual_knowledge_perspective_style

Agentic AI has taken on the role of assistant, collaborator, and decision-support tool.

Asserted in the paper's framing/introduction; based on synthesis of prior work and the authors' characterization of current agentic-AI deployments (no empirical sample or quantitative data reported in the abstract).

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... adoption_of_agentic_roles

The evaluation harness records full trajectories and computes auditable partial-credit rewards.

System description in the paper specifying that the evaluation harness captures full action trajectories and implements an auditable partial-credit reward computation.

high positive OpenComputer: Verifiable Software Worlds for Computer-Use Ag... availability of full trajectories and partial-credit reward computation (qualita...
