Evidence (13827 claims)

Evidence Matrix

Claim counts by outcome category and direction of finding.

Outcome	Positive	Negative	Mixed	Null	Total
Other	749	195	97	889	1979
Governance & Regulation	815	391	188	121	1539
Organizational Efficiency	771	189	124	83	1177
Technology Adoption Rate	624	233	123	96	1084
Research Productivity	410	121	56	331	929
Output Quality	466	177	59	47	749
Decision Quality	320	174	75	42	618
Firm Productivity	435	55	88	20	604
AI Safety & Ethics	214	276	65	33	593
Market Structure	178	166	122	24	495
Task Allocation	206	64	70	31	376
Skill Acquisition	165	57	60	17	299
Innovation Output	201	27	41	18	288
Employment Level	105	51	107	13	278
Fiscal & Macroeconomic	131	69	43	26	276
Consumer Welfare	116	63	42	11	232
Firm Revenue	149	46	26	3	224
Inequality Measures	44	122	49	6	221
Task Completion Time	169	29	8	12	219
Worker Satisfaction	89	61	20	12	182
Error Rate	69	91	10	2	172
Regulatory Compliance	76	68	14	5	163
Training Effectiveness	92	19	13	19	145
Wages & Compensation	77	36	25	6	144
Automation Exposure	51	54	22	12	142
Team Performance	86	17	27	9	140
Developer Productivity	94	17	14	6	132
Job Displacement	12	80	20	1	113
Hiring & Recruitment	51	7	8	3	69
Skill Obsolescence	5	45	6	1	57
Creative Output	31	16	7	2	57
Social Protection	27	16	8	2	53
Labor Share of Income	17	17	17	—	51
Worker Turnover	11	12	—	3	26
Industry	—	—	—	1	1

We propose Strategic Prior-data Fitted Network (SPN), an inference-time strategy-aware framework that adapts tabular foundation models to strategic environments without retraining.

Methodological contribution described in the paper: SPN is introduced as an inference-time framework that modifies behavior without retraining. This is a description of the proposed method rather than quantified empirical evidence; no sample sizes reported in the abstract.

high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... ability to adapt PFN-style models to strategic environments at inference time (n...

Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks, but they are typically designed for non-strategic settings where data distributions are independent of deployed classifiers.

Statement in the paper situating PFN-style tabular foundation models as having strong generalization in prior work and noting their design assumption of non-strategic, classifier-independent data distributions; no dataset/sample sizes provided in the abstract.

high positive When Tabular Foundation Models Meet Strategic Tabular Data: ... generalization performance of PFN-style tabular foundation models on non-strateg...

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task pass rates by an average of 16.2 percentage points across diverse domains.

Authors cite prior Skills benchmarks / aggregated reports (benchmark summary referenced in paper); average improvement reported as 16.2 percentage points across tasks in those benchmarks (implied sample of tasks from the referenced benchmark).

high positive When Skills Don't Help: A Negative Result on Procedural Know... task pass rate (task success rate)

Software products and software R&D contributed 50 percent of the 1.2 percentage point acceleration in nonfarm business labor productivity (2017–2024 relative to 2012–2017).

Empirical decomposition comparing productivity growth rates across periods (2017–2024 vs 2012–2017) in the paper; the authors attribute half of the observed 1.2 percentage point acceleration to software products and software R&D.

high positive AI as an Innovation in the Method of Innovation: Implication... acceleration (difference) in nonfarm business labor productivity growth between ...

Software products and software R&D contributed 50 percent of the 2 percent average growth rate in nonfarm business labor productivity from 2017 to 2024.

Empirical decomposition of nonfarm business labor productivity growth in the United States for the period 2017–2024 reported in the paper (the authors attribute shares of the observed 2% average growth to components including software products and software R&D).

high positive AI as an Innovation in the Method of Innovation: Implication... average growth rate in nonfarm business labor productivity (2017–2024)

AI is already materially affecting official productivity measures in the United States.

Empirical decomposition of U.S. productivity data reported in the paper that attributes portions of measured productivity growth to software-related channels linked to AI.

high positive AI as an Innovation in the Method of Innovation: Implication... official productivity measures (U.S. nonfarm business labor productivity)

Using a framework that separates upstream innovation from downstream production suggests that AI boosts both upstream total factor productivity and intangible capital use downstream.

Model/framework decomposition in the paper (theoretical separation of upstream vs downstream, combined with empirical application to productivity data); the paper reports results consistent with increases in upstream TFP and downstream intangible capital use.

high positive AI as an Innovation in the Method of Innovation: Implication... upstream total factor productivity and downstream intangible capital use

Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

Conclusion drawn from experimental findings that cleanliness materially influenced agent operational metrics (tokens and revisits) even when pass rates were unchanged.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... factors materially affecting agent behaviour (operational footprint/navigation)

Traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents.

Interpretation based on experimental results showing token and navigational efficiency gains on cleaner code (7–8% fewer tokens, 34% fewer revisitations) despite unchanged pass rates.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... relevance of maintainability principles to agent computational cost and navigati...

Agents working on cleaner code reduce file revisitations by 34%.

Empirical measurement across the same experimental trials comparing agent file-revisitation counts between clean and messy repo variants; reported 34% reduction in file revisitations on cleaner code.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... file revisitations (number of times agents revisit files)

Agents working on cleaner code use 7 to 8% fewer tokens.

Empirical measurement across trials (660 trials with Claude Code) comparing token consumption between clean and messy repository variants; reported decrease of 7-8% in tokens when working on cleaner code.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... token usage (number of tokens consumed by agent pipelines)

We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface.

Reported experimental design: 33 authored tasks spanning six repository pairs; evaluation used hidden tests executed at the application's public surface.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... number of tasks and pairs used in evaluation

The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one.

Method description: authors constructed pairs bidirectionally using agent pipelines that modify repositories to create matched clean/messy variants.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... directional construction of repository pairs (degrade or clean)

We introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity.

Methodological description in paper: construction of paired repositories controlling for architecture, dependencies, and external behaviour while varying static-analysis violations and cognitive complexity.

high positive Does Code Cleanliness Affect Coding Agents? A Controlled Min... evaluation protocol (minimal-pair control of repository cleanliness)

A simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

Authors' interpretation/conclusion drawn from the experimental comparisons and rubric scores reported in the paper's results.

high positive Less Back-and-Forth: A Comparative Study of Structured Promp... output_quality and user_interaction

Checklist prompts produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts.

Reported comparative statement in the results that checklist prompts used fewer average tokens and produced a better quality-effort tradeoff (no token counts, sample size, or statistical tests reported in the abstract).

high positive Less Back-and-Forth: A Comparative Study of Structured Promp... average_tokens_used (user effort) and output_quality

Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts.

Reported mean rubric scores for each prompt condition in the paper's results (no sample sizes or significance tests provided in the abstract).

high positive Less Back-and-Forth: A Comparative Study of Structured Promp... rubric_score (task completion / correctness / compliance / clarity)

The authors open-source optimize_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa.

Explicit statement and provided GitHub URL in the paper excerpt.

high positive optimize_anything: A Universal API for Optimizing any Text P... availability of open-source code / tooling

Multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks.

Reported experiments comparing multi-task search versus independent per-problem optimization under equal per-problem budget; observed cross-task transfer benefits and that benefits increase with more related tasks.

high positive optimize_anything: A Universal API for Optimizing any Text P... optimization performance (e.g., score) under multi-task vs independent optimizat...

Ablations across three domains reveal that actionable side information yields substantially higher final scores than score-only feedback.

Same ablation studies across three domains as above; reported higher final optimization scores when using actionable side information compared to only score feedback.

high positive optimize_anything: A Universal API for Optimizing any Text P... final optimization score

Ablations across three domains reveal that actionable side information yields faster convergence than score-only feedback.

Paper reports ablation studies in three domains comparing optimization with actionable side information versus score-only feedback and finds faster convergence with side information.

high positive optimize_anything: A Universal API for Optimizing any Text P... convergence speed (time or iterations to converge)

The system outperforms AlphaEvolve's reported circle packing solution (n=26).

Direct comparison reported to AlphaEvolve's circle packing solution with sample size notation n=26 provided in the excerpt; implies evaluation over 26 instances or trials.

high positive optimize_anything: A Universal API for Optimizing any Text P... circle packing solution quality (optimization objective)

The system generates CUDA kernels where 87% match or beat PyTorch.

Reported evaluation of generated CUDA kernels against PyTorch implementations; paper states 87% of generated kernels match or outperform PyTorch.

high positive optimize_anything: A Universal API for Optimizing any Text P... proportion of generated CUDA kernels that match or beat PyTorch performance

The system finds scheduling algorithms that cut cloud costs by 40%.

Paper reports that its discovered scheduling algorithms reduce cloud costs by 40%; presumably measured by evaluating cost of scheduled workloads before/after optimization.

high positive optimize_anything: A Universal API for Optimizing any Text P... cloud cost (monetary cost) reduction

The system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%).

Reported comparison to Gemini Flash on the ARC-AGI benchmark with explicit accuracy numbers (32.5% baseline to 89.5% after optimization). Method: discovered agent architectures via LLM-based search; benchmark evaluation on ARC-AGI.

high positive optimize_anything: A Universal API for Optimizing any Text P... ARC-AGI accuracy

A single AI-based optimization system achieves state-of-the-art results across six diverse tasks.

Paper reports experiments applying a single LLM-based optimization system to six diverse tasks and claims SOTA results across them; no further per-task details provided in the excerpt.

high positive optimize_anything: A Universal API for Optimizing any Text P... task performance / state-of-the-art accuracy across six tasks

The framework extends platform capitalism theory to professional service contexts.

Theoretical contribution claimed in the paper, integrating platform capitalism literature with sociology of professions and critical information science.

high positive Operating the franchise: vendor consolidation, algorithmic m... theoretical extension / conceptual contribution

Resistance requires collective organising, alternative infrastructure development, and recognition that current AI implementations conflict with core professional values.

Normative conclusion drawn from the paper's critical qualitative analysis and theoretical framing; prescriptive recommendations rather than empirical measurement.

high positive Operating the franchise: vendor consolidation, algorithmic m... policy and collective action recommendations for professional resistance and alt...

Vendor monopolies (84% ARL member institutions market share at peak concentration).

Market concentration data synthesized in the paper (reported peak share among ARL member institutions).

high positive Operating the franchise: vendor consolidation, algorithmic m... market share of vendor(s) among ARL member institutions

In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI.

Reported quantitative results from large-scale online deployment on Taobao (real-world A/B/test deployment; exact sample size, duration, and statistical significance are not stated in the excerpt).

high positive Generative Auto-Bidding with Unified Modeling and Exploratio... ad GMV; ad clicks; ad cost; ad ROI

Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios.

Aggregate claim summarizing experimental comparisons versus state-of-the-art baselines across the reported evaluations (public datasets, simulations, and live deployment). Specific baselines, metrics, and statistical details are not provided in the excerpt.

high positive Generative Auto-Bidding with Unified Modeling and Exploratio... performance relative to state-of-the-art baselines

GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions, a Q-value module to guide DT exploration via regularization constraints, and an Inverse Dynamics Module (IDM) that leverages DT-predicted future states to infer robust behaviorally consistent actions as a safe policy fallback.

Detailed methodological description of GUIDE components and their intended roles in the paper (architectural claim).

high positive Generative Auto-Bidding with Unified Modeling and Exploratio... architectural design: DT + Q-value regularization + IDM fallback

We propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism.

Methodological contribution described in the paper (design/proposal of the GUIDE framework).

high positive Generative Auto-Bidding with Unified Modeling and Exploratio... integration of directed exploration and safe fallback in an auto-bidding framewo...

The intervention significantly improved AI advice by reducing the direct mirroring of incorrect user rankings.

In the same controlled experiment (n=60) with pre/post prompting training, authors report a statistically significant improvement in AI advice after training, characterized by reduced direct mirroring of participants' incorrect rankings.

high positive The Hidden Cost of Contextual Sycophancy: an AI Literacy Int... degree of mirroring in AI advice / AI advice quality

We introduce the concept [of twin agents], distinguish it from digital twins, and outline the research questions this new class of agent demands.

Stated contribution of the paper (conceptual development and research agenda); content claim about what the paper contains rather than an empirical finding.

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... conceptual_contribution_and_research_agenda

Cognitive forcing functions and related frameworks address overreliance effectively in contexts where there is a clear boundary between the AI and the human decision-maker.

Claim based on literature and frameworks cited or discussed by the authors (asserted effectiveness in boundary-defined contexts); the abstract does not provide empirical evaluation details or sample sizes.

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... overreliance_reduction

The next role on that list is more personal: you — digital twins of each individual (twin agents) representing their knowledge, perspective, and communicative style to colleagues when they are unavailable.

Proposed argument supported by the authors' early design work in an ongoing project; conceptual proposal rather than reported empirical validation in the abstract.

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... representation_of_individual_knowledge_perspective_style

Agentic AI has taken on the role of assistant, collaborator, and decision-support tool.

Asserted in the paper's framing/introduction; based on synthesis of prior work and the authors' characterization of current agentic-AI deployments (no empirical sample or quantitative data reported in the abstract).

high positive From Role to Person: Trust Calibration Challenges in Twin Ag... adoption_of_agentic_roles

The evaluation harness records full trajectories and computes auditable partial-credit rewards.

System description in the paper specifying that the evaluation harness captures full action trajectories and implements an auditable partial-credit reward computation.

high positive OpenComputer: Verifiable Software Worlds for Computer-Use Ag... availability of full trajectories and partial-credit reward computation (qualita...

OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state.

Experimental comparison reported in the paper between hard-coded verifiers and LLM-as-judge evaluations, measured against human adjudication (presumably over the benchmark tasks).

high positive OpenComputer: Verifiable Software Worlds for Computer-Use Ag... alignment with human adjudication / evaluation accuracy

OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards.

System design description presented in the paper (architectural claim listing four components and their intended functions).

high positive OpenComputer: Verifiable Software Worlds for Computer-Use Ag... system architecture/components (qualitative)

In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications.

System description / reported inventory in the paper (explicit counts of covered applications and finalized tasks).

high positive OpenComputer: Verifiable Software Worlds for Computer-Use Ag... coverage of applications and tasks (count)

The paper offers a research agenda for more effective human-AI collaboration in software engineering.

Authors' concluding recommendations and agenda presented in the paper (conceptual / prescriptive contribution).

high positive Rethinking Code Review in the Age of AI: A Vision for Agenti... research directions proposed for human-AI collaboration effectiveness

Humans are retained at key decision points in the workflow to preserve judgment, accountability, and team-level understanding.

Authors' design rationale / argument for human-in-the-loop controls within their proposed workflow (conceptual justification).

high positive Rethinking Code Review in the Age of AI: A Vision for Agenti... degree of human involvement / accountability in the workflow

The proposed framework spans five stages: PR Creation, PR Augmentation, Reviewer Selection, AI-Assisted Code Review, and PR Retrospective.

Authors' explicit description of their framework stages in the paper (conceptual/design content).

high positive Rethinking Code Review in the Age of AI: A Vision for Agenti... stages of proposed PR review workflow

We present a vision for an AI-powered code review workflow combining specialized agents with human-controlled quality gates.

Paper authors' proposed conceptual framework / design contribution (framework description rather than empirical validation).

high positive Rethinking Code Review in the Age of AI: A Vision for Agenti... design of AI-powered code review workflow (presence of agents + human quality ga...

The rise of Artificial Intelligence (AI) coding assistants has increased code production velocity.

Authors' summary statement about observed effects of AI coding assistants; based on prior literature/observations rather than a reported experiment in this paper's abstract.

high positive Rethinking Code Review in the Age of AI: A Vision for Agenti... code production velocity

Compute expansion increases data-centre electricity pressure.

Public institutional data on compute expansion and data-centre electricity demand analyzed with growth indicators (CAGR, relative growth) showing rising electricity demand associated with compute capacity expansion.

high positive The Agentic Economy: Humans, AI Agents, Robots, and the Meas... data-centre electricity demand

Industrial robots represent persistent cyber-physical action capacity (as evidenced by installations and operational stock).

Use of public data on robot installations and operational stock, summarized via stock-flow ratios and related indicators to characterize persistent robotic action capacity.

high positive The Agentic Economy: Humans, AI Agents, Robots, and the Meas... robot installations / operational robot stock

AI investment signals broad capital allocation.

Public institutional data on AI investment examined with indicators such as growth multipliers, CAGR and concentration ratios to infer capital allocation patterns.

high positive The Agentic Economy: Humans, AI Agents, Robots, and the Meas... AI investment levels / capital allocation to AI

« Prev 1 2 3 … 112 113 114 … 276 277 Next »