Evidence (8974 claims)

Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other instead, use the Evidence Explorer.

The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).

Browse by theme

Nine broad, paper-level topics. Click one to filter the claims below.

Human-AI Collaboration

Claims by outcome category

Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.

Outcome	Positive	Negative	Mixed	Null	Total
Other	882	244	117	1097	2424
Governance & Regulation	1010	469	229	135	1875
Organizational Efficiency	977	235	149	90	1462
Technology Adoption Rate	781	299	143	128	1362
Research Productivity	506	155	74	363	1110
Output Quality	555	219	71	70	915
Decision Quality	395	200	95	54	751
Firm Productivity	523	67	101	27	724
AI Safety & Ethics	262	309	75	36	688
Market Structure	195	201	135	30	566
Task Allocation	248	77	96	38	464
Innovation Output	300	34	55	20	411
Skill Acquisition	207	75	65	21	368
Employment Level	138	67	119	24	350
Fiscal & Macroeconomic	156	80	53	33	329
Task Completion Time	211	38	13	16	280
Firm Revenue	183	52	29	5	270
Consumer Welfare	131	77	48	13	269
Inequality Measures	50	141	54	9	254
Worker Satisfaction	104	85	25	13	227
Error Rate	87	112	11	5	215
Automation Exposure	69	69	37	20	198
Wages & Compensation	102	49	31	11	193
Team Performance	115	30	30	11	187
Regulatory Compliance	88	74	17	7	186
Training Effectiveness	109	22	14	21	168
Developer Productivity	116	21	15	8	161
Job Displacement	12	92	26	1	131
Hiring & Recruitment	57	12	9	5	83
Skill Obsolescence	6	59	10	2	77
Social Protection	43	17	8	2	70
Creative Output	35	21	9	4	70
Labor Share of Income	18	23	17	1	59
Worker Turnover	15	16	—	4	35
Industry	—	—	—	1	1
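
As a rough illustration of how the direction counts can be read, the sketch below turns a few rows copied from the table into shares. The row selection and the helper name `direction_shares` are illustrative choices, not part of the site. Shares are computed against the sum of the four direction columns, because in some rows the Total column is larger than that sum (for example, Other: 882 + 244 + 117 + 1097 = 2340, while Total shows 2424).

```python
# Counts copied from the table above, in the order:
# (positive, negative, mixed, null).
rows = {
    "Wages & Compensation": (102, 49, 31, 11),
    "Job Displacement": (12, 92, 26, 1),
    "Developer Productivity": (116, 21, 15, 8),
}


def direction_shares(counts):
    """Return each direction's share of the four listed direction counts."""
    total = sum(counts)
    return tuple(c / total for c in counts)


for name, counts in rows.items():
    pos, neg, mixed, null = direction_shares(counts)
    print(f"{name}: {pos:.0%} positive, {neg:.0%} negative, "
          f"{mixed:.0%} mixed, {null:.0%} null")
```

A share of this kind describes how the extracted claims split by direction; it does not by itself say anything about effect sizes.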

Filtered by: Productivity

LLM agents can perform in silico biology tasks that previously required experienced human biologists.

Results from the ABC-Bench evaluation reported in the paper, where LLM agents completed multiple bio-relevant tasks (coding for liquid handlers, DNA fragment design, evasion of DNA screening) and outperformed a median expert human baseline.

high positive ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecu... ability of LLM agents to perform in silico biology tasks

Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data.

Statement in the paper's introduction framing observed trends in LLM capabilities; it implies general literature and examples, but the paper reports no specific quantitative study supporting the broad trend.

high positive ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecu... acquisition of research-relevant capabilities by LLMs

The paper proposes an evolutionary framework of AI-Economy transformation and calls for further research on governance, sustainability, and inclusive growth.

Abstract states the paper suggests an evolutionary framework and points to future research directions (governance, sustainability, inclusive growth); this is a conceptual recommendation rather than an empirical result.

high positive AI Technologies and Economic Transformation: A Systematic Re... policy/research agenda recommendations

Generative AI can transform value generation by enriching cognitive work instead of automating habitual processes.

Abstract claim synthesizing reviewed literature that generative models augment cognitive work; no empirical effect sizes or study counts given in abstract.

high positive AI Technologies and Economic Transformation: A Systematic Re... change in task allocation toward cognitive augmentation

Deep Learning (DL) hastens automation and capital deepening in high-skill industries.

Synthesis claim in abstract from reviewed literature; no specific empirical estimates or sample sizes provided in abstract.

high positive AI Technologies and Economic Transformation: A Systematic Re... automation intensity / capital deepening

Machine Learning (ML) mainly boosts productivity by increasing predictive efficiency.

Synthesis claim in abstract based on the systematic review of peer-reviewed literature (Scopus and SCI); no specific empirical studies or sample sizes cited in abstract.

high positive AI Technologies and Economic Transformation: A Systematic Re... productivity (via predictive efficiency)

The review estimates sectoral, macroeconomic, and labor market effects of ML, DL, and Generative AI.

Stated scope in abstract: review used to estimate sectoral, macroeconomic, and labor market effects; no quantitative details provided in abstract.

high positive AI Technologies and Economic Transformation: A Systematic Re... scope of estimated effects (sectoral/macroeconomic/labor market)

This paper systematically reviewed peer-reviewed journal articles indexed in the Scopus and SCI databases.

Stated method in abstract: systematic review of peer-reviewed journal articles indexed in Scopus and SCI; no sample size or study count reported in abstract.

high positive AI Technologies and Economic Transformation: A Systematic Re... use of systematic review methodology / data source coverage

AI agents (like Computer) accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.

Summary conclusion synthesizing the empirical findings from Perplexity production data comparisons (matched sessions, per-query dissatisfaction, completion time and cost estimates, and content analyses).

high positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... organizational efficiency, output quality, cost, and scope of automation

Computer lowers estimated time by 87% and estimated cost by 94% compared to humans equipped with Search alone.

Estimated time and cost comparisons derived from Perplexity matched-task completion times and cost models comparing Computer-enabled workflows to human workflows using Search.

high positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... estimated time and estimated cost

Computer reduces completion time from 269 to 36 minutes on matched tasks.

Matched-task analysis of Perplexity production data comparing task completion times when users used Computer versus Search.

high positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... task completion time (minutes)
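
For readers who want to check the arithmetic, here is a minimal sketch of the percent reduction implied by the 269 and 36 minute figures. The function name `percent_reduction` is made up for this example. The result is close to the 87% time reduction listed as a separate claim from the same paper, although whether both figures come from the same comparison is an assumption.

```python
def percent_reduction(before, after):
    """Percent by which `after` is lower than `before`."""
    return (before - after) / before * 100


# Completion times in minutes, as stated in the claim above.
reduction = percent_reduction(269, 36)
print(f"{reduction:.1f}% less time")
```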

Autonomy increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search.

Per-query dissatisfaction metric computed from Perplexity product usage data comparing Computer and Search queries (matched comparisons).

high positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... per-query dissatisfaction rate (a proxy for output quality)

Computer shifts follow-up query distribution toward higher-order work such as verification and extension.

Comparison of follow-up query types in Perplexity production data for matched tasks across Computer and Search sessions, showing relative increases in verification/extension queries after Computer use.

high positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... distribution of follow-up query types (verification, extension, etc.)

Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search.

Analysis of Perplexity production data comparing sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products (matched-session comparison).

high positive How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, ... autonomous work time per user session

Ultimately, this framework provides technology managers with a verifiable, evidence-based pathway toward resilient, net-zero Industry 5.0 ecosystems.

Conclusion/assertion in paper positioning the framework as a practical pathway; described qualitatively without empirical outcome measures or quantified evidence.

high positive Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... pathway to resilient, net-zero Industry 5.0 ecosystems (managerial guidance for ...

The architecture demonstrates how fabs can export cryptographically signed compliance tokens via International Data Spaces (IDS) connectors without exposing proprietary process recipes.

Claim of demonstration in paper; implies a prototype or illustrative workflow using cryptographic signing and IDS connectors, but no empirical deployment, sample, or measured disclosure-risk reduction reported.

high positive Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... ability to export verifiable compliance tokens while preserving recipe confident...

By executing Virtual Metrology (VM) predictions and Federated Machine Learning (FML) inside hardware-rooted Trusted Execution Environments (TEEs), this architecture resolves the Data Sovereignty Paradox.

Technical claim based on proposed use of TEEs with VM and FML in the paper; presented as conceptual/architectural resolution rather than empirically validated result.

high positive Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... resolution of the Data Sovereignty Paradox (ability to use distributed models wi...

Structured as an interoperable network protocol stack, the framework coordinates an automated, five-step "relay race" between Facility, Process Engineering, and Finance proxy teams to align factory-floor yield models with macro-level sustainability mandates.

Architectural and protocol-level description in the paper (system design); no quantitative alignment metrics or empirical validation reported.

high positive Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... alignment of factory-floor yield models with macro-level sustainability mandates...

We propose a shift from reactive automation to autonomous governance through "Professional Proxies"—role-based agentic workflows executing within hardware-isolated trust zones.

Design proposal and conceptual workflow model presented in the paper; no field trial or user study reported.

high positive Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... adoption of autonomous governance via Professional Proxies (agentic workflows in...

We introduce a zero-trust socio-technical orchestration framework that operationalizes a six-layer SSbD reference architecture within trustworthy industrial data spaces.

Proposed system architecture described in the paper (design/proposal); no reported empirical deployment or quantitative evaluation.

high positive Trustworthy Smart Fabs via Professional Proxies: Scaling Saf... operationalization of the six-layer SSbD reference architecture within industria...

Lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.

Conclusion drawn from empirical results across GEOS, OpenFOAM, and LAMMPS experiments showing SIGA enabling agents to operate simulators effectively.

high positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... practicality of general coding agents operating scientific software (combined me...

Grounding can reduce the across-seed standard deviation by 16x on the held-out GEOS set.

Reported reduction in across-seed standard deviation in held-out GEOS experiments (numerical multiplier provided).

high positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... across-seed standard deviation of performance

On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent.

Held-out evaluation comparing bare agent vs grounded agent on GEOS reported in the paper (numerical TreeSim scores provided).

high positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... TreeSim (output quality) on held-out set
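
The "roughly 10%" figure is a relative gain, not a difference in points. A small sketch of that calculation, using the two TreeSim scores from the claim (the helper name `relative_gain` is an assumption for illustration):

```python
def relative_gain(baseline, improved):
    """Gain of `improved` over `baseline`, as a fraction of the baseline."""
    return (improved - baseline) / baseline


# TreeSim scores for the bare agent and the grounded agent.
gain = relative_gain(0.720, 0.789)
print(f"absolute gain: {0.789 - 0.720:.3f}")
print(f"relative gain: {gain:.1%}")
```

The absolute gain is 0.069 TreeSim points; dividing by the 0.720 baseline gives a little under 10%.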

On GEOS, SIGA attains TreeSim above 0.90, matching the extended-budget human expert.

Empirical evaluation on GEOS reporting TreeSim score (quality metric) above 0.90 for SIGA and parity with a human expert.

high positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... TreeSim (similarity / output-quality metric)

On GEOS, SIGA produces a complete GEOS deck in about five minutes, matching an extended-budget human expert who took about three hours—a roughly 36x wall-clock speedup.

Empirical evaluation on GEOS reported in the paper (timing comparison between SIGA and an extended-budget human expert). Sample size not stated in abstract.

high positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... time to produce complete GEOS deck (wall-clock time)
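
The 36x figure follows from the two rounded durations in the claim. A minimal check, assuming exactly five minutes and exactly three hours (the claim says "about" for both, so the true ratio may differ):

```python
def speedup(baseline_minutes, new_minutes):
    """How many times faster the new duration is than the baseline."""
    return baseline_minutes / new_minutes


human_minutes = 3 * 60  # about three hours for the human expert
agent_minutes = 5       # about five minutes for SIGA
print(f"{speedup(human_minutes, agent_minutes):.0f}x wall-clock speedup")
```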

SIGA (Simulator-Interface Grounding Adapter) supplies a simulator's executable contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination.

Method description in the paper introducing SIGA (architectural/components claim).

high positive SIGA: Self-Evolving Coding-Agent Adapters for Scientific Sim... capacity to encode and supply simulator interface contract

The system maintains safety guarantees through layered authorization and rollback mechanisms.

Stated safety design: layered authorization and rollback mechanisms described as part of deployment; no empirical safety metrics provided in the abstract.

high positive Autonomous Incident Resolution at Hyperscale: An Agentic AI ... maintenance of safety guarantees via authorization and rollback

Agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories.

Reported production result in the paper claiming >90% autonomous resolution for common incident categories; the abstract does not provide sample size, time period, or incident counts.

high positive Autonomous Incident Resolution at Hyperscale: An Agentic AI ... autonomous resolution rate (percent of incidents resolved without human interven...

The architecture has been deployed in production at a major cloud provider.

Stated deployment claim in the paper (production deployment at a major cloud provider); no further deployment metrics or number of sites provided in the abstract.

high positive Autonomous Incident Resolution at Hyperscale: An Agentic AI ... production deployment at a major cloud provider

Architectural principles include hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification.

Paper lists these principles as the design foundation for the architecture; descriptive claim about system design.

high positive Autonomous Incident Resolution at Hyperscale: An Agentic AI ... architectural design features employed

The system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention.

Architectural/system description in the paper; stated implementation details of multi-agent orchestration and agent specialization.

high positive Autonomous Incident Resolution at Hyperscale: An Agentic AI ... ability of AI agents to detect, diagnose, and remediate incidents autonomously

We present an agentic AI architecture for autonomous incident resolution in large-scale network operations.

Description of a proposed system architecture in the paper (design and architectural principles), presented as the main contribution.

high positive Autonomous Incident Resolution at Hyperscale: An Agentic AI ... capability to perform autonomous incident resolution

The AARRI-Bench dataset and code are released at https://github.com/AARR-bench/AARRI-bench.

Explicit statement in the paper providing a GitHub URL for released data.

high positive Act As a Real Researcher: A Suite of Benchmarks Evaluating F... availability of dataset and code

Our results indicate that developing researcher-like AI requires further exploration of research behavior, rather than merely complex scaffolding.

Conclusion/recommendation derived from the benchmark results reported in the paper.

high positive Act As a Real Researcher: A Suite of Benchmarks Evaluating F... effectiveness of approaches (behavioral modeling vs. scaffolding) for producing ...

We propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in the AARR series that focuses on granular research scenarios rather than macro-level execution capabilities.

Paper introduces AARRI-Bench as the initial benchmark in the proposed series; claim is descriptive about the paper's contribution.

high positive Act As a Real Researcher: A Suite of Benchmarks Evaluating F... existence and focus of the AARRI-Bench benchmark

We conceptualize the AARR (Act As a Real Researcher) benchmark series to evaluate whether agents can emulate the professionalism, thoroughness, and nuanced reasoning of human researchers in granular research scenarios.

Paper proposes the AARR benchmark series and describes its focus and motivation; this is a methodological/conceptual contribution of the paper.

high positive Act As a Real Researcher: A Suite of Benchmarks Evaluating F... ability to emulate professionalism, thoroughness, and nuanced reasoning

Agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution.

Stated as background/motivating claim in the paper's introduction; no specific experimental details or sample sizes provided in the excerpt.

high positive Act As a Real Researcher: A Suite of Benchmarks Evaluating F... proficiency on complex, long-horizon coding tasks and autonomous experiment exec...

Organizations increasingly use intelligent systems for high-stakes strategic decision-making (SDM).

Introductory/background statement in the paper summarizing trends in practice and motivating the research; based on literature review and observed practice (no study data reported here).

high positive Shaping The Tool Or Shaping The Mind: An Investigation Of Du... adoption of intelligent systems for SDM

Given modest efficiency gains and limited costs, the authors recommend adjusting for AI-generated predictions as a regular empirical practice.

Policy/recommendation statement in the paper's abstract based on theoretical arguments, simulations, and three empirical applications; not a quantified causal claim.

high positive AI-Assisted Variance Reduction in Randomized Experiments recommended empirical practice adoption (guidance)

Overall, efficiency gains from adjusting for AI-generated predictions are real but modest, with greater benefits in studies that contain substantial text and other unstructured data.

Summary claim in abstract based on the paper's simulations and three empirical applications; no numeric effect sizes reported in abstract.

high positive AI-Assisted Variance Reduction in Randomized Experiments magnitude of efficiency gains (reduction in estimator variance)

The ideas are demonstrated in simulations and three empirical applications: a survey mega-study, an email marketing A/B test, and a large-scale technology platform experiment.

Statement in abstract claiming empirical demonstrations; the specific applications are named but sample sizes are not reported in the abstract.

high positive AI-Assisted Variance Reduction in Randomized Experiments empirical performance of the adjustment approach across simulated and real exper...

The paper provides implementation guidance, including how to obtain continuous scores from discrete LLM outputs and how to use LLMs to featurize unstructured inputs as auxiliary covariates.

Stated in abstract that the paper contains practical implementation guidance; evidence basis is descriptive (contents of the paper) rather than empirical.

high positive AI-Assisted Variance Reduction in Randomized Experiments availability of practical implementation procedures

No new estimators are required: researchers can simply include AI predictions as covariates in standard regression adjustment, analogous to adjusting for a prognostic score.

Methodological claim in the paper (abstract) presenting the approach as compatible with standard regression adjustment; presumably supported by theoretical exposition in the paper.

high positive AI-Assisted Variance Reduction in Randomized Experiments feasibility of using standard regression adjustment (methodological correctness)

AI-generated predictions can be used to reduce variance in randomized experiments by including them as covariates in regression adjustment.

Theoretical argument in the paper plus demonstrations via simulations and three empirical applications (survey mega-study, email marketing A/B test, large-scale platform experiment) as stated in the abstract.

high positive AI-Assisted Variance Reduction in Randomized Experiments variance of the randomized experiment estimator / statistical efficiency
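
To make the idea concrete, here is a minimal simulation sketch on synthetic data, not the paper's code or data. A noisy predictor `x` stands in for an AI-generated prediction, and the adjusted estimate is the treatment coefficient from an ordinary least squares fit of `y ~ 1 + t + x`, written in its closed ANCOVA form. All names, sample sizes, and the data-generating process are assumptions chosen for illustration.

```python
import random
import statistics


def diff_in_means(y, t):
    """Unadjusted estimate: mean outcome of treated minus control."""
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    return statistics.fmean(treated) - statistics.fmean(control)


def adjusted_effect(y, t, x):
    """OLS coefficient on t from y ~ 1 + t + x (closed ANCOVA form)."""
    sxy = sxx = 0.0
    means = {}
    for arm in (0, 1):
        xs = [xi for xi, ti in zip(x, t) if ti == arm]
        ys = [yi for yi, ti in zip(y, t) if ti == arm]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        means[arm] = (mx, my)
        # Accumulate within-arm sums for the pooled slope on x.
        sxy += sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx += sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    (mx0, my0), (mx1, my1) = means[0], means[1]
    # Difference in means, corrected for chance imbalance in x.
    return (my1 - my0) - slope * (mx1 - mx0)


def simulate(reps=300, n=200, effect=1.0, seed=0):
    """Repeat a synthetic experiment and collect both estimates."""
    rng = random.Random(seed)
    raw, adj = [], []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]  # stand-in AI prediction
        t = [1] * (n // 2) + [0] * (n - n // 2)
        rng.shuffle(t)                            # random assignment
        y = [xi + effect * ti + rng.gauss(0, 1) for xi, ti in zip(x, t)]
        raw.append(diff_in_means(y, t))
        adj.append(adjusted_effect(y, t, x))
    return raw, adj


raw, adj = simulate()
print("variance, unadjusted:", statistics.variance(raw))
print("variance, adjusted:  ", statistics.variance(adj))
```

In this setup the predictor explains about half of the outcome variance, so the adjusted estimator is noticeably less variable. With a weaker predictor the gain shrinks, which is in line with the "real but modest" summary in the claims above.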

Generative AI and large language models can produce realistic predictions of human behavior from rich, unstructured inputs with little to no task-specific training data.

Statement in paper's introduction/abstract referencing recent work that uses LLMs to predict human responses from unstructured inputs; likely supported by literature citations and examples but no sample size reported in abstract.

high positive AI-Assisted Variance Reduction in Randomized Experiments accuracy/realism of AI-generated predictions of human behavior

Under a completeness-aware decision utility (informed decision-quality = decision-quality × gold-coverage), C reaches 7.43, compared with 1.76 for A and 2.57 for B.

Computed metric (decision-quality × gold-coverage) applied to agents' measured decision-quality and gold-coverage on the benchmark (13 assets).

high positive AI Scientists Are Only as Good as Their Evidence: A Stratifi... informed decision-quality (decision-quality × gold-coverage)
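
The metric is a simple product, so a high decision-quality score is discounted when the agent recovers only part of the gold record. A minimal sketch with hypothetical inputs (the per-agent decision-quality and coverage values behind the reported scores are not given here, so none are reconstructed; the function name and the [0, 1] range check are assumptions):

```python
def informed_decision_quality(decision_quality, gold_coverage):
    """Decision quality discounted by the fraction of gold evidence recovered."""
    if not 0.0 <= gold_coverage <= 1.0:
        raise ValueError("gold_coverage is assumed to be a fraction in [0, 1]")
    return decision_quality * gold_coverage


# Hypothetical agents with the same decision quality but different coverage.
full = informed_decision_quality(8.0, 1.0)
partial = informed_decision_quality(8.0, 0.25)
print(full, partial)
```

Two agents that score the same on decision quality can therefore end up far apart on this metric if one of them bases its decision on much less of the available evidence.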

On the curated long-tail subset, agent C reaches 0.93 recovered coverage vs. A 0.26 and B 0.30.

Subset analysis (long-tail subset) of the curated gold record recovery fractions comparing A, B, and C; subset size not specified in abstract.

high positive AI Scientists Are Only as Good as Their Evidence: A Stratifi... fraction of curated gold record recovered on long-tail subset (gold-coverage)

Across the same benchmark, B increases objectivity from 3.16 to 3.30 (vs. A).

Measured objectivity score comparison between agent A and agent B on the 13-asset stratified benchmark.

high positive AI Scientists Are Only as Good as Their Evidence: A Stratifi... objectivity score

Across a 13-asset stratified benchmark, adding the public structured tools and playbook (B) improves tier-in-range accuracy from 0.80 to 0.89 (vs. A).

Comparison of tier-in-range accuracy between agent A and agent B on the 13-asset stratified benchmark (controlled ablation).

high positive AI Scientists Are Only as Good as Their Evidence: A Stratifi... tier-in-range accuracy

For knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access.

Controlled three-arm ablation experiment on a production drug-asset valuation agent across a 13-asset stratified benchmark comparing variants with differing access to structured/public/proprietary evidence.

high positive AI Scientists Are Only as Good as Their Evidence: A Stratifi... AI Scientist capability limited by available evidence substrate (affecting decis...
