Evidence (7278 claims)
Search and filter individual claims pulled from the papers. If you are looking for a specific finding ("what's the effect on wages?"), you're in the right place. To compare whole outcome categories against each other, use the Evidence Explorer instead.
The board below groups claims two ways: by broad theme (nine paper-level topics) and by outcome category (the 34 claim-level outcomes that the Explorer and Syntheses also use).
Browse by theme
Nine broad, paper-level topics. Click one to filter the claims below.
Adoption: 9047 claims
Productivity: 8066 claims
Governance: 7278 claims (currently filtered)
Human-AI Collaboration: 6912 claims
Org Design: 4439 claims
Innovation: 4359 claims
Labor Markets: 3652 claims
Skills & Training: 3018 claims
Inequality: 2160 claims
Claims by outcome category
Counts by direction of finding. These are the same 34 outcome categories the Explorer compares and the Syntheses are written for. A linked row has a published synthesis.
| Outcome | Positive | Negative | Mixed | Null | Total |
|---|---|---|---|---|---|
| Other | 795 | 210 | 105 | 955 | 2131 |
| Governance & Regulation | 886 | 414 | 197 | 126 | 1654 |
| Organizational Efficiency | 826 | 204 | 129 | 87 | 1257 |
| Technology Adoption Rate | 681 | 259 | 128 | 110 | 1189 |
| Research Productivity | 464 | 138 | 65 | 349 | 1028 |
| Output Quality | 503 | 196 | 61 | 53 | 813 |
| Decision Quality | 351 | 180 | 84 | 51 | 673 |
| AI Safety & Ethics | 238 | 288 | 71 | 34 | 637 |
| Firm Productivity | 455 | 58 | 92 | 20 | 631 |
| Market Structure | 186 | 172 | 123 | 25 | 511 |
| Task Allocation | 222 | 70 | 76 | 34 | 407 |
| Innovation Output | 238 | 28 | 48 | 18 | 334 |
| Skill Acquisition | 177 | 62 | 62 | 17 | 318 |
| Employment Level | 107 | 57 | 108 | 13 | 287 |
| Fiscal & Macroeconomic | 135 | 72 | 44 | 26 | 284 |
| Firm Revenue | 172 | 50 | 28 | 5 | 256 |
| Consumer Welfare | 121 | 68 | 45 | 12 | 246 |
| Task Completion Time | 183 | 33 | 10 | 13 | 240 |
| Inequality Measures | 45 | 126 | 50 | 6 | 227 |
| Worker Satisfaction | 95 | 74 | 23 | 12 | 204 |
| Error Rate | 77 | 98 | 11 | 4 | 190 |
| Regulatory Compliance | 84 | 73 | 17 | 7 | 181 |
| Automation Exposure | 61 | 61 | 27 | 14 | 166 |
| Training Effectiveness | 98 | 21 | 14 | 19 | 154 |
| Wages & Compensation | 78 | 37 | 25 | 6 | 146 |
| Developer Productivity | 105 | 18 | 14 | 6 | 144 |
| Team Performance | 87 | 17 | 28 | 10 | 143 |
| Job Displacement | 12 | 83 | 23 | 1 | 119 |
| Hiring & Recruitment | 53 | 8 | 8 | 3 | 72 |
| Social Protection | 39 | 17 | 8 | 2 | 66 |
| Creative Output | 32 | 20 | 8 | 3 | 64 |
| Skill Obsolescence | 5 | 50 | 6 | 1 | 62 |
| Labor Share of Income | 17 | 20 | 17 | — | 54 |
| Worker Turnover | 15 | 15 | — | 3 | 33 |
| Industry | — | — | — | 1 | 1 |
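To compare rows of different sizes, the direction counts can be turned into shares. The snippet below is a minimal sketch, not part of the site: `direction_shares` is a hypothetical helper, and it divides by the sum of the four listed columns rather than by the Total column, because in some rows the two differ (for "Other", the four columns sum to 2065 against a Total of 2131). A "—" cell would be entered as 0.

```python
def direction_shares(positive, negative, mixed, null):
    """Return each direction's share of the four listed counts.

    Hypothetical helper: shares are taken over the listed columns,
    not over the table's Total column.
    """
    counts = {
        "positive": positive,
        "negative": negative,
        "mixed": mixed,
        "null": null,
    }
    listed = sum(counts.values())
    if listed == 0:
        raise ValueError("row has no listed claims")
    return {name: n / listed for name, n in counts.items()}


# Wages & Compensation row from the table above: 78 / 37 / 25 / 6.
shares = direction_shares(78, 37, 25, 6)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```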
Governance
The emergence of "Joint Agency" in corporate governance, where generative AI (GenAI) and human leaders collaborate, enhances Strategic Decision Quality (SDQ).
Paper presents this as a central theoretical claim and summarizes findings supporting it; no empirical sample size, statistical tests, or controlled experiment details provided in the summary.
Scope-based AIGC disclosure shapes consumer responses through two parallel mechanisms: perceived diagnosticity and perceived seller effort.
Hypothesized mediation mechanisms stated by authors; presented as theoretical predictions to be tested in planned lab experiment (no empirical results provided).
Scope-based AIGC disclosure can be conceptualized, drawing on cue utilization theory, as an information cue that clarifies which part of content was created by AI.
Theoretical framing presented in the paper (conceptual claim); no empirical test reported in the provided text.
Scope-based AIGC disclosure can mitigate the adverse effects (reduced trust and purchase intentions) of general AI-use disclosures in e-commerce contexts.
Hypothesized proposition in the paper; authors state they will examine whether and how scope-based disclosure mitigates adverse effects via planned lab experiment (no results provided).
Regulatory authorities have begun requiring firms to disclose their involvement with AI to enhance transparency and prevent deception.
Stated as background/context in the paper (policy trend); no specific laws, jurisdictions, or sample cited in the text provided.
The paper concludes with a research agenda advocating for a polycentric AI commons.
Concluding section of the paper proposing directions for future research and policy emphasizing polycentric commons governance approaches.
Energy and the sustainability of computation should be treated as a first-class commons-governance problem rather than merely as an externality.
Normative and conceptual argument in the paper advocating re-framing energy/sustainability concerns within a commons governance perspective.
The authors synthesize the positions of these archetypes through a maturity matrix and provide a comparative reading of them against Ostrom's eight design principles.
Methodological description in the paper: creation of a maturity matrix and comparative analysis using Ostromian principles.
The authors locate ten recurrent institutional archetypes of commons-governed AI within the taxonomy.
Synthesis of the literature and taxonomy mapping that yields ten archetypes explicitly reported in the paper.
The authors populate the taxonomy by examining the published evidence layer by layer (i.e., they perform a literature-based mapping of empirical and descriptive cases onto the taxonomy).
Paper methodology: literature review and evidence synthesis applied to each resource layer.
The taxonomy distinguishes the following resource layers held in common: data, compute, models, knowledge and evaluation, and energy.
Explicit listing of taxonomy axes and categories in the paper.
The paper contributes a two-dimensional taxonomy for commons-governed AI: one axis is the resource layer of the AI stack held in common (data, compute, models, knowledge & evaluation, energy) and the other axis is the governance function performed (derived from Ostrom design principles).
Methodological contribution described in the paper: taxonomy construction based on conceptual analysis and mapping of resources and governance functions.
The analytic vocabulary developed by Elinor Ostrom and her successors for common-pool and knowledge commons is the appropriate backbone for classifying commons-governed AI arrangements.
Theoretical argument and comparative mapping between Ostrom's common-pool/knowledge commons literature and observed AI commons arrangements.
These collective, self-organized arrangements constitute a coherent institutional family, which the authors label 'commons-governed artificial intelligence'.
Conceptual synthesis and taxonomy proposed in the paper that groups multiple observed arrangements under a common label.
The governance of artificial intelligence is overwhelmingly theorized through two institutional frames: a market frame (private goods exchanged under property and contract) and a state frame (a regulator imposes rules from above).
Argument based on a literature review and conceptual overview presented in the paper's introduction framing AI governance scholarship into two dominant institutional perspectives.
We release code, labels, and audit sheets.
Statement in paper asserting public release of code, labels, and audit sheets (no link or size provided in snippet).
On 1,000 real SWE-smith traces, trace-conditioned controls reduce CVaR95 by 72%.
Empirical evaluation on 1,000 real SWE-smith traces (sample size = 1,000) measuring the reduction in CVaR95 (tail risk) under trace-conditioned controls.
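For readers unfamiliar with the metric, CVaR95 is the average loss in the worst 5% of cases. The sketch below shows one common empirical estimator (the mean of the largest losses); the paper's exact estimator is not given in the summary, so this is an assumption, and the loss values are made up for illustration.

```python
import math


def cvar(losses, alpha=0.95):
    """Empirical CVaR: mean of the worst (1 - alpha) fraction of losses.

    Assumption: a simple tail-mean estimator, not necessarily the
    estimator used in the paper.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    ordered = sorted(losses, reverse=True)
    # round() guards against float noise such as 0.05 * 100 = 5.000000000000004
    tail_size = max(1, math.ceil(round((1 - alpha) * len(ordered), 9)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before


# Illustrative losses only (not data from the paper).
baseline = [float(x) for x in range(1, 101)]
controlled = [x * 0.5 for x in baseline]
print(relative_reduction(cvar(baseline), cvar(controlled)))
```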
A 300-trace expert audit accepts 295 labels unchanged.
Empirical audit reported in paper: expert audit on 300 traces with 295 labels unchanged (sample size = 300).
In our trace-to-loss testbed, trace-economic pricing reduces pricing MAE from $17.7K to $569 and removes regressive cross-subsidy.
Empirical evaluation on the paper's trace-to-loss testbed (testbed reported in paper; specific sample size for testbed not stated in snippet).
We introduce trace-economic underwriting, which maps tool-use traces to customer exposure and claimable loss, then uses this representation for pricing, control, and risk transfer, using deterministic economic labels rather than an LLM judge.
Description of the proposed method in the paper (methodological contribution; no empirical sample size reported here).
Automation can be made economically acceptable when its expected benefit exceeds the insurance premium, control cost, and remaining risk.
Conceptual/analytic claim from paper proposing an economic acceptability criterion (theoretical framing, no sample size reported).
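The acceptability criterion can be written as a one-line inequality. The function below is a minimal sketch under the assumption that all four quantities are expressed in the same currency and over the same period; the function and parameter names are hypothetical, and the figures in the usage lines are invented.

```python
def is_economically_acceptable(expected_benefit, premium, control_cost, residual_risk):
    """True when expected benefit exceeds premium + control cost + remaining risk.

    Assumption: all inputs share one currency unit and time horizon.
    """
    total_cost = premium + control_cost + residual_risk
    return expected_benefit > total_cost


# Invented figures for illustration only.
print(is_economically_acceptable(10_000, premium=2_000, control_cost=1_500, residual_risk=500))
print(is_economically_acceptable(3_000, premium=2_000, control_cost=1_500, residual_risk=500))
```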
DOMUS is replicable digital public infrastructure: a modular, cloud-native Software-as-a-Service architecture that can be deployed across other UK boroughs and adapted to other public administration tasks characterised by scarcity, rule-bound eligibility, and high stakes.
Design and architectural claim in the paper about modular cloud-native SaaS architecture and intended replicability; this is an assertion about generalisability rather than evidence from multi-site deployments.
The deployment maintained statutory compliance and role-based accountability.
Paper asserts that DOMUS operations preserved statutory compliance and role-based accountability during the pilot (claimed based on system design and pilot evaluation; no specific compliance audits or metrics provided in the provided text).
Results indicate high staff satisfaction.
User feedback / staff satisfaction reported from the pilot deployment (paper states high satisfaction but does not report sample size or survey statistics in the provided text).
Results indicate improved adherence to key placement constraints.
Pilot evaluation results reporting better adherence to placement constraints (e.g., bedroom need, affordability, accessibility) under DOMUS compared to manual workflows (no quantitative metrics provided in the provided text).
Results indicate substantial reductions in search time.
Findings from the pilot deployment comparing DOMUS-assisted search time to manual search workflows (paper reports reductions but does not provide numeric effect size in the provided text).
Household and property attributes are encoded into policy-consistent representations prior to AI-assisted ranking and explanation.
Technical design detail in the paper describing preprocessing/encoding pipeline used by DOMUS before AI ranking and explanation.
The system combines transparent, rule-based filtering with large language model-assisted search to standardise the application of bedroom need, affordability thresholds, geographic preferences, and accessibility requirements, while preserving officer discretion and auditability.
Design and functionality description; statement that DOMUS uses rule-based filtering plus LLM-assisted search to standardise the application of policy rules and preserve discretion and auditability.
DOMUS integrates household case records, policy-constrained affordability and suitability rules, and live private-rental listings within a single governance-aligned workflow.
System architecture and functionality described in the paper; design claim about integrated data sources and workflow alignment.
The paper documents the creation and use of DOMUS, a cloud-based, AI-enabled decision-support system built from scratch at the University of East London and customised for the needs of the London Borough of Newham to support statutory temporary accommodation placement.
Implementation and deployment description in the paper (system development and customised pilot deployment for Newham).
In a real-world fundraising study, AI was nearly 3x more effective than professional canvassers from a UK fundraising firm at raising real-money donations to Save the Children.
Final study reported in the paper: field experiment with professional canvassers from a UK fundraising firm compared to AI, measuring actual donations to Save the Children; abstract reports AI 'nearly 3x' more effective.
Converging evidence indicates AI's advantage stemmed from rapidly deploying larger quantities of information.
Analyses reported in the paper (described as 'converging evidence') linking the AI advantage to the AI's ability to produce larger amounts of information quickly; also an experimental manipulation constraining AI to human speeds/lengths where human performance matched AI.
AI's persuasive advantage persisted after experts received a coaching tool that let them practice against the AI, review their performance history, and see what AI would have said at key moments.
Follow-up study reported in the paper in which experts were given a coaching tool (practice vs AI, performance review, AI counterfactuals) and AI maintained its advantage.
AI systems out-persuaded expert humans even when expert humans chose their issues, researched in advance, underwent hours of live, structured practice, and were incentivized with £1,000 cash bonuses.
Experimental comparison reported in the preregistered studies where expert humans were allowed advance research, practiced live for hours, selected issues, and received financial incentives (£1,000) while being compared to AI systems.
Frontier AI systems were reliably more persuasive than expert humans across a series of four preregistered experiments.
Four preregistered experiments reported in the paper comparing AI systems to multiple human groups (laypeople, persuasion tournament winners, professional canvassers, world championship debaters); aggregate reported sample: n = 18,978 conversations from 6,923 people.
Beyond a turning point, higher AI investment is associated with higher ICD risk.
Substantive component of the reported U-shaped relationship estimated in regressions on the 41,725 firm-year sample; paper states increases in ICD risk beyond a turning point.
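A turning point of this kind is usually read off a quadratic specification. The sketch below assumes a model of the form risk = b0 + b1·AI + b2·AI², which the summary does not confirm; the function names are hypothetical and the coefficients are invented, not the paper's estimates.

```python
def turning_point(beta_linear, beta_quadratic):
    """Vertex of b0 + b1*x + b2*x**2, i.e. x* = -b1 / (2 * b2).

    Assumption: the U-shape comes from a quadratic term in AI investment.
    """
    if beta_quadratic == 0:
        raise ValueError("no turning point without a quadratic term")
    return -beta_linear / (2 * beta_quadratic)


def is_u_shaped(beta_quadratic):
    """A positive quadratic coefficient gives a U (risk falls, then rises)."""
    return beta_quadratic > 0


# Invented coefficients for illustration only.
b1, b2 = -4.0, 1.0
print(turning_point(b1, b2), is_u_shaped(b2))
```

With a U-shape, investment levels above the returned value fall on the rising arm, which is the region the claim describes.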
Overall, guest rating and price dominate other attributes in driving assistant recommendations, reproducing human valence-and-price primacy.
AMCE estimates from the randomized conjoint showing largest effects for guest rating and price, with the authors' interpretation that this pattern mirrors human decision patterns (valence and price primacy).
List position — a content-free artifact — causally shifts recommendations and has an effect equivalent to about $12 per night.
Randomized manipulation of list position in conjoint; reported causal effect of list position translated into a dollar-equivalent of approximately $12 per night.
Guest rating strongly increases the probability an assistant recommends a hotel: a top rating raises selection by 31.6 percentage points.
Randomized conjoint experiment estimating average marginal component effects (AMCE) of hotel attributes on recommendation probability; reported point estimate of +31.6 percentage points for top guest rating.
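When attribute levels are randomized independently, an AMCE can be estimated as a simple difference in selection rates between a level and its baseline. The sketch below illustrates that idea on toy data; it is an assumption that this matches the paper's estimator (conjoint studies often use regression with clustered standard errors instead), and the `amce` helper, field names, and profiles are all hypothetical.

```python
def amce(profiles, attribute, level, baseline):
    """Difference in selection rates between `level` and `baseline`.

    Assumption: attribute levels were assigned uniformly at random and
    independently of the other attributes.
    """
    def selection_rate(value):
        chosen = [p["chosen"] for p in profiles if p[attribute] == value]
        if not chosen:
            raise ValueError(f"no profiles with {attribute}={value!r}")
        return sum(chosen) / len(chosen)

    return selection_rate(level) - selection_rate(baseline)


# Toy profiles (not data from the paper): chosen = 1 if recommended.
profiles = [
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 1},
    {"rating": "top", "chosen": 0},
    {"rating": "top", "chosen": 1},
    {"rating": "low", "chosen": 0},
    {"rating": "low", "chosen": 1},
    {"rating": "low", "chosen": 0},
    {"rating": "low", "chosen": 0},
]
effect = amce(profiles, "rating", "top", "low")
print(f"{effect * 100:.1f} percentage points")
```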
The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.
Synthesis of the paper's theoretical results (attack-space characterization, clause designs, composition proofs) and the companion empirical validation supporting one theorem.
A two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium.
Constructive theoretical result: definition of a two-parameter premium family and proof that it satisfies operator individual rationality (IR) and weak budget balance when operators report truthfully.
We compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space.
Analytical composition proofs combining the newly proposed contract clauses with Paper A's actuarial runtime results to demonstrate joint incentive-compatibility across the enumerated attack vectors.
A model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant.
Formal mechanism-design style proof showing that the specified model-identity menu and penalty schedule render truthful reporting a weakly dominant strategy for the operator.
Interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive.
Theoretical 'interface-compliance theorem' in the paper plus an argument that escalation fees change incentives; supplemented by empirical validation on committed cross-model traces (from a companion empirical paper).
Common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure.
Theoretical construction and proof of a contract clause (common-control aggregation) that enforces boundary-potential tolling on aggregate exposure to block cross-boundary re-routing attacks.
Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses.
Analytical argument / formal proof showing that the minimal-authority and no-splitting clauses eliminate these two specific attack vectors.
We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant.
Formal theoretical characterization and proofs presented in the paper (analysis of attack surfaces and sufficient conditions for the actuarial runtime to be resistant to strategic gaming).
Over the past decade, U.S. policies have increasingly aimed to preserve artificial intelligence (AI) leadership by promoting domestic free-market policies while controlling global technological chokepoints, particularly advanced semiconductors and computational infrastructure.
Policy analysis / historical review of U.S. policy measures (export controls, industrial policy, statements by policymakers) described in the paper; no sample size reported in the excerpt.
Code, data, and harness are released under open licenses, with an anonymized review artifact.
Paper states release of code, data, and harness under open licenses and inclusion of an anonymized review artifact as part of the release.
The paper contributes a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it.
Authors' stated contributions: instrument, statistical methodology (arity-matched null), and open-source artifacts (code/data/harness) described in paper and released under open licenses.