The Commonplace

A unified standard for AI evaluation RCTs: five validity principles and 33 operational guidelines to ensure causal credibility, repeatability and equitable assessment when measuring how AI affects human performance.

Principles and Guidelines for Randomized Controlled Trials in AI Evaluation
Christopher Kelly, Angelica Chowdhury, Alexandra Campili, Bimpe Ayoola, Devin Barbour, Thomas Chen Dawson, Ze Shen Chin, Rokas Gipiškis · May 03, 2026
Source: arXiv · Type: theoretical · Evidence: n/a · Relevance: 8/10
Proposes a five-principle, 33-guideline framework that adapts RCT validity and TOP transparency standards to standardize human-focused AI evaluation trials and support robust causal measurement of AI's effects on human performance.

This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the (Shadish et al., 2002) four-validity framework and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.

Summary

Main Finding

The paper proposes a principled, operational framework for randomized controlled trials (RCTs) that evaluate the causal effects of AI access on human outcomes. It adapts the Shadish‑Cook‑Campbell four‑validity framework (construct, internal, external, statistical conclusion) and adds a fifth principle on transparency, repeatability, and verification (from TOP). These five principles are translated into 33 actionable guidelines tailored to AI evaluation RCTs, addressing AI‑specific threats (model versioning, spillovers/contamination, human–AI interaction, proprietary/closed models) and prescribing practices (pre‑specification, power for heterogeneity, graded transparency levels, effect‑size emphasis) to improve credibility and policy relevance.

Key Points

  • Principles: five core principles covering construct validity, internal validity, external validity, statistical conclusion validity, and transparency/repeatability/verification.
  • Operationalization: 33 guidelines expressed as requirements with rationales, implementation instructions, and evidence bases.
  • AI‑specific threats emphasized: model updates/versioning, contamination and spillovers (interference), changing task structure due to AI, hidden researcher degrees of freedom (prompts, interface), and closed APIs limiting reproducibility.
  • Construct validity concerns: common metrics (speed, token counts, fluency) can misrepresent true capability changes; avoid construct underrepresentation and construct‑irrelevant variance.
  • Internal validity solutions: align unit of randomization with interference structure, monitor compliance, pin model versions, guard against unsanctioned model use.
  • External validity cautions: define target populations and contexts; beware WEIRD/convenience samples and short artificial tasks that don’t reflect real work.
  • Statistical practice: shift from binary hypothesis testing toward estimation; require effect sizes, confidence intervals, and sensitivity analyses; the paper proposes α = 0.005 for novel causal claims and explicit treatment of multiple outcomes and heterogeneity.
  • Transparency framework: four levels:
    • Disclosure (report what practices were done),
    • Sharing (deposit outputs/data/code),
    • Verification (independent confirmation that claims match artifacts),
    • Repeatability (reproducibility, robustness, replicability as three distinct dimensions).
  • Roles of the framework: design tool for planning RCTs, rubric for assessing existing studies, and blueprint for setting domain norms and standards.
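
The estimation-first reporting described under "Statistical practice" can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `estimate_effect`, the normal-approximation interval, and the task scores are all assumptions made for illustration. It reports a difference in means, a confidence interval at the stricter α = 0.005 level, and a standardized effect size (Cohen's d) rather than a bare significant/not-significant verdict.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def estimate_effect(treated, control, alpha=0.005):
    """Difference in means with a (1 - alpha) confidence interval and Cohen's d.

    Assumption: a normal approximation with a Welch standard error.
    With small samples a t-based interval would be wider.
    """
    n_t, n_c = len(treated), len(control)
    diff = mean(treated) - mean(control)
    var_t, var_c = variance(treated), variance(control)  # sample variances
    se = sqrt(var_t / n_t + var_c / n_c)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    pooled_sd = sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
    return {
        "diff": diff,
        "ci": (diff - z * se, diff + z * se),
        "cohens_d": diff / pooled_sd,
    }


# Hypothetical task scores (made-up numbers, not results from the paper).
treated = [72, 75, 78, 80, 74, 77, 79, 76]
control = [70, 71, 73, 69, 72, 74, 68, 71]

result = estimate_effect(treated, control)
print(result)
```

Calling `estimate_effect(treated, control, alpha=0.05)` on the same data gives a narrower interval, which makes the cost of the stricter threshold visible in the units of the outcome.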

Data & Methods

  • Nature of the work: conceptual/methodological paper that synthesizes methods and standards across disciplines and operationalizes them for AI evaluation RCTs.
  • Sources and evidence base:
    • Theoretical foundation: Shadish et al. (2002) four‑validity framework plus the TOP Guidelines (Center for Open Science).
    • Cross‑disciplinary standards reviewed: CONSORT/CONSORT‑AI, SPIRIT/SPIRIT‑AI, STARE‑HI, STREAM, and software engineering reporting standards.
    • Empirical and meta‑science evidence: literature on replication, reproducibility, reporting practices, and reanalysis statistics (e.g., Nosek et al., Aczel et al., Tyner et al.).
    • Practitioner input: draws on documented practitioner challenges (cited interviews/surveys such as Paskov et al., 2026) and domain examples.
  • Operationalization: translated high‑level principles into 33 concrete guidelines; provided appendices with example threats and language for claims; specified a graded transparency/verification protocol and prescriptive statistical thresholds (e.g., α = 0.005).
  • Not an original large‑N empirical trial: no single RCT dataset analyzed; instead provides normative guidance, checklists, and recommended practices for future empirical work.

Implications for AI Economics

  • Improved causal inference for economic outcomes: the framework clarifies how to credibly estimate causal effects of AI access on productivity, task choice, labor supply, wages, skills acquisition, and wellbeing — outcomes central to labor and applied microeconomics.
  • Treatment definition and versioning: economists must precisely define the treatment (e.g., “access to model X vY with prompts P”) and pin versions or document update policies; otherwise treatment drift biases effect estimates and undermines comparability across studies.
  • Interference and randomization unit: many economic settings have spillovers (peer learning, shared tools). The guideline to align randomization with interference suggests cluster RCTs, staggered rollouts, or network‑aware designs to estimate direct and spillover effects correctly.
  • Heterogeneity and power: economic questions often hinge on heterogeneous effects across skills, occupations, or firms. The paper’s emphasis on powering for heterogeneity and requiring effect‑size reporting advises larger samples or stratified designs and pre‑specified interaction analyses to detect economically meaningful differences.
  • Practical significance over p-values: adopting estimation with confidence intervals and lower α for novel claims (α = 0.005) pushes economists to focus on magnitude (e.g., percentage productivity change, wage elasticities) and policy relevance rather than binary significance.
  • Transparency & firm cooperation: reproducibility requirements (logs, prompts, API metadata) improve credibility but may conflict with firm confidentiality. Economics researchers should plan data agreements that allow sharing of analysis‑enabling artifacts (e.g., redacted logs, synthetic datasets, pre‑analysis plans) and that permit independent verification by trusted third parties.
  • Measurement choices: construct validity guidance highlights that common proxies (speed, completion rates) may not capture welfare or skill changes. Economists should tie outcomes to economic primitives (earnings, error rates with monetary consequences, downstream task completion) and include robustness checks for construct‑irrelevant improvements (e.g., verbosity without quality).
  • Policy and aggregation: standardized RCT reporting and transparency will make meta‑analysis and policy synthesis feasible (aggregating effect sizes across settings), improving regulatory and welfare assessments (deployment thresholds, training subsidies, labor market interventions).
  • Practical recommendations for economists designing AI RCTs:
    • Pre‑register protocol and analysis plan; disclose primary and secondary outcomes and multiple comparison adjustments.
    • Pin model version and record model‑interaction metadata (prompts, temperature, API version); log usage for compliance and contamination detection.
    • Choose randomization unit to reflect realistic interference; plan cluster designs or randomized encouragement when direct withholding is infeasible.
    • Power calculations should account for clustering and heterogeneity; pre‑specify minimum detectable economic effect sizes.
    • Report effect sizes and CIs; run sensitivity analyses and alternative specifications as robustness checks; share code/data where feasible.
    • Aim for at least Level 2 transparency (sharing) and, when possible, independent verification (Level 3) for high‑stakes claims.
  • Limitations to bear in mind: the framework is normative and general; implementations must be adapted to context (labor markets vs. education vs. firm productivity). Proprietary model constraints will require negotiated compromises (e.g., secure enclaves, third‑party verification) that satisfy both reproducibility and confidentiality.
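
The recommendation that power calculations account for clustering can be sketched with the standard design-effect adjustment. This is an illustrative sketch, not the paper's procedure: the function names, the normal-approximation sample-size formula for a two-arm comparison of means, and the example inputs (standardized effect 0.3, clusters of 20, intracluster correlation 0.05) are assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size_d, alpha=0.005, power=0.8):
    """Individuals per arm for a two-sided test of a standardized mean
    difference, using the normal approximation
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2)


def design_effect(cluster_size, icc):
    """Variance inflation from clustering: 1 + (m - 1) * ICC,
    assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def clusters_per_arm(effect_size_d, cluster_size, icc, alpha=0.005, power=0.8):
    """Clusters per arm after inflating the individual-level sample size."""
    inflated = n_per_arm(effect_size_d, alpha, power) * design_effect(cluster_size, icc)
    return ceil(inflated / cluster_size)


# Hypothetical planning inputs (assumptions, not values from the paper).
print(n_per_arm(0.3, alpha=0.05))   # conventional threshold
print(n_per_arm(0.3, alpha=0.005))  # stricter threshold for novel claims
print(clusters_per_arm(0.3, cluster_size=20, icc=0.05))
```

The sketch covers only a main effect. Powering for heterogeneous effects, as the guidelines ask, generally requires larger samples, because interaction contrasts are estimated less precisely than the overall difference.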

Summary takeaway for AI economics: adopting these principles will raise the credibility and comparability of causal evidence on AI’s economic impacts, but requires early planning around randomization, interference, treatment definition/versioning, measurement of economically meaningful outcomes, sufficient power for heterogeneous effects, and workable transparency agreements with providers.
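
Two of the planning steps named above, treatment definition/versioning and choice of randomization unit, can be sketched together. Everything in this Python sketch is hypothetical: the `TreatmentSpec` fields, the model and API version strings, the team identifiers, and the seed are placeholders, not values or interfaces from the paper or any provider.

```python
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TreatmentSpec:
    """Pinned treatment definition. Field values used below are hypothetical."""
    model_name: str
    model_version: str
    api_version: str
    temperature: float
    system_prompt: str


def assign_clusters(cluster_ids, seed):
    """Assign whole clusters (e.g., teams) to treatment or control.

    Randomizing at the cluster level keeps people who share tools or
    advice in the same arm, which limits spillover between arms.
    """
    rng = random.Random(seed)       # fixed seed makes the assignment repeatable
    shuffled = sorted(cluster_ids)  # sort first so input order does not matter
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        cid: ("treatment" if i < half else "control")
        for i, cid in enumerate(shuffled)
    }


spec = TreatmentSpec(
    model_name="model-x",            # placeholder name
    model_version="v1-pinned",       # placeholder version string
    api_version="2026-01-01",        # placeholder API version
    temperature=0.2,
    system_prompt="You are a coding assistant.",
)
teams = ["team-a", "team-b", "team-c", "team-d", "team-e", "team-f"]
assignment = assign_clusters(teams, seed=20260503)

# One record that could be deposited alongside the pre-analysis plan.
log_record = {"treatment": asdict(spec), "assignment": assignment}
print(log_record)
```

Freezing the dataclass prevents the treatment definition from being changed silently mid-study; a model update would require a new, separately logged `TreatmentSpec`.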

Assessment

Paper Type: theoretical

Evidence Strength: n/a. This is a methodological/framework paper that proposes standards and guidelines rather than reporting new empirical estimates or experimental results.

Methods Rigor: high. The framework is grounded in established experimental validity literature (Shadish et al.) and TOP transparency standards, operationalizes principles into 33 actionable guidelines with rationales and implementation notes, and explicitly addresses common threats (contamination, versioning, heterogeneity, equity), indicating careful, multidisciplinary synthesis.

Sample: No empirical sample; the paper synthesizes methods, best practices, and standards from multiple disciplines (software engineering, economics, clinical/health sciences, psychology) and existing guidelines (Shadish et al., TOP Guidelines) to operationalize RCT design and transparency for AI-human evaluation contexts.

Themes: human_ai_collab, productivity, governance

Identification: Advocates randomized controlled trials (RCTs) as the primary causal identification strategy: random assignment to treatment/control, pre-specification (pre-analysis plans), blocking/stratification, handling contamination and spillovers, intent-to-treat and compliance analysis, heterogeneity and practical-significance assessment, and versioning protocols to preserve identification in dynamic model deployments.

Generalizability:

  • Designed for human-centered AI evaluation RCTs; less applicable to purely algorithmic or model-only benchmarking.
  • Requires feasibility of randomization; limited relevance where random assignment is infeasible or unethical.
  • Resource and operational constraints in many industry/field settings may restrict full adherence (e.g., model versioning logistics, large-sample needs for heterogeneity analysis).
  • Domain-specific regulatory, privacy, or safety constraints (healthcare, finance) may necessitate adaptations.
  • Fast model iteration or deployment-as-a-service settings may complicate repeatability and exact protocol replication.

Claims (10)

  • This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: standardization of AI evaluation RCTs / evaluation methodology · Details: 0.12
  • The framework draws on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: methodological comprehensiveness / interdisciplinary grounding · Details: 0.12
  • The paper adopts the (Shadish et al., 2002) four-validity framework and extends it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: methodological framework / validity criteria · Details: 0.2
  • We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: availability of operational guidelines for AI RCTs · Details: 0.2
  • The principles and guidelines serve three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: utility of the framework in planning, evaluating, and standard-setting · Details: 0.12
  • Our framework extends prior work by centering evaluation on human performance rather than model output alone. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: focus of evaluation metrics (human performance vs. model output) · Details: 0.12
  • The framework formalizes causal inference through RCT methodology for AI contexts. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: use of RCTs to support causal inference in AI evaluations · Details: 0.12
  • The framework integrates heterogeneity analysis and practical significance assessment. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: inclusion of heterogeneity and practical significance analysis in evaluation practice · Details: 0.12
  • The framework implements a graded transparency and repeatability framework. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: graded transparency and repeatability practices for AI RCTs · Details: 0.12
  • The framework addresses AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment. [Research Productivity]
    • Direction: positive · Confidence: high · Outcome: coverage of AI-specific methodological challenges in evaluation guidelines · Details: 0.12

Notes