A unified standard for AI evaluation RCTs: five validity principles and 33 operational guidelines to ensure causal credibility, repeatability and equitable assessment when measuring how AI affects human performance.
This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Drawing on experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology, we adopt the four-validity framework of Shadish et al. (2002) and extend it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. We position the principles and guidelines as serving three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Our framework extends prior work by centering evaluation on human performance rather than model output alone, formalizing causal inference through RCT methodology for AI contexts, integrating heterogeneity analysis and practical significance assessment, implementing a graded transparency and repeatability framework, and addressing AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment.
Summary
Main Finding
The paper proposes a principled, operational framework for randomized controlled trials (RCTs) that evaluate the causal effects of AI access on human outcomes. It adapts the Shadish‑Cook‑Campbell four‑validity framework (construct, internal, external, and statistical conclusion validity) and adds a fifth principle on transparency, repeatability, and verification, adapted from the TOP Guidelines. These five principles are translated into 33 actionable guidelines tailored to AI evaluation RCTs. The guidelines address AI‑specific threats (model versioning, spillovers/contamination, human–AI interaction, proprietary/closed models) and prescribe practices (pre‑specification, power for heterogeneity, graded transparency levels, effect‑size emphasis) to improve credibility and policy relevance.
Key Points
- Principles: Five core principles — Construct validity, Internal validity, External validity, Statistical conclusion validity, and Transparency/Repeatability/Verification.
- Operationalization: 33 guidelines expressed as requirements with rationales, implementation instructions, and evidence bases.
- AI‑specific threats emphasized: model updates/versioning, contamination and spillovers (interference), changing task structure due to AI, hidden researcher degrees of freedom (prompts, interface), and closed APIs limiting reproducibility.
- Construct validity concerns: common metrics (speed, token counts, fluency) can misrepresent true capability changes; avoid construct underrepresentation and construct‑irrelevant variance.
- Internal validity solutions: align unit of randomization with interference structure, monitor compliance, pin model versions, guard against unsanctioned model use.
- External validity cautions: define target populations and contexts; beware WEIRD/convenience samples and short artificial tasks that don’t reflect real work.
- Statistical practice: shift from binary hypothesis testing toward estimation; require effect sizes, confidence intervals, and sensitivity analyses; the paper proposes α = 0.005 for novel causal claims and explicit treatment of multiple outcomes and heterogeneity.
- Transparency framework: four levels:
  - Disclosure (report what practices were done),
  - Sharing (deposit outputs/data/code),
  - Verification (independent confirmation that claims match artifacts),
  - Repeatability (reproducibility, robustness, replicability as three distinct dimensions).
- Roles of the framework: design tool for planning RCTs, rubric for assessing existing studies, and blueprint for setting domain norms and standards.
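The estimation-first reporting described under "Statistical practice" can be sketched as follows. This is a minimal illustration rather than the paper's procedure: the function name `estimate_effect`, the dictionary keys, and the sample scores are all hypothetical, and the interval uses a normal approximation (a t-based interval would be more appropriate for samples this small).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def estimate_effect(treated, control, alpha=0.005):
    """Report a difference in means with a (1 - alpha) interval and Cohen's d.

    Hypothetical helper: normal approximation with unequal variances.
    """
    n_t, n_c = len(treated), len(control)
    diff = mean(treated) - mean(control)
    var_t, var_c = stdev(treated) ** 2, stdev(control) ** 2

    # Standard error of the difference (Welch-style, no pooling).
    se = sqrt(var_t / n_t + var_c / n_c)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Pooled standard deviation, used only for the standardized effect size.
    pooled_sd = sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))

    # Two-sided p-value, reported alongside the estimate rather than instead of it.
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))

    return {
        "difference": diff,
        "ci_low": diff - z_crit * se,
        "ci_high": diff + z_crit * se,
        "cohens_d": diff / pooled_sd,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Made-up task scores for an AI-access arm and a control arm.
    ai_access = [12, 14, 15, 13, 16]
    no_access = [10, 11, 12, 10, 12]
    print(estimate_effect(ai_access, no_access))
```

Setting `alpha=0.005` widens the interval relative to the conventional 0.05, which makes the cost of the stricter threshold visible in the reported estimate itself.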
Data & Methods
- Nature of the work: conceptual/methodological paper that synthesizes methods and standards across disciplines and operationalizes them for AI evaluation RCTs.
- Sources and evidence base:
  - Theoretical foundation: Shadish et al. (2002) four‑validity framework plus the TOP Guidelines (Center for Open Science).
  - Cross‑disciplinary standards reviewed: CONSORT/CONSORT‑AI, SPIRIT/SPIRIT‑AI, STARE‑HI, STREAM, and software engineering reporting standards.
  - Empirical and meta‑science evidence: literature on replication, reproducibility, reporting practices, and reanalysis statistics (e.g., Nosek et al., Aczel et al., Tyner et al.).
  - Practitioner input: draws on documented practitioner challenges (cited interviews/surveys such as Paskov et al., 2026) and domain examples.
- Operationalization: translated high‑level principles into 33 concrete guidelines; provided appendices with example threats and language for claims; specified a graded transparency/verification protocol and prescriptive statistical thresholds (e.g., α = 0.005).
- Not an original large‑N empirical trial: no single RCT dataset analyzed; instead provides normative guidance, checklists, and recommended practices for future empirical work.
Implications for AI Economics
- Improved causal inference for economic outcomes: the framework clarifies how to credibly estimate causal effects of AI access on productivity, task choice, labor supply, wages, skills acquisition, and wellbeing — outcomes central to labor and applied microeconomics.
- Treatment definition and versioning: economists must precisely define the treatment (e.g., “access to model X vY with prompts P”) and pin versions or document update policies; otherwise treatment drift biases effect estimates and undermines comparability across studies.
- Interference and randomization unit: many economic settings have spillovers (peer learning, shared tools). The guideline to align randomization with interference suggests cluster RCTs, staggered rollouts, or network‑aware designs to estimate direct and spillover effects correctly.
- Heterogeneity and power: economic questions often hinge on heterogeneous effects across skills, occupations, or firms. The paper’s emphasis on powering for heterogeneity and requiring effect‑size reporting advises larger samples or stratified designs and pre‑specified interaction analyses to detect economically meaningful differences.
- Practical significance over p-values: adopting estimation with confidence intervals and lower α for novel claims (α = 0.005) pushes economists to focus on magnitude (e.g., percentage productivity change, wage elasticities) and policy relevance rather than binary significance.
- Transparency & firm cooperation: reproducibility requirements (logs, prompts, API metadata) improve credibility but may conflict with firm confidentiality. Economics researchers should plan data agreements that allow sharing of analysis‑enabling artifacts (e.g., redacted logs, synthetic datasets, pre‑analysis plans, independent verification by trusted third parties).
- Measurement choices: construct validity guidance highlights that common proxies (speed, completion rates) may not capture welfare or skill changes. Economists should tie outcomes to economic primitives (earnings, error rates with monetary consequences, downstream task completion) and include robustness checks for construct‑irrelevant improvements (e.g., verbosity without quality).
- Policy and aggregation: standardized RCT reporting and transparency will make meta‑analysis and policy synthesis feasible (aggregating effect sizes across settings), improving regulatory and welfare assessments (deployment thresholds, training subsidies, labor market interventions).
- Practical recommendations for economists designing AI RCTs:
  - Pre‑register the protocol and analysis plan; disclose primary and secondary outcomes and multiple‑comparison adjustments.
  - Pin the model version and record model‑interaction metadata (prompts, temperature, API version); log usage for compliance and contamination detection.
  - Choose the randomization unit to reflect realistic interference; plan cluster designs or randomized encouragement when direct withholding is infeasible.
  - Account for clustering and heterogeneity in power calculations; pre‑specify minimum detectable economic effect sizes.
  - Report effect sizes and CIs; run sensitivity analyses and alternative specifications as robustness checks; share code and data where feasible.
  - Aim for at least Level 2 transparency (sharing) and, when possible, independent verification (Level 3) for high‑stakes claims.
- Limitations to bear in mind: the framework is normative and general; implementations must be adapted to context (labor markets vs. education vs. firm productivity). Proprietary model constraints will require negotiated compromises (e.g., secure enclaves, third‑party verification) that satisfy both reproducibility and confidentiality.
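The recommendation to account for clustering in power calculations can be illustrated with the standard design-effect adjustment. This is a rough sketch under stated assumptions: the function name `participants_per_arm` and its parameters are hypothetical, the formula is the usual normal-approximation sample size for a two-arm comparison of means, and the paper itself may recommend a different procedure.

```python
from math import ceil
from statistics import NormalDist


def participants_per_arm(effect_size_d, alpha=0.005, power=0.8,
                         cluster_size=1, icc=0.0):
    """Approximate participants needed per arm for a two-arm trial.

    effect_size_d: minimum detectable standardized effect (Cohen's d).
    cluster_size:  average participants per randomized cluster (1 = individual).
    icc:           intracluster correlation of the outcome.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)

    # Sample size per arm under individual randomization.
    n_individual = 2 * (z_alpha + z_power) ** 2 / effect_size_d ** 2

    # Design effect: variance inflation from randomizing whole clusters.
    design_effect = 1 + (cluster_size - 1) * icc

    return ceil(n_individual * design_effect)


if __name__ == "__main__":
    # Compare the conventional threshold, the stricter one, and a clustered design.
    print(participants_per_arm(0.5, alpha=0.05))
    print(participants_per_arm(0.5, alpha=0.005))
    print(participants_per_arm(0.5, alpha=0.005, cluster_size=10, icc=0.05))
```

Both the stricter α and team-level randomization raise the required sample, so the two choices are best budgeted together at the design stage. Powering for interaction (heterogeneity) effects typically requires more participants again and is not covered by this sketch.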
Summary takeaway for AI economics: adopting these principles will raise the credibility and comparability of causal evidence on AI’s economic impacts, but requires early planning around randomization, interference, treatment definition/versioning, measurement of economically meaningful outcomes, sufficient power for heterogeneous effects, and workable transparency agreements with providers.
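Two of the planning items named above, treatment definition/versioning and the choice of randomization unit, can be made concrete in a few lines. The sketch below is illustrative only: `TreatmentSpec`, `assign_clusters`, the field names, and the example values are hypothetical and are not taken from the paper or from any provider's API.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TreatmentSpec:
    """Pinned description of the AI treatment (hypothetical fields)."""
    model_name: str
    model_version: str
    temperature: float
    system_prompt: str

    def fingerprint(self) -> str:
        # Stable hash of the full spec, suitable for a pre-registration record.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def assign_clusters(cluster_ids, seed):
    """Randomize whole clusters (e.g., teams) to arms, reproducibly from a seed."""
    rng = random.Random(seed)
    shuffled = sorted(cluster_ids)  # sort first so input order cannot matter
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        cid: ("treatment" if i < half else "control")
        for i, cid in enumerate(shuffled)
    }


if __name__ == "__main__":
    spec = TreatmentSpec("example-model", "2025-01-01", 0.0, "You are a helpful assistant.")
    print(spec.fingerprint())
    print(assign_clusters(["team-a", "team-b", "team-c", "team-d"], seed=42))
```

Recording the fingerprint and the seed alongside the analysis plan lets a third party confirm that the deployed treatment and the realized assignment match what was specified in advance.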
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| This work establishes a foundational framework for standardizing AI evaluation RCTs (sometimes called human uplift studies). Research Productivity | positive | high | standardization of AI evaluation RCTs / evaluation methodology | 0.12 |
| The framework draws on established experimental practices from disciplines with established RCT traditions, including software engineering, economics, clinical and health sciences, and psychology. Research Productivity | positive | high | methodological comprehensiveness / interdisciplinary grounding | 0.12 |
| The paper adopts the (Shadish et al., 2002) four-validity framework and extends it with a fifth principle on transparency, repeatability, and verification adapted from the Transparency and Openness Promotion (TOP) Guidelines (Center for Open Science, 2025). Research Productivity | positive | high | methodological framework / validity criteria | 0.2 |
| We operationalize all five principles into 33 guidelines adapted for AI evaluation RCT contexts, expressed as requirements with rationales, implementation instructions, and evidence bases. Research Productivity | positive | high | availability of operational guidelines for AI RCTs | 0.2 |
| The principles and guidelines serve three key roles for AI evaluation RCTs: a design tool for planning studies, an evaluation rubric for assessing existing work, and a blueprint for standard setting as the field converges on norms. Research Productivity | positive | high | utility of the framework in planning, evaluating, and standard-setting | 0.12 |
| Our framework extends prior work by centering evaluation on human performance rather than model output alone. Research Productivity | positive | high | focus of evaluation metrics (human performance vs. model output) | 0.12 |
| The framework formalizes causal inference through RCT methodology for AI contexts. Research Productivity | positive | high | use of RCTs to support causal inference in AI evaluations | 0.12 |
| The framework integrates heterogeneity analysis and practical significance assessment. Research Productivity | positive | high | inclusion of heterogeneity and practical significance analysis in evaluation practice | 0.12 |
| The framework implements a graded transparency and repeatability framework. Research Productivity | positive | high | graded transparency and repeatability practices for AI RCTs | 0.12 |
| The framework addresses AI-specific challenges including model versioning, human-AI interaction dynamics, contamination and spillover effects, and equitable impact assessment. Research Productivity | positive | high | coverage of AI-specific methodological challenges in evaluation guidelines | 0.12 |