AI-Assisted Variance Reduction in Randomized Experiments

Generative AI and large language models can produce realistic predictions of human behavior from rich, unstructured inputs with little to no task-specific training data. Recent work uses these “digital twin” predictions to supplement human responses in surveys and experiments. We study the special case of using AI-generated predictions to reduce variance in randomized experiments. We argue that doing so requires no new estimators and that researchers can simply include AI predictions as covariates in standard regression adjustment, analogous to adjusting for a prognostic score. A benefit of this approach is a “do no harm” property whereby the adjusted estimator reverts to the unadjusted difference in means when predictions are uninformative. Other methods, such as variants of prediction-powered inference, do not have this guarantee. We provide implementation guidance, including how to obtain continuous scores from discrete LLM outputs and how to use LLMs to featurize unstructured inputs as auxiliary covariates. We demonstrate these ideas in simulations and three empirical applications: a survey mega-study, an email marketing A/B test, and a large-scale technology platform experiment. Overall, efficiency gains are real if modest, with greater benefits in studies that contain substantial text and other unstructured data. We also confirm the do no harm property empirically. Given these gains and limited costs, we recommend adjusting for AI-generated predictions as a regular empirical practice.

Summary

Main Finding

Including AI-generated predictions (e.g., LLM “digital twin” outputs or other black-box model forecasts) as covariates in standard regression adjustment reduces variance in randomized experiments whenever the predictions are informative, and it carries a practical “do no harm” guarantee: when predictions are uninformative, the adjusted estimator reverts to the unadjusted difference in means. No new estimators are required: calibrated regression adjustment (equivalently, adjustment for an estimated prognostic score, or a linearly calibrated form of PPI/AIPW) gives unbiased or asymptotically unbiased treatment effect estimates with weakly smaller asymptotic variance.

Key Points

Core idea: treat AI predictions Æ_i(z) as auxiliary covariates and include them in regression adjustment. This is analogous to adjusting for an estimated prognostic score.
Do no harm: calibrated regression adjustment never increases asymptotic variance. If corr(Y(z), Æ(z)) = 0, the design effect equals 1 and the adjusted estimator is asymptotically equivalent to the unadjusted difference in means.
Simple implementation: run OLS of outcome on treatment indicator and mean-centered predictions (optionally include treatment × prediction interaction). Use heteroskedasticity-robust SEs. The interacted specification (Lin-style) removes small O(1/n) finite-sample bias.
Efficiency gain formula (intuition): the design effect in arm z is approximately 1 − ρ(z,z)^2, where ρ(z,z) is the correlation between the outcome Y(z) and the AI prediction Æ(z) in that arm. Gains therefore depend directly on prediction quality.
Comparisons:
- Model-assisted / uncalibrated PPI: can be unbiased but can inflate variance unless predictions are strong (variance falls only above a correlation threshold, e.g., ρ > 0.5 in some cases).
- Digital-twin / imputation estimator (imputing missing potential outcomes with AI predictions): biased unless the AI predictions are correct on average, which is risky to assume in practice.
- Regression adjustment is equivalent (in randomized experiments) to a calibrated AIPW / prognostic-score ANCOVA and to certain CUPED/CALM calibrated variants.
Practical tips for discrete outcomes: use continuous predicted probabilities (extract log-probabilities from token distributions, average across multiple stochastic draws) rather than discretized outputs.
LLMs are most valuable when predictive signal lies in unstructured or textual inputs; with purely structured tabular features, traditional methods (ridge, XGBoost) can match or beat LLM predictions. Combining LLM-derived features and tabular covariates often performs best.
Empirical results: simulations and three applications (survey mega-study with digital twins, email marketing A/B test, large-scale platform experiment) show modest but real efficiency gains; gains larger when unstructured text is present. The do-no-harm property holds empirically.
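
The “simple implementation” point above can be sketched in a few lines. This is a minimal illustration on simulated data, not the source's code: it assumes a single pre-treatment prediction per unit (the source also allows arm-specific predictions), the function names `diff_in_means` and `lin_adjusted_ate` are hypothetical, and HC2 is just one common choice of heteroskedasticity-robust standard error.

```python
import numpy as np

def diff_in_means(y, z):
    """Unadjusted estimate with the usual two-sample (Neyman) standard error."""
    y1, y0 = y[z == 1], y[z == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return est, se

def lin_adjusted_ate(y, z, pred):
    """OLS of y on [1, z, c, z*c], with c the mean-centered prediction.

    Returns the coefficient on z and its HC2 robust standard error.
    """
    c = pred - pred.mean()
    X = np.column_stack([np.ones_like(y), z, c, z * c])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    resid = y - X @ beta
    # Leverage h_i = x_i' (X'X)^{-1} x_i, used by the HC2 correction.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    meat = X.T @ (X * (resid**2 / (1.0 - leverage))[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return beta[1], np.sqrt(cov[1, 1])

# Simulated experiment (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
n, tau = 2000, 1.0
pred = rng.normal(size=n)                       # informative prediction
y0 = 0.8 * pred + 0.6 * rng.normal(size=n)      # correlation with pred near 0.8
z = rng.permutation(np.repeat([0.0, 1.0], n // 2))  # complete randomization
y = y0 + tau * z                                # constant treatment effect
noise_pred = rng.normal(size=n)                 # uninformative prediction

ate_dm, se_dm = diff_in_means(y, z)
ate_adj, se_adj = lin_adjusted_ate(y, z, pred)
ate_noise, se_noise = lin_adjusted_ate(y, z, noise_pred)

print(f"unadjusted:          {ate_dm:.3f} (SE {se_dm:.4f})")
print(f"adjusted, good pred: {ate_adj:.3f} (SE {se_adj:.4f})")
print(f"adjusted, noise:     {ate_noise:.3f} (SE {se_noise:.4f})")
```

In this setup the squared SE ratio for the informative prediction should land near 1 − ρ², while the noise prediction should leave the standard error essentially unchanged, which is the do-no-harm behavior described above.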

Data & Methods

Framework: finite-population randomized experiment targeting the sample average treatment effect (SATE). Predictions Æ_i(z) are allowed for both potential outcomes and may depend on z. The only source of randomness is the treatment assignment Z.
Estimators discussed:
- Unadjusted difference-in-means (baseline).
- Model-assisted bias-corrected estimator (control-variates / PPI style): unbiased but can inflate variance when predictions are weak.
- Calibrated regression adjustment: estimate within-arm slopes β̂_z (OLS of Y on Æ(z) within arm z) or fit a single OLS Y ~ 1 + Z + Æ(Z) (+ Z×Æ(Z)). This yields the calibrated estimator with guaranteed weak variance reduction.
- Digital-twin (imputation) estimator: impute missing potential outcome by AI predictions — biased unless Æ is correct on average.
- Equivalences: calibrated regression = AIPW with Æ as outcome model = ANCOVA on estimated prognostic score.
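
The contrast between the uncalibrated and calibrated estimators can be seen in a short variance calculation. The following is a simplified sketch for a single arm mean, not the source's derivation: A_i stands in for the prediction written Æ_i(z) above, and β, σ_Y, σ_A are notation introduced here for illustration. The full two-arm variance also involves the covariance between potential outcomes, which this sketch ignores.

```latex
% Control-variate estimator of one arm mean with a fixed slope \beta.
% \bar{A}_{\text{arm}}: mean prediction in the arm; \bar{A}: full-sample mean.
\hat{\mu}(\beta) = \bar{Y}_{\text{arm}} - \beta\,(\bar{A}_{\text{arm}} - \bar{A})

% Variance relative to the unadjusted arm mean, for fixed \beta:
\frac{\operatorname{Var}\{\hat{\mu}(\beta)\}}{\operatorname{Var}\{\bar{Y}_{\text{arm}}\}}
  = 1 - 2\beta\rho\,\frac{\sigma_A}{\sigma_Y} + \beta^2\,\frac{\sigma_A^2}{\sigma_Y^2}

% Uncalibrated (\beta = 1): the ratio is below 1 only if
\rho > \frac{\sigma_A}{2\,\sigma_Y}
\quad\text{(that is, } \rho > 0.5 \text{ when } \sigma_A = \sigma_Y\text{)}

% Calibrated (\beta^\ast = \rho\,\sigma_Y / \sigma_A, the OLS slope):
\frac{\operatorname{Var}\{\hat{\mu}(\beta^\ast)\}}{\operatorname{Var}\{\bar{Y}_{\text{arm}}\}}
  = 1 - \rho^2 \le 1
```

Under these assumptions the calibrated slope can never do worse than β = 0 (no adjustment), whereas fixing β = 1 helps only when the prediction is strong relative to its own variance.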
Inference: use heteroskedasticity-robust SEs; Lin-style interacted regression reduces finite-sample bias.
Handling discrete/count outcomes: adjust for continuous predicted probabilities; obtain them from token log-probabilities, or average multiple model samples to get smoother scores; prefer continuous scores over thresholded outputs.
Feature engineering: LLMs can featurize unstructured inputs (summaries, embeddings, categorical labels) to augment traditional covariates; include these features in the regression adjustment.
Empirical evaluation: synthetic simulations plus three real-world applications (survey mega-study with digital twins, an email A/B test with tabular covariates, and a large-scale platform experiment). Across these, regression adjustment with AI predictions produced modest efficiency gains and preserved unbiasedness.
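
The two routes to a continuous score mentioned above (token log-probabilities and averaging repeated draws) might look like the sketch below. The input format is an assumption made for illustration: a plain dictionary mapping candidate answer tokens to log-probabilities, and a list of 0/1 answers from repeated sampling. It does not correspond to any particular provider's API, and the function names are hypothetical.

```python
import math

def score_from_logprobs(token_logprobs, positive=("yes",), negative=("no",)):
    """Probability of a positive answer, renormalized over answer tokens only.

    token_logprobs: dict of token -> log-probability (hypothetical format).
    Tokens that are neither positive nor negative are ignored.
    """
    pos = neg = 0.0
    for token, logprob in token_logprobs.items():
        label = token.strip().lower()   # treat "Yes" and " yes" alike
        if label in positive:
            pos += math.exp(logprob)
        elif label in negative:
            neg += math.exp(logprob)
    if pos + neg == 0.0:
        raise ValueError("no answer tokens found in token_logprobs")
    return pos / (pos + neg)

def score_from_samples(binary_answers):
    """Share of positive answers across repeated stochastic draws."""
    return sum(binary_answers) / len(binary_answers)

# Made-up numbers: the single most likely token is "Yes", so a thresholded
# output would be 1, while the continuous score keeps the uncertainty.
example_logprobs = {
    "Yes": math.log(0.6),
    " yes": math.log(0.1),
    "No": math.log(0.2),
    "Maybe": math.log(0.1),
}
logprob_score = score_from_logprobs(example_logprobs)
sample_score = score_from_samples([1, 0, 1, 1])
print(round(logprob_score, 3), sample_score)
```

Either score can then enter the regression adjustment as an ordinary continuous covariate.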

Implications for AI Economics

Practical recommendation for experimenters: routinely include AI-generated predictions (and LLM-derived features for text) as covariates in regression adjustment for randomized experiments. Low implementation cost, minimal risk (do-no-harm property), and measurable efficiency gains justify adoption as standard practice.
Efficiency & cost tradeoffs: modest variance reductions translate into smaller required sample sizes or higher power for the same sample budget. This is economically valuable for firms/researchers running many A/B tests or costly field trials — especially when experiments involve textual or other unstructured data that LLMs can exploit.
When to expect large gains: settings where treatment-relevant signal is contained in rich unstructured inputs (messages, open-ended survey text, user-generated content). In fully structured tabular domains, gains may be limited and traditional ML can suffice.
Design and policy applications: improved precision enables more granular heterogeneity analysis, faster iteration for product design, and more efficient allocation of testing budgets. LLM featurization can also facilitate stratification and more informed adaptive designs.
Cautions and limits:
- Prediction quality matters: poor predictions yield limited gains; uncalibrated uses (digital twins as substitutes, uncalibrated PPI) risk bias or inflated variance.
- Operational costs and stability: generating large-scale LLM predictions incurs compute costs and introduces dependencies on model versions (drift, reproducibility). These economic costs should be weighed against sample-size savings.
- Not a substitute for real data: while AI helps efficiency, substituting synthetic respondents for human data without correction risks bias; the recommended practice is hybrid adjustment, not replacement.
Research directions for AI economics: quantify the return-on-investment of model-assisted adjustment across domains (compute cost vs. sample-size savings), study model update / drift effects on long-running test portfolios, and integrate AI-featurization into optimal experimental design and stratification policies.
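
The cost tradeoff discussed in this section can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes the approximation that variance scales with 1 − ρ² (taking the same correlation in both arms) and that required sample size is proportional to variance; the costs and function names are hypothetical placeholders, not figures from the source.

```python
import math

def adjusted_sample_size(n_unadjusted, rho):
    """Sample size with roughly the same variance after adjustment.

    Uses n * (1 - rho^2), an approximation that assumes the same
    outcome-prediction correlation rho in both arms.
    """
    if abs(rho) >= 1.0:
        raise ValueError("rho must be strictly between -1 and 1")
    return math.ceil(n_unadjusted * (1.0 - rho**2))

def net_savings(n_unadjusted, rho, cost_per_participant, cost_per_prediction):
    """Recruitment cost saved minus the cost of generating predictions."""
    n_adj = adjusted_sample_size(n_unadjusted, rho)
    saved = (n_unadjusted - n_adj) * cost_per_participant
    spent = n_adj * cost_per_prediction
    return saved - spent

# Illustrative scenario: 10,000 participants at 2.00 each, 0.01 per prediction.
for rho in (0.0, 0.25, 0.5):
    n_adj = adjusted_sample_size(10_000, rho)
    value = net_savings(10_000, rho, 2.0, 0.01)
    print(f"rho={rho:.2f}: n={n_adj}, net savings={value:.2f}")
```

With ρ = 0 the calculation shows a small net loss equal to the prediction cost, which is one way to read the “limited costs” side of the recommendation.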

Assessment

Paper Type: theoretical
Evidence Strength: medium. The paper provides theoretical justification (a formal 'do no harm' property), simulations, and three real-world randomized applications (survey mega-study, an email marketing A/B test, and a large-scale platform experiment), which together support the claim that LLM predictions can yield modest efficiency gains; however, the empirical demonstrations show modest and context-dependent gains and do not establish broad, domain-general effect sizes.
Methods Rigor: high. The proposed approach is grounded in standard causal inference (randomized experiments and regression adjustment) with clear theoretical guarantees, complemented by simulations and multiple empirical applications; the paper also addresses practical implementation details (transforming discrete LLM outputs, featurizing unstructured inputs) and contrasts with alternative methods.
Sample: multiple settings, namely simulation studies and three empirical randomized experiments: a survey mega-study (many survey items/participants), an email marketing A/B test (customers/recipients), and a large-scale technology platform experiment (platform users). Each application uses available structured and unstructured (text) auxiliary data to generate LLM-based prognostic scores and evaluates variance reduction in treatment effect estimates.
Themes: productivity, human_ai_collab
Identification: uses randomized assignment as the source of causal identification (standard difference-in-means); proposes inclusion of AI-generated prognostic scores as covariates in regression adjustment to reduce variance while relying on randomization for unbiasedness (the adjusted estimator reverts to the unadjusted difference-in-means when predictions are uninformative).
Generalizability:
- Effectiveness depends on availability and predictive quality of unstructured/auxiliary data (text, metadata); little benefit when such inputs are sparse.
- Empirical gains reported are context-specific (survey, marketing, platform) and may not generalize to domains with different text types, languages, or user behavior.
- Approach presumes randomized assignment; results do not directly extend to purely observational causal settings without additional identification assumptions.
- Privacy, proprietary data, and reproducibility concerns when using third-party LLMs may limit practical adoption.
- Magnitude of efficiency gains depends on LLM accuracy and model updates; changing LLMs or prompt engineering could alter performance.

Claims (10)

Claim	Topic	Direction	Confidence	Outcome	Details
Generative AI and large language models can produce realistic predictions of human behavior from rich, unstructured inputs with little to no task-specific training data.	Research Productivity	positive	high	accuracy/realism of AI-generated predictions of human behavior	0.12
AI-generated predictions can be used to reduce variance in randomized experiments by including them as covariates in regression adjustment.	Research Productivity	positive	high	variance of the randomized experiment estimator / statistical efficiency	0.12
No new estimators are required: researchers can simply include AI predictions as covariates in standard regression adjustment, analogous to adjusting for a prognostic score.	Research Productivity	positive	high	feasibility of using standard regression adjustment (methodological correctness)	0.2
Including AI predictions as covariates has a 'do no harm' property: the adjusted estimator reverts to the unadjusted difference in means when predictions are uninformative.	Research Productivity	null_result	high	bias/consistency and non-worsening of estimator when predictions uninformative	0.2
Other methods, such as variants of prediction-powered inference, do not have the 'do no harm' guarantee.	Research Productivity	negative	high	presence or absence of guarantee that adjustment does not worsen estimator when predictions uninformative	0.12
The paper provides implementation guidance, including how to obtain continuous scores from discrete LLM outputs and how to use LLMs to featurize unstructured inputs as auxiliary covariates.	Research Productivity	positive	high	availability of practical implementation procedures	0.06
The ideas are demonstrated in simulations and three empirical applications: a survey mega-study, an email marketing A/B test, and a large-scale technology platform experiment.	Research Productivity	positive	high	empirical performance of the adjustment approach across simulated and real experiments	0.12
Overall, efficiency gains from adjusting for AI-generated predictions are real but modest, with greater benefits in studies that contain substantial text and other unstructured data.	Research Productivity	positive	high	magnitude of efficiency gains (reduction in estimator variance)	0.12
The 'do no harm' property is confirmed empirically.	Research Productivity	null_result	high	empirical verification that adjusted estimator does not worsen performance when predictions uninformative	0.12
Given modest efficiency gains and limited costs, the authors recommend adjusting for AI-generated predictions as a regular empirical practice.	Research Productivity	positive	high	recommended empirical practice adoption (guidance)	0.02

Adding LLM-generated predictions to regression adjustment modestly sharpens randomized-experiment estimates without introducing bias or increasing asymptotic variance, with the biggest gains when rich text or other unstructured data are available.