Adding LLM-generated predictions to regression adjustment modestly sharpens randomized-experiment estimates and will not worsen unbiased estimates, with the biggest gains when rich text or unstructured data are available.
Generative AI and large language models can produce realistic predictions of human behavior from rich, unstructured inputs with little to no task-specific training data. Recent work uses these “digital twin” predictions to supplement human responses in surveys and experiments. We study the special case of using AI-generated predictions to reduce variance in randomized experiments. We argue that doing so requires no new estimators and that researchers can simply include AI predictions as covariates in standard regression adjustment, analogous to adjusting for a prognostic score. A benefit of this approach is a “do no harm” property whereby the adjusted estimator reverts to the unadjusted difference in means when predictions are uninformative. Other methods, such as variants of prediction-powered inference, do not have this guarantee. We provide implementation guidance, including how to obtain continuous scores from discrete LLM outputs and how to use LLMs to featurize unstructured inputs as auxiliary covariates. We demonstrate these ideas in simulations and three empirical applications: a survey mega-study, an email marketing A/B test, and a large-scale technology platform experiment. Overall, efficiency gains are real if modest, with greater benefits in studies that contain substantial text and other unstructured data. We also confirm the do no harm property empirically. Given these gains and limited costs, we recommend adjusting for AI-generated predictions as a regular empirical practice.
Summary
Main Finding
Including AI-generated predictions (e.g., LLM “digital twin” outputs or other black-box model forecasts) as covariates in standard regression adjustment reduces variance in randomized experiments whenever the predictions are informative, and carries a practical “do no harm” guarantee: when predictions are uninformative, the adjusted estimator reverts to the unadjusted difference in means. No new estimators are required. Calibrated regression adjustment (equivalently, adjustment for an estimated prognostic score or a linear-calibrated form of PPI/AIPW) gives unbiased (or asymptotically unbiased) treatment effect estimates with weakly smaller asymptotic variance.
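For concreteness, the calibrated estimator described above can be written out as follows. This is a sketch in notation chosen here for readability and is not necessarily the paper's own: $f_i(z)$ stands for the AI prediction that this summary writes as Æ_i(z), $\bar{Y}_z$ and $\bar{f}_z(z)$ are means over the units assigned to arm $z$, and $\bar{f}(z)$ is the mean of the predictions over all units in the experiment.

```latex
\hat{\mu}_z = \bar{Y}_z - \hat{\beta}_z \left( \bar{f}_z(z) - \bar{f}(z) \right),
\qquad
\hat{\beta}_z =
  \frac{\sum_{i:\, Z_i = z} \left( Y_i - \bar{Y}_z \right)\left( f_i(z) - \bar{f}_z(z) \right)}
       {\sum_{i:\, Z_i = z} \left( f_i(z) - \bar{f}_z(z) \right)^2},
\qquad
\hat{\tau} = \hat{\mu}_1 - \hat{\mu}_0 .
```

When the predictions carry no information about the outcome, the within-arm slope $\hat{\beta}_z$ is close to zero, so $\hat{\mu}_z$ is close to $\bar{Y}_z$ and $\hat{\tau}$ falls back to the unadjusted difference in means. That is the intuition behind the "do no harm" property.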
Key Points
- Core idea: treat AI predictions Æ_i(z) as auxiliary covariates and include them in regression adjustment. This is analogous to adjusting for an estimated prognostic score.
- Do no harm: calibrated regression adjustment never increases asymptotic variance. If corr(Y(z), Æ(z)) = 0 the design effect equals 1, so you recover the unadjusted estimator.
- Simple implementation: run OLS of outcome on treatment indicator and mean-centered predictions (optionally include treatment × prediction interaction). Use heteroskedasticity-robust SEs. The interacted specification (Lin-style) removes small O(1/n) finite-sample bias.
- Efficiency gain formula (intuition): design effect in arm z ≈ 1 − ρ(z,z)^2, where ρ is the correlation between the outcome and the AI prediction in that arm. Gains depend directly on prediction quality.
- Comparisons:
  - Model-assisted / uncalibrated PPI: can be unbiased but can inflate variance unless predictions are strong (the outcome-prediction correlation must clear a threshold, e.g., ρ > 0.5 in some cases).
  - Digital-twin / imputation estimator (imputing missing potential outcomes with AI predictions): biased unless the AI predictions are correct on average, which makes it risky in practice.
  - Regression adjustment is equivalent (in randomized experiments) to a calibrated AIPW / prognostic-score ANCOVA and to certain CUPED/CALM calibrated variants.
- Practical tips for discrete outcomes: use continuous predicted probabilities (extract log-probabilities from token distributions, average across multiple stochastic draws) rather than discretized outputs.
- LLMs are most valuable when predictive signal lies in unstructured or textual inputs; with purely structured tabular features, traditional methods (ridge, XGBoost) can match or beat LLM predictions. Combining LLM-derived features and tabular covariates often performs best.
- Empirical results: simulations and three applications (survey mega-study with digital twins, email marketing A/B test, large-scale platform experiment) show modest but real efficiency gains; gains larger when unstructured text is present. The do-no-harm property holds empirically.
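
The "simple implementation" bullet above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the data are simulated, each unit has a single prediction that does not depend on the arm, the function names `lin_adjusted_ate` and `diff_in_means` are hypothetical, and HC2 is used as one common choice of heteroskedasticity-robust standard error (the summary does not say which variant the paper uses).

```python
import numpy as np

def lin_adjusted_ate(y, z, pred):
    """OLS of y on [1, z, c, z*c], where c is the mean-centered prediction.

    Returns the coefficient on z and its HC2 robust standard error.
    Assumes pred is not constant (otherwise the design matrix is singular).
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    c = np.asarray(pred, dtype=float)
    c = c - c.mean()                          # center on the full-sample mean
    X = np.column_stack([np.ones_like(y), z, c, z * c])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage h_i = x_i' (X'X)^{-1} x_i, used by the HC2 correction.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    meat = X.T @ (X * (resid**2 / (1.0 - h))[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return beta[1], np.sqrt(cov[1, 1])

def diff_in_means(y, z):
    """Unadjusted estimate with the usual two-sample (Neyman) standard error."""
    y1, y0 = y[z == 1], y[z == 0]
    est = y1.mean() - y0.mean()
    se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return est, se

# Simulated experiment (all values are assumptions made for illustration).
rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)
y0 = signal + rng.normal(size=n)                   # control potential outcome
z = (rng.permutation(n) < n // 2).astype(float)    # complete randomization
y = y0 + 0.5 * z                                   # constant effect of 0.5

informative = signal + 0.5 * rng.normal(size=n)    # prediction correlated with y
noise_only = rng.normal(size=n)                    # uninformative prediction

est_raw, se_raw = diff_in_means(y, z)
est_inf, se_inf = lin_adjusted_ate(y, z, informative)
est_noise, se_noise = lin_adjusted_ate(y, z, noise_only)

print(f"unadjusted:             {est_raw:.3f} (SE {se_raw:.3f})")
print(f"informative prediction: {est_inf:.3f} (SE {se_inf:.3f})")
print(f"noise prediction:       {est_noise:.3f} (SE {se_noise:.3f})")
```

In this setup the informative prediction should shrink the standard error noticeably, while the pure-noise prediction should leave it essentially unchanged, which is the "do no harm" behavior described in the key points.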
Data & Methods
- Framework: a finite-population randomized experiment targeting the sample average treatment effect (SATE). Predictions Æ_i(z) are allowed for both potential outcomes and may depend on z. The only source of randomness is the treatment assignment Z.
- Estimators discussed:
  - Unadjusted difference-in-means (baseline).
  - Model-assisted bias-corrected estimator (control-variates / PPI style): unbiased but can inflate variance when predictions are weak.
  - Calibrated regression adjustment: estimate within-arm slopes β̂_z (OLS of Y on Æ(z) within arm z) or fit a single OLS Y ~ 1 + Z + Æ(Z) (+ Zׯ(Z)). This yields the calibrated estimator with guaranteed weak variance reduction.
  - Digital-twin (imputation) estimator: imputes each missing potential outcome with the AI prediction; biased unless Æ is correct on average.
- Equivalences: calibrated regression = AIPW with Æ as outcome model = ANCOVA on estimated prognostic score.
- Inference: use heteroskedasticity-robust SEs; Lin-style interacted regression reduces finite-sample bias.
- Handling discrete/counted outcomes: calibrate continuous predicted probabilities; obtain probabilities via token log-probs, or average multiple model samples to get smoother scores; prefer continuous calibration over thresholding.
- Feature engineering: LLMs can featurize unstructured inputs (summaries, embeddings, categorical labels) to augment traditional covariates; include these features in the regression adjustment.
- Empirical evaluation: synthetic simulations plus three real-world applications (survey mega-study with digital twins, an email A/B test with tabular covariates, and a large-scale platform experiment). Across these, regression adjustment with AI predictions produced modest efficiency gains and preserved unbiasedness.
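
The guidance above on discrete outcomes can be illustrated with a small helper that turns label log-probabilities into a continuous score. This is a minimal sketch under assumptions: the log-probabilities are taken as an already-extracted dictionary (no particular LLM API is implied), and the function names `probs_from_logprobs`, `expected_score`, and `averaged_score` are hypothetical.

```python
import math

def probs_from_logprobs(label_logprobs):
    """Renormalize log-probabilities over the candidate labels (a softmax)."""
    m = max(label_logprobs.values())          # subtract the max for stability
    weights = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def expected_score(label_logprobs, label_values):
    """Continuous score: probability-weighted average of the label values."""
    probs = probs_from_logprobs(label_logprobs)
    return sum(probs[k] * label_values[k] for k in probs)

def averaged_score(draws, label_values):
    """Fallback when log-probs are unavailable: average over sampled answers."""
    return sum(label_values[d] for d in draws) / len(draws)

# Hypothetical inputs: log-probs of the tokens "Yes" and "No" for one unit.
# The remaining probability mass sits on other tokens and is renormalized away.
logprobs = {"Yes": math.log(0.6), "No": math.log(0.2)}
values = {"Yes": 1.0, "No": 0.0}

score_from_logprobs = expected_score(logprobs, values)
score_from_draws = averaged_score(["Yes", "No", "Yes", "Yes"], values)
print(score_from_logprobs, score_from_draws)
```

Either score can then enter the regression adjustment as a covariate. A thresholded "Yes"/"No" label would discard the gradation that the continuous score keeps.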
Implications for AI Economics
- Practical recommendation for experimenters: routinely include AI-generated predictions (and LLM-derived features for text) as covariates in regression adjustment for randomized experiments. Low implementation cost, minimal risk (do-no-harm property), and measurable efficiency gains justify adoption as standard practice.
- Efficiency & cost tradeoffs: modest variance reductions translate into smaller required sample sizes or higher power for the same sample budget. This is economically valuable for firms/researchers running many A/B tests or costly field trials — especially when experiments involve textual or other unstructured data that LLMs can exploit.
- When to expect large gains: settings where treatment-relevant signal is contained in rich unstructured inputs (messages, open-ended survey text, user-generated content). In fully structured tabular domains, gains may be limited and traditional ML can suffice.
- Design and policy applications: improved precision enables more granular heterogeneity analysis, faster iteration for product design, and more efficient allocation of testing budgets. LLM featurization can also facilitate stratification and more informed adaptive designs.
- Cautions and limits:
  - Prediction quality matters: poor predictions yield limited gains; uncalibrated uses (digital twins as substitutes, uncalibrated PPI) risk bias or inflated variance.
  - Operational costs and stability: generating large-scale LLM predictions incurs compute costs and introduces dependencies on model versions (drift, reproducibility). These economic costs should be weighed against sample-size savings.
  - Not a substitute for real data: while AI helps efficiency, substituting synthetic respondents for human data without correction risks bias; the recommended practice is hybrid adjustment, not replacement.
- Research directions for AI economics: quantify the return-on-investment of model-assisted adjustment across domains (compute cost vs. sample-size savings), study model update / drift effects on long-running test portfolios, and integrate AI-featurization into optimal experimental design and stratification policies.
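
The trade-off between sample-size savings and prediction costs discussed above can be made concrete with a back-of-the-envelope calculation. This is a rough sketch, assuming the approximate design effect 1 − ρ² from the key points applies equally in both arms, so that the required sample size scales by the same factor. All function names and cost figures are hypothetical.

```python
import math

def design_effect(rho):
    """Approximate variance ratio (adjusted / unadjusted) for correlation rho."""
    return 1.0 - rho**2

def adjusted_sample_size(n_unadjusted, rho):
    """Sample size giving roughly the same precision after adjustment."""
    return math.ceil(n_unadjusted * design_effect(rho))

def net_saving(n_unadjusted, rho, cost_per_unit, cost_per_prediction):
    """Recruitment cost avoided minus the cost of generating predictions."""
    n_adj = adjusted_sample_size(n_unadjusted, rho)
    return (n_unadjusted - n_adj) * cost_per_unit - n_adj * cost_per_prediction

# Hypothetical scenario: 10,000 units needed without adjustment, each costing
# 2.00 to recruit, and 0.01 per AI prediction.
for rho in (0.0, 0.2, 0.5):
    n_adj = adjusted_sample_size(10_000, rho)
    saving = net_saving(10_000, rho, cost_per_unit=2.00, cost_per_prediction=0.01)
    print(f"rho={rho:.1f}: n={n_adj}, net saving={saving:.2f}")
```

With ρ = 0 the sample size is unchanged and the prediction cost is a pure loss, which is why prediction quality determines whether the adjustment pays for itself.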
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Generative AI and large language models can produce realistic predictions of human behavior from rich, unstructured inputs with little to no task-specific training data. (Research Productivity) | positive | high | accuracy/realism of AI-generated predictions of human behavior | 0.12 |
| AI-generated predictions can be used to reduce variance in randomized experiments by including them as covariates in regression adjustment. (Research Productivity) | positive | high | variance of the randomized experiment estimator / statistical efficiency | 0.12 |
| No new estimators are required: researchers can simply include AI predictions as covariates in standard regression adjustment, analogous to adjusting for a prognostic score. (Research Productivity) | positive | high | feasibility of using standard regression adjustment (methodological correctness) | 0.2 |
| Including AI predictions as covariates has a 'do no harm' property: the adjusted estimator reverts to the unadjusted difference in means when predictions are uninformative. (Research Productivity) | null_result | high | bias/consistency and non-worsening of estimator when predictions uninformative | 0.2 |
| Other methods, such as variants of prediction-powered inference, do not have the 'do no harm' guarantee. (Research Productivity) | negative | high | presence or absence of guarantee that adjustment does not worsen estimator when predictions uninformative | 0.12 |
| The paper provides implementation guidance, including how to obtain continuous scores from discrete LLM outputs and how to use LLMs to featurize unstructured inputs as auxiliary covariates. (Research Productivity) | positive | high | availability of practical implementation procedures | 0.06 |
| The ideas are demonstrated in simulations and three empirical applications: a survey mega-study, an email marketing A/B test, and a large-scale technology platform experiment. (Research Productivity) | positive | high | empirical performance of the adjustment approach across simulated and real experiments | 0.12 |
| Overall, efficiency gains from adjusting for AI-generated predictions are real but modest, with greater benefits in studies that contain substantial text and other unstructured data. (Research Productivity) | positive | high | magnitude of efficiency gains (reduction in estimator variance) | 0.12 |
| The 'do no harm' property is confirmed empirically. (Research Productivity) | null_result | high | empirical verification that adjusted estimator does not worsen performance when predictions uninformative | 0.12 |
| Given modest efficiency gains and limited costs, the authors recommend adjusting for AI-generated predictions as a regular empirical practice. (Research Productivity) | positive | high | recommended empirical practice adoption (guidance) | 0.02 |