Synthetic data can support valid scientific inference when researchers can identify historical tasks that are exchangeable with the current task; the paper formalizes 'task exchangeability', gives provable correction methods, and demonstrates them on survey silicon samples and LLM autoraters.
There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.
Summary
Main Finding
The paper introduces a principled framework, “inference via task exchangeability”, that gives provably valid confidence intervals when researchers must rely on wholly synthetic datasets (e.g., LLM-generated “silicon samples”, autorater outputs, or generative-model biological samples). The core idea is to use historical tasks for which both real and synthetic data are available to learn the distribution of the real–synthetic estimation gap, and then to widen the naive synthetic-data confidence interval for a new target task by a data-driven bound on that gap. Under a formal task-exchangeability assumption, the resulting interval attains guaranteed coverage: P(θ* ∈ CI) ≥ 1 − (α1 + α2 + α3). The paper also gives extensions that handle graded departures from exchangeability, multidimensional targets, finite-sample targets, and partial availability of real data in the target task.
Key Points
Problem setup
- Target: infer a functional θ(P) (mean, median, regression coefficient, etc.) for a task T* where no real data is available.
- Available: a synthetic generator G (e.g., an LLM) that can produce synthetic samples ˜S ∼ ˜P = G(T*).
- Historical tasks T1,...,TT each have both real data Sj ∼ Pj and synthetic samples ˜Sj ∼ ˜Pj = G(Tj).
Task exchangeability (core assumption)
- The sequence of task–dataset pairs (T1,S1),...,(TT+1,ST+1) (with TT+1 = T*) is exchangeable: its joint distribution is unchanged under any permutation of the pairs. Drawing the pairs i.i.d. from a meta-distribution over tasks is a sufficient special case.
- Intuition: the distribution of the real–synthetic gaps across historical tasks is representative of the gap on the new task up to exchangeability.
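The rank argument behind this intuition can be checked numerically. The following is a minimal simulation sketch, not taken from the paper: it assumes gaps drawn i.i.d. from a standard normal (one special case of exchangeability), and the names `rank_coverage`, `T`, and `k` are hypothetical. Under exchangeability, the target gap falls at or below the k-th smallest historical gap with probability k/(T+1).

```python
import random


def rank_coverage(T, k, trials, seed=0):
    """Fraction of trials in which the target gap is <= the k-th smallest
    of T historical gaps, with all T + 1 gaps drawn i.i.d. (an assumption)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        gaps = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]
        historical, target = gaps[:T], gaps[T]
        kth_smallest = sorted(historical)[k - 1]
        hits += target <= kth_smallest
    return hits / trials


if __name__ == "__main__":
    # Theory predicts k / (T + 1) = 8 / 10 for T = 9, k = 8.
    print(rank_coverage(T=9, k=8, trials=20000))
```

The simulated frequency should sit close to k/(T+1); if the historical gaps were systematically smaller than the target gap, it would fall below that value, which is the failure mode the exchangeability assumption rules out.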
Main method (Algorithm sketch)
- From synthetic data for the target task, compute a standard synthetic-data CI (e.g., via CLT or bootstrap) for the synthetic estimand ˜θ = θ*(˜P): [˜L,˜U], with nominal 1−α1 coverage for ˜θ.
- For each historical task j compute a CI for the real–synthetic gap ∆j = θj(Pj) − θj(˜Pj) using Sj and ˜Sj: [ˆ∆Lj, ˆ∆Uj] at level 1−α2.
- Take empirical order-statistic bounds of the T observed interval endpoints to form [ˆ∆L, ˆ∆U] that (by exchangeability and finite-sample correction) cover the current task gap with probability ≥ 1−α2−α3.
- Output the expanded interval CI = [˜L + ˆ∆L, ˜U + ˆ∆U]. By a union bound, coverage is ≥ 1 − (α1 + α2 + α3).
Formal guarantee
- Theorem (informal): Under task exchangeability and access to valid CI constructions for single-distribution targets and distribution differences, Algorithm 1 yields P(θ* ∈ CI) ≥ 1 − (α1 + α2 + α3).
Practical remarks
- Because synthetic data is usually cheap to generate, α1 can be made very small (by generating a large synthetic sample of size N), so most of the error budget can be spent on calibrating bias via α2 and α3.
- A natural point estimator is bias-corrected synthetic estimate: ˆθ = θ*(˜S) + (1/T)∑_j (θj(Sj) − θj(˜Sj)), which is unbiased (for the finite-sample target) under exchangeability.
- The method is connected to prediction-powered inference, conformal prediction (exchangeability → uniform rank), weighted conformal variants, and empirical Bayes ideas.
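The bias-corrected point estimator in the remarks above can be sketched as follows. This is an illustrative sketch only: it assumes every task uses the same estimand (the mean by default), and the function name `bias_corrected_estimate` and the `history` layout (a list of `(real, synth)` sample pairs) are hypothetical.

```python
from statistics import fmean


def bias_corrected_estimate(target_synth, history, estimand=fmean):
    """Synthetic estimate on the target task plus the average
    real-minus-synthetic gap observed on the historical tasks."""
    synth_estimate = estimand(target_synth)
    gaps = [estimand(real) - estimand(synth) for real, synth in history]
    return synth_estimate + fmean(gaps)


if __name__ == "__main__":
    history = [([2.0, 2.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
    print(bias_corrected_estimate([0.0, 1.0], history))
```

In the usage example the synthetic mean is 0.5 and the historical gaps are 1 and 2, so the correction shifts the estimate by their average.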
Extensions
- Weighted procedure: historical tasks can be reweighted by relevance; departures from exchangeability are handled via explicit total-variation penalties in coverage bounds (graceful degradation).
- Finite-sample target inference (infer θ(S) rather than population θ(P*)) has simpler guarantees and less conservatism.
- When some real data from the target task is available, the framework adapts and connects to prior synthetic-data inference work (e.g., guardrail approaches).
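One way to picture the weighted extension is a weighted order statistic. The sketch below borrows the weighted-quantile construction from weighted conformal prediction; it is an assumption for illustration and not necessarily the paper's procedure. The names `weighted_upper_bound`, `weights`, and `target_weight` are hypothetical, and the target task's own weight is treated as mass at +∞ because its gap is unobserved.

```python
import math


def weighted_upper_bound(values, weights, target_weight, alpha):
    """Smallest historical value whose cumulative weight reaches a
    (1 - alpha) share of the total weight, counting the target task's
    weight as mass at +infinity. Returns +inf if the share is never reached."""
    total = sum(weights) + target_weight
    threshold = (1 - alpha) * total
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= threshold:
            return value
    return math.inf


if __name__ == "__main__":
    upper_endpoints = [1.0, 2.0, 3.0, 4.0]
    # Equal weights recover the unweighted order-statistic bound.
    print(weighted_upper_bound(upper_endpoints, [1, 1, 1, 1], 1, alpha=0.3))
    # Up-weighting the most relevant task pulls the bound toward its value.
    print(weighted_upper_bound(upper_endpoints, [5, 1, 1, 1], 2, alpha=0.5))
```

With equal weights this reduces to an ordinary order statistic, so the weighted version can be read as a relevance-aware generalization of the unweighted rule.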
Limitations and cautions
- Requires a set of historical tasks that are plausibly exchangeable / relevant to the new task; poor choice undermines validity or leads to conservative intervals.
- Coverage guarantee is frequentist and depends on correct application of underlying CI procedures for differences and single distributions.
- May be conservative (interval inflation) when historical tasks are only weakly informative.
- Not a cure-all for causal identification problems or external validity gaps beyond what the historical tasks can capture.
Data & Methods
Formal model
- A task is a pair T = (θ, P), where θ(P) is the estimand. For the target task T* = (θ*, P*), the synthetic generator gives ˜P = G(T*), from which a synthetic sample ˜S ∼ ˜P is drawn.
- Historical tasks Tj = (θj, Pj) have both real Sj ∼ Pj and synthetic ˜Sj ∼ ˜Pj = G(Tj).
- Real–synthetic gap per task: ∆j = θj(Pj) − θj(˜Pj). The unknown target gap is ∆T+1.
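The formal model maps naturally onto a small data structure. The sketch below is a hypothetical representation, not the paper's code: the class name `Task`, its fields, and the method `plug_in_gap` are illustrative, and the estimand is assumed to be a function applied directly to a sample.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Callable, Optional, Sequence


@dataclass
class Task:
    estimand: Callable[[Sequence[float]], float]  # theta, applied to a sample
    synth: Sequence[float]                        # synthetic sample from G(T)
    real: Optional[Sequence[float]] = None        # real sample; None for the target task

    def plug_in_gap(self) -> float:
        """Plug-in estimate of the real-minus-synthetic gap for this task."""
        if self.real is None:
            raise ValueError("real data unavailable; the gap is unobserved")
        return self.estimand(self.real) - self.estimand(self.synth)


if __name__ == "__main__":
    historical = Task(estimand=fmean, synth=[1.0, 3.0], real=[2.0, 4.0])
    print(historical.plug_in_gap())
```

Leaving `real` empty for the target task makes the asymmetry explicit: the target gap cannot be computed and has to be bounded using the historical tasks.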
Assumptions
- Assumption 1: Task exchangeability — the sequence (T1,S1),...,(TT+1,ST+1) is exchangeable.
- Assumption 2: Access to valid CI procedures:
  - CIθ,α(S): 1−α CI for θ(P) from data S.
  - ∆θ,α(S,S′): 1−α CI for θ(P) − θ(P′) from datasets S and S′ (used to bound gaps).
Algorithm details (Algorithm 1 in paper)
- Draw synthetic samples for the current task and each historical task.
- Construct synthetic-data CI [˜L,˜U] for ˜θ at level 1−α1.
- For each historical j compute gap CI [ˆ∆Lj, ˆ∆Uj] at level 1−α2.
- Choose order-statistic indices kL, kU based on α3 and T; set ˆ∆L = ˆ∆L(kL), ˆ∆U = ˆ∆U(kU).
- Return CI = [˜L + ˆ∆L, ˜U + ˆ∆U].
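The steps above can be sketched end to end for the special case where every estimand is a mean. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses CLT intervals for both CI procedures, and the order-statistic indices follow a conformal-style rule (kL = ⌊(α3/2)(T+1)⌋, kU = ⌈(1−α3/2)(T+1)⌉) that is assumed here because the summary does not state the paper's exact choice. All function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def mean_ci(sample, alpha):
    """CLT interval for a mean at nominal level 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * math.sqrt(variance(sample) / len(sample))
    center = fmean(sample)
    return center - half, center + half


def gap_ci(real, synth, alpha):
    """CLT interval for mean(real) - mean(synth), independent samples."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt(variance(real) / len(real) + variance(synth) / len(synth))
    diff = fmean(real) - fmean(synth)
    return diff - z * se, diff + z * se


def order_stat_bounds(lowers, uppers, alpha3):
    """Order statistics of the historical gap-CI endpoints.
    Falls back to an infinite bound when T is too small for alpha3."""
    T = len(lowers)
    k_lo = math.floor((alpha3 / 2) * (T + 1))      # assumed index rule
    k_hi = math.ceil((1 - alpha3 / 2) * (T + 1))   # assumed index rule
    lo = sorted(lowers)[k_lo - 1] if k_lo >= 1 else -math.inf
    hi = sorted(uppers)[k_hi - 1] if k_hi <= T else math.inf
    return lo, hi


def exchangeable_ci(target_synth, history, alpha1, alpha2, alpha3):
    """Synthetic-data CI for the target, shifted by gap bounds learned
    from historical (real, synth) sample pairs."""
    L, U = mean_ci(target_synth, alpha1)
    gap_cis = [gap_ci(real, synth, alpha2) for real, synth in history]
    d_lo, d_hi = order_stat_bounds(
        [lo for lo, _ in gap_cis], [hi for _, hi in gap_cis], alpha3
    )
    return L + d_lo, U + d_hi


if __name__ == "__main__":
    synth = [0.0, 1.0] * 50
    history = [([1.0, 2.0] * 50, [0.0, 1.0] * 50) for _ in range(10)]
    print(exchangeable_ci(synth, history, 0.01, 0.05, 0.5))
```

When T is small relative to 1/α3, the assumed index rule returns infinite bounds, which makes the resulting interval uninformative rather than invalid.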
Finite-sample correction and probability arguments
- Exchangeability ensures the unobserved current-task gap interval has uniform rank among the T+1 intervals, which motivates using empirical quantiles with finite-sample correction controlled by α3.
- Union bounds combine coverage contributions from the synthetic CI and the gap CI quantile event to give total error ≤ α1 + α2 + α3.
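The union-bound argument can be written out explicitly. The derivation below is a reconstruction from the description above rather than the paper's proof; the events $E_1, E_2, E_3$ are labels introduced here, and $[\hat\Delta^L_{T+1}, \hat\Delta^U_{T+1}]$ denotes the hypothetical gap interval that would be computed for the target task if its real data were observed.

```latex
\begin{align*}
E_1 &= \{\tilde\theta \in [\tilde L, \tilde U]\},
  & \Pr(E_1^c) &\le \alpha_1,\\
E_2 &= \{\Delta_{T+1} \in [\hat\Delta^L_{T+1}, \hat\Delta^U_{T+1}]\},
  & \Pr(E_2^c) &\le \alpha_2,\\
E_3 &= \{\hat\Delta^L \le \hat\Delta^L_{T+1}
        \ \text{and}\ \hat\Delta^U_{T+1} \le \hat\Delta^U\},
  & \Pr(E_3^c) &\le \alpha_3.
\end{align*}
On $E_1 \cap E_2 \cap E_3$, since $\theta^* = \tilde\theta + \Delta_{T+1}$,
\[
\tilde L + \hat\Delta^L \;\le\; \theta^* \;\le\; \tilde U + \hat\Delta^U ,
\]
and therefore
\[
\Pr(\theta^* \notin \mathrm{CI})
  \;\le\; \Pr(E_1^c) + \Pr(E_2^c) + \Pr(E_3^c)
  \;\le\; \alpha_1 + \alpha_2 + \alpha_3 .
\]
```

The bound on $E_3$ is where exchangeability enters: it is what allows the rank of the unobserved target interval among the $T+1$ intervals to be treated as uniform.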
Empirical demonstrations
- Example: ANES feeling-thermometer scores with GPT-generated silicon samples. Naive synthetic-only CIs were too narrow and misleading, whereas task-exchangeability intervals were wider and achieved the target coverage. This illustrates the practical need for debiasing via historical tasks.
- Mentioned applications include LLM-generated “silicon samples” for surveys, LLM-as-judge AI evaluations, and synthetic protein structures in proteomics.
Implications for AI Economics
Enabling cheaper, faster exploratory and pilot studies
- AI economists often face costly surveys or hard-to-collect outcome data. The method allows principled use of synthetic respondents or imputed outcomes in early-stage studies while still obtaining valid uncertainty quantification—provided suitable historical task pairs exist.
Rigorous calibration of AI-generated measurements
- When using LLMs to simulate respondents, rate outputs, or act as evaluators (“autoraters”), researchers can quantify and correct for systematic biases learned from prior, related tasks rather than naively treating synthetic outputs as ground truth.
Policy and program evaluation
- For policy-relevant estimands (means, elasticities, heterogeneous effects), the framework can help bound uncertainty when synthetic data substitute for scarce real data; but economists should carefully curate historical tasks that reflect policy contexts to justify exchangeability.
Design recommendations for AI-economics research pipelines
- Maintain and publish repositories of paired real & synthetic datasets across multiple related tasks (e.g., many survey questions, demographic subgroups, evaluation benchmarks). Such repositories are exactly what the method needs to learn realistic gap distributions and produce trustworthy inference.
- Allocate error budget sensibly: make α1 very small (generate many synthetic samples) and use the remaining budget to control gap estimation (α2, α3).
- Use relevance-weighting (and sensitivity analyses) when historical tasks differ in obvious ways from the target—apply the weighted extension and report how coverage degrades with plausible departures from exchangeability.
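The error-budget advice can be made concrete with two back-of-the-envelope helpers. Both are illustrative assumptions rather than results from the paper: `synth_half_width` uses a CLT interval for a mean with a known standard deviation `sigma`, and `min_tasks_for_finite_bounds` assumes a conformal-style order-statistic rule in which the gap bounds are finite only when (α3/2)(T+1) ≥ 1.

```python
import math
from statistics import NormalDist


def synth_half_width(sigma, n, alpha1):
    """Half-width of a CLT interval for a mean from n synthetic draws."""
    z = NormalDist().inv_cdf(1 - alpha1 / 2)
    return z * sigma / math.sqrt(n)


def min_tasks_for_finite_bounds(alpha3):
    """Smallest T with (alpha3 / 2) * (T + 1) >= 1 (assumed index rule)."""
    return math.ceil(2 / alpha3) - 1


if __name__ == "__main__":
    # Shrinking alpha1 widens the synthetic CI only slowly; more draws offset it.
    for n in (100, 10_000, 1_000_000):
        print(n, synth_half_width(sigma=1.0, n=n, alpha1=0.001))
    print(min_tasks_for_finite_bounds(0.25))
```

Under this assumed rule the number of historical tasks, not the number of synthetic draws, is the binding constraint: a budget of α3 = 0.05 already calls for roughly 39 paired tasks, which is one reason shared repositories of task pairs matter.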
Limits for causal inference & external validity
- The approach does not replace the need for causal identification; synthetic data cannot invent unobserved confounder information. For causal parameters, one must ensure the historical tasks capture the same bias patterns in synthetic generation as the causal target, or obtain at least partial real data from the target task and use the adapted methods discussed in the paper.
Research agenda and policy implications
- Incentivize creation of standard benchmark collections of task pairs (real, synthetic) so applied economists can reliably use synthetic agents for rapid hypothesis testing and large-scale counterfactual exploration without sacrificing inferential validity.
- Encourage reporting of the set of historical tasks, weighting schemes, and sensitivity to exchangeability assumptions when publishing results that rely on synthetic data.
Summary takeaway: Task exchangeability provides a clear, implementable route to get valid frequentist inference when relying on synthetic data—by learning how synthetic outputs typically differ from real ones across related tasks and inflating synthetic-only intervals accordingly. For AI economics, this gives a formal tool to harness LLMs and other generative models for scalable empirical work, provided careful construction and documentation of relevant historical task pairs and explicit robustness checks for exchangeability.
Assessment
Claims (6)
| Claim | Category | Direction | Outcome | Confidence & Evidence |
|---|---|---|---|---|
| Synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. | Research Productivity | positive | rate of scientific discovery / research throughput (ability to run more studies and ask more questions) | Reading fidelity: high; Study strength: speculative |
| Synthetic data can be biased, noisy, and misspecified. | Output Quality | negative | quality/validity of synthetic data (bias, noise, misspecification) | Reading fidelity: high; Study strength: medium |
| We propose statistical principles for using synthetic data in scientific research with provable validity guarantees. | Decision Quality | positive | validity of inference when using synthetic data | Reading fidelity: high; Study strength: high |
| We introduce a new technical condition called task exchangeability: the researcher can identify historical tasks with real data such that the current task is exchangeable with the historical tasks in an appropriate mathematical sense. | Decision Quality | positive | applicability of statistical guarantees (valid inference) under identified condition | Reading fidelity: high; Study strength: high |
| We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. | Decision Quality | positive | valid inference (statistical validity) when using synthetic data | Reading fidelity: high; Study strength: high |
| We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters. | Adoption Rate | positive | application of proposed methods to real tasks (public opinion silicon samples; AI autorater evaluation) | Reading fidelity: high; Study strength: medium |