A new estimation framework (GAI) uses LLM outputs as auxiliary information to shrink human-label needs and tighten inference: theory guarantees weakly better efficiency than human-only estimators and applied tests show up to ~90% reductions in labeling while preserving decision accuracy; gains are largest when AI signals are predictive but the method is a safe default even with weak auxiliary data.
Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data, but introduce a new challenge: AI outputs are not direct observations of the target outcomes, but could involve high-dimensional representations with complex and unknown relationships to human labels. Conventional methods leverage AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a "safe default" property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.
Summary
Main Finding
Generative Augmented Inference (GAI) is a semiparametric framework that incorporates AI-generated outputs (predictions, embeddings, chain-of-thought text, persona features, etc.) as auxiliary features—not as surrogate labels—to improve estimation and inference for human-labeled outcomes. By embedding these auxiliary signals into a Neyman-orthogonal score (a doubly-robust/AIPW-style construction) and using cross‑fitted flexible ML to estimate nuisance functions, GAI yields √n-consistent, asymptotically normal estimators with a closed-form variance. Importantly, under random labeling GAI is a “safe default”: it weakly dominates human-data-only estimators in efficiency and strictly improves whenever the auxiliary signals are predictive. Empirically, GAI substantially reduces estimation error and human-label requirements across multiple applied settings (conjoint, retail pricing, health insurance) while maintaining valid confidence intervals.
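For orientation, a standard AIPW-style score consistent with this description can be written as below. This is a sketch under assumptions the summary does not state: a canonical-link GLM with mean function \mu, a label indicator R, and a constant labeling rate \pi in the special case. These symbols are introduced here for illustration only, and the paper's own notation and exact form may differ. The second display works through the simplest target, \theta = E[y], under random labeling, to show where the "safe default" gain comes from.

```latex
% Assumed notation: R = 1 if the unit has a human label, 0 otherwise.
% Orthogonal (AIPW-style) score for a canonical-link GLM:
\[
\psi(\theta) \;=\; x\,\bigl(g(x,z) - \mu(x^\top\theta)\bigr)
  \;+\; \frac{R}{e(x,z)}\, x\,\bigl(y - g(x,z)\bigr),
\qquad
\frac{1}{n}\sum_{i=1}^{n} \psi_i(\hat\theta) = 0 .
\]

% Special case: \theta = E[y], random labeling with e(x,z) = \pi.
% Using Var(y) = Var(g(x,z)) + E[Var(y | x,z)]:
\[
n\,\mathrm{Var}\bigl(\hat\theta_{\text{human-only}}\bigr) \to
  \frac{\mathrm{Var}(g(x,z)) + \mathbb{E}[\mathrm{Var}(y \mid x,z)]}{\pi},
\qquad
n\,\mathrm{Var}\bigl(\hat\theta_{\text{GAI}}\bigr) \to
  \mathrm{Var}(g(x,z)) + \frac{\mathbb{E}[\mathrm{Var}(y \mid x,z)]}{\pi}.
\]

% The gap is nonnegative, and strictly positive when z (with x) predicts y and \pi < 1:
\[
n\,\Bigl[\mathrm{Var}\bigl(\hat\theta_{\text{human-only}}\bigr)
       - \mathrm{Var}\bigl(\hat\theta_{\text{GAI}}\bigr)\Bigr]
\to \Bigl(\tfrac{1}{\pi} - 1\Bigr)\,\mathrm{Var}(g(x,z)) \;\ge\; 0 .
\]
```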
Key Points
- Conceptual shift: treat AI outputs as informative covariates/features (z) rather than noisy substitutes for the target outcome (y). This allows use of structured, high-dimensional, or unstructured AI outputs (embeddings, text, reasoning traces).
- Estimator construction:
  - Estimate two nuisance functions: (i) the outcome predictor g(x,z) = E[y | X=x, Z=z], fit on the labeled subset, and (ii) the labeling propensity e(x,z) = P(label observed | X=x, Z=z), which is fit on all units because it models which units receive labels.
  - Plug these into a Neyman-orthogonal moment (AIPW-like) and solve for the target GLM parameters, using cross-fitting to control overfitting.
- Statistical guarantees:
  - Asymptotic normality and √n-consistency (Theorem 1).
  - Closed-form variance; inference is valid under standard double-machine-learning rate conditions for the nuisance estimators.
  - “Safe default” property (Corollary 1): if labeling is random (propensity independent of X,Z), GAI never worsens asymptotic efficiency relative to using only human-labeled data, and strictly improves when Z contains predictive information.
  - Variance decomposition (Corollary 2): efficiency gains can come from (i) effective sample expansion (using unlabeled units via Z), (ii) representational power (Z helps approximate conditional expectations even if it is a deterministic transform of X), and (iii) extra predictive signal in Z beyond X (digital-twin/persona information).
- Relationship to prior work:
  - Differs from Prediction-Powered Inference (PPI): PPI treats AI outputs as surrogate labels and requires them to approximate y; GAI does not. PPI can fail or inflate variance when AI outputs are not surrogates or are stochastic; GAI remains valid.
  - Shares algebraic form with AIPW/doubly robust estimators but repurposes terms to leverage AI-generated representations as augmentation rather than mere correction for missingness.
- Practical features:
  - Works with discrete labels, continuous scores, high-dimensional embeddings, and unstructured text.
  - Allows flexible ML methods for nuisance estimation (random forests, neural nets, boosting) thanks to orthogonality and cross-fitting.
  - Empirically robust in varied regimes (weak or strong AI signals; identical auxiliary inputs across methods).
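The estimator construction described in the points above can be sketched in a few lines for the simplest possible target, the mean of y. This is a minimal illustration, not the authors' implementation: it assumes a discrete AI output z, random labeling (so the propensity is just the labeled share), and cell means as the outcome model g. The function name `gai_mean` and the toy data are hypothetical.

```python
import random
from math import sqrt
from statistics import fmean, pvariance


def gai_mean(y, z, labeled, n_folds=2, seed=0):
    """Cross-fitted AIPW-style estimate of E[y] using a discrete AI feature z.

    y[i] is read only where labeled[i] is True. Labeling is ASSUMED random,
    so the propensity is the overall labeled share.
    Returns (estimate, standard_error).
    """
    n = len(z)
    e = sum(labeled) / n  # constant propensity under random labeling
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]

    psi = [0.0] * n
    for k, fold in enumerate(folds):
        # Fit g(z) = E[y | z] on labeled units OUTSIDE the current fold.
        train = [i for j, f in enumerate(folds) if j != k for i in f if labeled[i]]
        overall = fmean(y[i] for i in train)
        cells = {}
        for i in train:
            cells.setdefault(z[i], []).append(y[i])
        g = {cell: fmean(vals) for cell, vals in cells.items()}

        for i in fold:
            g_i = g.get(z[i], overall)  # fall back if this cell was unseen
            # Prediction for every unit, plus a weighted residual for labeled ones.
            correction = (y[i] - g_i) / e if labeled[i] else 0.0
            psi[i] = g_i + correction

    return fmean(psi), sqrt(pvariance(psi) / n)


# Toy data (hypothetical): z is predictive of y, and about 10% of units are labeled.
rng = random.Random(1)
n = 2000
z = [rng.choice([0, 1, 2]) for _ in range(n)]
labeled = [rng.random() < 0.1 for _ in range(n)]
y = [z_i + rng.gauss(0, 0.5) if r_i else None for z_i, r_i in zip(z, labeled)]

estimate, std_err = gai_mean(y, z, labeled)
print(f"GAI-style estimate: {estimate:.3f} (s.e. {std_err:.3f})")
```

Every unit contributes its prediction g(z), while only labeled units contribute a residual correction; this is how unlabeled observations enter the estimate. A real application would replace the cell means with a flexible learner and estimate GLM parameters rather than a mean.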
Data & Methods (empirical evidence & implementation)
- General setup: Researchers observe X and AI-generated Z for all units; human labels y are observed for a subset. Target is GLM-style parameter estimation (misspecification allowed).
- Nuisance estimation: estimate g(x,z) on the labeled data and e(x,z) from the labeling indicator on all units, using flexible ML with cross‑fitting to avoid overfitting bias.
- Orthogonal score: combine g and e into a Neyman-orthogonal moment that uses both labeled and unlabeled observations to estimate parameters and variances.
- Three applied evaluations:
  - Vaccine conjoint (most demanding):
    - Z: LLM chain-of-thought embeddings (dim 3072) and discrete LLM choice labels (−1, 0, 1).
    - LLM discrete choice accuracy ≈ 54% (near random).
    - Results: GAI reduces MAPE from 19–32% (benchmarks) to 16–17%; with 50 human labels, GAI outperforms the primary-only (human-labels-only) estimator with 200 labels (>75% reduction). Decision error falls from 6.9% to 2.6%. Embeddings give smaller MAPE and better coverage; discrete labels yield narrower CIs.
  - Retail pricing with digital twins:
    - Z: binary digital-twin purchase predictions (systematically biased: 30% predicted vs 44% actual) plus persona signal.
    - Controlled comparison in which all methods have the same Z.
    - Results: GAI reduces MAPE from 10–22% (best benchmark, PPI++) to 7–12%; with 100 labels it matches primary-only with 300 labels (≈67% reduction). GAI attains 96–100% CI coverage and near-zero decision errors.
  - Health insurance choice (US Census-based, favorable ML regime):
    - Z: ML predictions with ≈85% accuracy (the setup used by the PPI authors).
    - Results: GAI MAPE of 140–160% vs 290–980% for primary-only; with 100 labels, GAI outperforms primary-only with 1,000 labels (>90% reduction). GAI achieves 99–100% CI coverage and zero decision errors; PPI++ undercovers and has nonzero decision errors.
- Implementation notes:
  - Cross-fitting is recommended for stability and so that the guarantees of the orthogonal score hold when flexible learners are used.
  - Choose flexible learners for g and e; ensemble or model-selection heuristics can be used.
  - Theoretical nuisance-rate requirements are standard for double machine learning (DML) frameworks: the product of the estimation errors for g and e must shrink faster than n^(−1/2).
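The label-saving percentages quoted for the three evaluations follow from simple arithmetic on the label counts reported above. The helper name `label_reduction` is hypothetical. Where the summary says GAI "outperforms" rather than "matches" the human-only estimator, the computed figure is a lower bound, which is why those cases are reported as ">75%" and ">90%".

```python
def label_reduction(n_with_gai, n_human_only):
    """Fraction of human labels saved when n_with_gai labels do as well as n_human_only."""
    return 1.0 - n_with_gai / n_human_only


# (labels used with GAI, labels needed by the human-only estimator), as quoted above.
cases = {
    "vaccine conjoint": (50, 200),
    "retail pricing": (100, 300),
    "health insurance": (100, 1000),
}

for name, (n_gai, n_human) in cases.items():
    print(f"{name}: {label_reduction(n_gai, n_human):.0%}")
# vaccine conjoint: 75%
# retail pricing: 67%
# health insurance: 90%
```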
Implications for AI Economics
- Practical cost reductions: GAI makes it feasible to substantially reduce expensive human labeling in empirical economics and operations (surveys, conjoint studies, expert annotations) by leveraging cheap AI-generated representations while preserving valid inference.
- Safe integration of generative AI: Researchers and practitioners can incorporate outputs of generative models (including stochastic chain-of-thought, persona-conditioned digital twins, and embeddings) even when those outputs are biased or are not direct proxies for outcomes; GAI leverages whatever predictive content is present without requiring calibration of AI outputs into unbiased labels.
- Better use of simulated/digital-twin data: GAI formalizes a principled sim-to-real inferential approach: synthetic/generated observations need not be treated as labels—treating them as auxiliary features extracts representational and predictive power for real-data inference.
- Policy and decision-making: Reduced labeling needs (often 67–90% fewer labels in applications examined) allow faster, cheaper estimation for pricing, demand modeling, health-choice behavior, and other domains where human labels are costly—enabling more frequent updates, larger population coverage, and quicker policy analysis.
- Methodological guidance for economists:
  - When AI outputs are available for many units but human labels are scarce, favor feature-augmentation + orthogonal estimation (GAI) over plug-in surrogate approaches (PPI) unless AI outputs are demonstrably unbiased surrogates.
  - Ensure adequate modeling and cross-fitting for nuisance functions, and check sensitivity to the label selection mechanism: safe-default guarantees rely on random labeling, and nonrandom label selection requires careful propensity estimation.
- Cautions & limitations:
  - Safe-default property requires random selection into labeling (propensity independent of X,Z). If labeling is systematically selective and propensity is misspecified or poorly estimated, gains may be reduced or inference compromised.
  - GAI is developed within a GLM-target framework (though misspecification is allowed); extensions to fully nonparametric targets may require additional work.
  - Performance depends on the ability to estimate g(x,z) and e(x,z) at sufficient rates; small labeled samples may limit this in extremely high-dimensional Z without strong regularization or structural assumptions.
- Research directions: applying GAI to policy evaluation, causal inference with generated counterfactuals, dynamic pricing and experimentation where AI simulations are abundant; theoretical extensions for selection-on-observables violations or richer causal settings.
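The caution about selective labeling can be made concrete with a toy example. The sketch below is not from the paper: it assumes a discrete z, estimates the propensity as the labeled share within each z cell (using all units), and compares a naive labeled-only mean with an inverse-propensity-weighted mean. The names `cell_propensity` and `ipw_mean`, the clipping floor, and the data are all hypothetical.

```python
from collections import defaultdict


def cell_propensity(z, labeled, floor=0.05):
    """Estimate e(z) = P(labeled | Z = z) as the labeled share in each cell of z.

    Uses ALL units, labeled or not. Estimates are clipped below at `floor`
    (an assumed safeguard against extreme weights).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for z_i, r_i in zip(z, labeled):
        totals[z_i] += 1
        hits[z_i] += int(r_i)
    return {cell: max(hits[cell] / totals[cell], floor) for cell in totals}


def ipw_mean(y, z, labeled, e):
    """Inverse-propensity-weighted mean of y over the full sample."""
    n = len(z)
    return sum(y_i / e[z_i] for y_i, z_i, r_i in zip(y, z, labeled) if r_i) / n


# Toy data: labeling is selective. Half of the z = 0 units are labeled,
# but only 10% of the z = 1 units. The outcome equals z, so the true mean is 0.5.
z = [0] * 100 + [1] * 100
labeled = [i < 50 for i in range(100)] + [i < 10 for i in range(100)]
y = [float(z_i) if r_i else None for z_i, r_i in zip(z, labeled)]

e_hat = cell_propensity(z, labeled)
naive = sum(v for v in y if v is not None) / sum(labeled)
weighted = ipw_mean(y, z, labeled, e_hat)
print(f"naive labeled-only mean: {naive:.3f}")
print(f"propensity-weighted mean: {weighted:.3f}")
```

The naive mean is pulled toward the over-labeled cell, while the weighted mean recovers the population value. With high-dimensional Z, the cell shares would have to be replaced by a fitted propensity model, which is where misspecification risk enters.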
Summary: GAI provides a practical, theory-backed way to use modern generative-AI outputs as auxiliary features for efficient estimation and valid inference. It is especially valuable in AI-economics settings where human labels are costly and AI outputs are structured, high-dimensional, or biased—enabling large reductions in labeling needs and improved decision accuracy while maintaining trustworthy confidence intervals.
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| GAI uses an orthogonal moment construction that enables consistent estimation and valid inference with flexible, nonparametric relationship between LLM-generated outputs and human labels. (Research Productivity) | positive | high | consistent estimation and valid inference (statistical estimation properties) | 0.2 |
| The authors establish asymptotic normality for the GAI estimator and show a 'safe default' property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. (Research Productivity) | positive | high | estimation efficiency (asymptotic variance / efficiency relative to baseline) | 0.2 |
| Empirically, GAI outperforms benchmarks across diverse settings. (Output Quality) | positive | high | overall performance relative to benchmarks (estimation error / predictive performance) | 0.12 |
| In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. (Error Rate) | positive | high | estimation error; human labeling requirements | about 50% reduction (estimation error); over 75% reduction (labeling requirements); 0.12 |
| In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. (Firm Revenue) | positive | high | estimator performance in retail pricing (e.g., predictive or decision accuracy / estimation error) | 0.12 |
| In health insurance choice, GAI cuts labeling requirements by over 90% while maintaining decision accuracy. (Task Allocation) | positive | high | human labeling requirements; decision accuracy | over 90% reduction in labeling requirements; 0.12 |
| Across applications, GAI improves confidence interval coverage without inflating width. (Decision Quality) | positive | high | confidence interval coverage and width (statistical inference quality) | 0.12 |
| Conventional methods that use AI predictions as direct proxies for true labels can be inefficient or unreliable when the relationship between AI outputs and human labels is weak or misspecified. (Error Rate) | negative | high | efficiency/reliability of estimators using AI outputs as direct proxies | 0.06 |
| Overall, GAI provides a principled and scalable approach to integrating AI-generated information. (Organizational Efficiency) | positive | high | scalability and principled integration of AI-generated information | 0.12 |