
Large language models can unlock large-scale text-based economic research, but only if researchers guard against training-data leakage in prediction tasks and pair LLM outputs with an independent validation sample to correct measurement error in estimation tasks; otherwise they risk biased and imprecise estimates.

Large Language Models: An Applied Econometric Framework
Jens Ludwig, Sendhil Mullainathan, Ashesh Rambachan · April 06, 2026 · Annual Review of Economics
Source: OpenAlex · Paper type: theoretical · Evidence strength: n/a · Relevance: 8/10
The paper provides an econometric framework showing that valid use of LLMs for prediction requires preventing training-data leakage, while valid estimation of economic concepts requires combining LLM outputs with a small validation sample to correct measurement error and enable proper inference.

Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost. Researchers can now revisit old questions and tackle novel ones with rich data. We provide an econometric framework for realizing this potential in two empirical uses. For prediction problems—forecasting outcomes from text—valid conclusions require “no training leakage” between the LLM's training data and the researcher's sample, which can be enforced through careful model choice and research design. For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates. Absent a validation sample, researchers cannot assess possible errors in LLM outputs, and consequently seemingly innocuous choices (which model, which prompt) can produce dramatically different parameter estimates. When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.

Summary

Main Finding

The paper develops an applied-econometric framework for using large language models (LLMs) in two distinct empirical roles — prediction and estimation — and gives clear, practical requirements for valid inference. For prediction tasks, valid out-of-sample performance requires enforcing “no training leakage” (no overlap between the LLM’s training data and the researcher’s evaluation sample). For estimation tasks (using LLMs to measure concepts from text for downstream analysis), valid downstream inference requires coupling LLM outputs with a (typically small) validation sample and using debiasing corrections; without validation data, researchers cannot assess or correct LLM errors and downstream parameter estimates can vary dramatically with innocuous choices (model, prompt).

Key Points

  • Two empirical uses of LLMs:
    • Prediction: use text (via LLM predictions or embeddings) to forecast outcomes (e.g., returns).
    • Estimation: use LLMs to measure economic concepts from text (labels) for downstream causal/statistical analysis.
  • Central requirements:
    • Prediction → enforce no training leakage between the LLM’s training corpus and the researcher’s evaluation data.
    • Estimation → collect a validation (labeled) sample and debias LLM outputs; this preserves consistency and asymptotic normality for downstream estimators and can improve precision beyond using the validation set alone.
  • LLMs are treated as black boxes: define an LLM as a mapping from training dataset t to a text generator cm(·; t). This captures prompt engineering, sampling parameters, reasoning-time computation, RLHF, etc.
  • Benchmarks are of limited use for task-specific inference because LLM performance is brittle (the “jagged frontier”): small changes in inputs, formatting, or prompt can cause large, unpredictable changes in outputs. Empirical and theoretical evidence suggests LLMs often do not learn reliable, generalizable “world models.”
  • Absent validation, downstream estimates are not identified as a function of LLM outputs: seemingly minor choices (which model, which prompt) can change coefficient magnitudes and flip signs and significance in applied settings (finance, political economy).
  • Practical design choices to avoid leakage (see the sketch after this list): use models with published/fixed weights or documented training cutoffs; evaluate on documents published after the model cutoff; prefer open-source/time-stamped models when feasible.
  • The framework generalizes to other uses: hypothesis generation is a prediction problem (so needs leakage checks); simulating survey/experimental subjects is an estimation problem (so requires validation).
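
To make the no-leakage rule concrete, the sketch below shows a minimal time-based evaluation split, assuming a hypothetical registry of documented training cutoffs; the model names, cutoff dates, and document fields are illustrative, not taken from the paper.

```python
from datetime import date

# Hypothetical registry of documented training cutoffs (illustrative values only).
# Real cutoffs must come from model documentation and are often unavailable
# for closed commercial models.
TRAINING_CUTOFFS = {
    "open-model-2023": date(2023, 3, 1),
    "open-model-2024": date(2024, 6, 1),
}

def leakage_safe_sample(documents, model_name):
    """Keep only documents published strictly after the model's training
    cutoff, so the evaluation sample cannot overlap the training corpus."""
    cutoff = TRAINING_CUTOFFS[model_name]
    return [doc for doc in documents if doc["published"] > cutoff]

# Usage: evaluate text-based return forecasts only on post-cutoff filings.
docs = [
    {"id": "filing-A", "published": date(2023, 2, 15)},
    {"id": "filing-B", "published": date(2024, 9, 30)},
]
eval_sample = leakage_safe_sample(docs, "open-model-2024")  # keeps filing-B only
```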

Data & Methods

  • Formal setup:
    • Universe of strings Σ*, researcher-relevant subset R; researcher dataset indicated by d; model training dataset indicated by t.
    • Each text piece r ∈ R links to observable variables (Yr, Wr) and an ideal (gold-standard) measurement Vr = f*(r), the label a human expert would produce if expert labeling were scalable.
    • LLM abstraction: the training algorithm maps t → text generator cm(·; t). Researchers interact with cm via prompts.
  • Prediction analysis:
    • Valid prediction evaluation requires no overlap between t and the researcher’s evaluation sample. The paper discusses design rules to ensure this (time-based splits, model selection).
  • Estimation analysis:
    • Shows how to combine imperfect LLM labels V̂r = cm(r; t) with a labeled validation sample to construct debiased estimators for standard downstream procedures (illustrated with linear regression); see the sketch after this list. Debiasing builds on the classical measurement-error and semiparametric correction literature (Lee & Sepanski 1995; Chen, Hong & Tamer 2005; Schennach 2016) and modern ML/causal-debiasing work (e.g., Wang et al. 2020; Angelopoulos et al. 2023; Egami et al. 2024).
    • Debiased LLM-augmented estimators can be consistent, asymptotically normal, and often more efficient than estimators using the validation sample alone.
  • Empirical illustrations:
    • The paper demonstrates empirically that, without validation samples, LLM-based labeling choices (model, prompt) produce widely varying parameter estimates (coefficient magnitude, sign, significance) in applications to finance and political economy.
  • Methodological stance:
    • Black-box approach: do not require modeling internal LLM mechanisms; require checkable conditions (leakage, validation) for applied work.
    • The analysis covers deterministic generators and extends to stochastic ones (at the cost of more involved notation).
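
As a concrete illustration of the validation-and-debiasing logic, here is a minimal sketch of an additive bias correction for a sample mean, in the spirit of the prediction-powered-inference work cited above (Angelopoulos et al. 2023); the paper's own treatment covers linear regression and general downstream estimators, and the simulated labels below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data-generating process (not from the paper): a gold-standard
# label V on n documents, of which only n_val receive human labels, plus a
# cheap LLM label V_hat with systematic bias and noise.
n, n_val = 10_000, 500
V = rng.binomial(1, 0.3, size=n).astype(float)
V_hat = V + rng.normal(0.1, 0.3, size=n)        # biased, noisy LLM output

val = rng.choice(n, size=n_val, replace=False)  # validation subsample indices

theta_naive = V_hat.mean()                  # plug-in estimate: inherits LLM bias
bias_hat = (V_hat[val] - V[val]).mean()     # bias estimated on validation sample
theta_debiased = theta_naive - bias_hat     # consistent as n and n_val grow
theta_val_only = V[val].mean()              # baseline: human labels alone

print(f"truth≈{V.mean():.3f}  naive={theta_naive:.3f}  "
      f"debiased={theta_debiased:.3f}  validation-only={theta_val_only:.3f}")
```

The debiased estimate removes the systematic error of the naive plug-in and typically has lower variance than the validation-only baseline, mirroring the point that LLM outputs amplify, rather than replace, human labels.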

Implications for AI Economics

  • For applied researchers:
    • Always check and enforce no training leakage when using LLMs for prediction tasks (document model training cutoff or use models with fixed/public weights; construct evaluation samples post-cutoff).
    • When using LLMs to measure text-based concepts for downstream inference, collect and report a labeled validation sample and apply debiasing corrections — LLMs can amplify validation data but cannot substitute for it.
    • Report model identity, training cutoff (if known), prompt text, generation parameters, and validation procedures to ensure reproducibility and enable assessment of leakage and robustness (a reporting template follows this list).
  • For study design and policy:
    • Prefer open-source or time-stamped models where provenance is known; lack of transparency in commercial models can threaten validity of prediction tasks.
    • Pre-registration and explicit reporting of validation/sample-splitting rules become more important given LLM brittleness.
  • For future research in AI economics:
    • Develop practical leakage-detection tools and standards for certifying model training cutoffs.
    • Extend debiasing and validation methods to non-linear downstream estimands, complex causal designs, and high-dimensional settings.
    • Study cost–precision tradeoffs: optimal allocation between validation labeling effort and reliance on LLM outputs.
    • Investigate protocol standards (benchmarks, validation sets) tailored to economic text tasks to improve comparability across studies.
  • Broader takeaway: LLMs can substantially expand empirical work by scaling text processing, but valid inference requires simple, enforceable design rules (no leakage for prediction; validation+debiasing for estimation). When used appropriately, LLMs act as amplifiers of human validation, not as drop-in replacements.
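
To illustrate the reporting recommendation above, here is a minimal metadata template with hypothetical field names; it is a sketch of the kind of record that would let readers audit leakage and replicate the labeling step, not a standard proposed by the paper.

```python
# Hypothetical reporting template for an LLM-based measurement step; the field
# names and values are illustrative, not a standard from the paper.
llm_measurement_report = {
    "model": "open-model-2024",          # exact model identifier or weights hash
    "training_cutoff": "2024-06-01",     # from model documentation, if known
    "prompt": "Classify the filing's tone as positive/negative/neutral: {text}",
    "generation": {"temperature": 0.0, "max_tokens": 8, "seed": 1234},
    "validation": {"n_labeled": 500, "sampling": "simple random", "coders": 2},
}
```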

Assessment

  • Paper Type: theoretical
  • Evidence Strength: n/a — This is a methodological/econometric framework rather than an empirical study reporting new causal estimates; it provides formal arguments and recommended procedures rather than evidence from new data.
  • Methods Rigor: high — The paper presents a clear, formal econometric discussion of identification threats (training leakage) and of inference under measurement error using validation samples, and it prescribes concrete design and estimation strategies for researchers to obtain consistent and precise estimates.
  • Sample: No original empirical sample reported; the paper is conceptual/methodological, offering a framework applicable to researchers working with text and LLM outputs (may include illustrative or simulated examples but no primary dataset).
  • Themes: innovation, human_ai_collab
  • Identification: For prediction, prevent training leakage by ensuring the LLM has not seen the research sample (choose models/timeframes or hold out data accordingly). For estimation/measurement, achieve valid downstream inference by combining LLM-generated labels with a small, independent validation (gold-standard) sample and using it to correct for measurement error and quantify uncertainty.
  • Generalizability:
    • Framework applies only to text-based analyses where LLMs are used; not directly applicable to non-text data modalities.
    • Practical implementation depends on researchers' ability to determine or guarantee absence of LLM training overlap, which is often infeasible for proprietary/closed models.
    • Requires availability of a (possibly small) independent validation sample; small-sample variability or labeling costs may limit applicability in some settings.
    • Does not address downstream issues from model updates, distributional shift, or domain-specific LLM failures without additional checks.
    • Recommendations assume researchers can access multiple model choices or control prompts/model versions; constraints (cost, API access) can limit adherence.

Claims (7)

  • "Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost." — Outcome: Research Productivity (ability to analyze text at scale and cost) · Direction: positive · Confidence: high · Details: 0.12
  • "Researchers can now revisit old questions and tackle novel ones with rich data using LLMs." — Outcome: Research Productivity (ability to (re)address research questions using textual data) · Direction: positive · Confidence: high · Details: 0.12
  • "The paper provides an econometric framework for realizing the potential of LLMs in two empirical uses: prediction problems and estimation problems." — Outcome: Research Productivity (methodological framework for empirical use of LLMs) · Direction: positive · Confidence: high · Details: 0.12
  • "For prediction problems—forecasting outcomes from text—valid conclusions require 'no training leakage' between the LLM's training data and the researcher's sample, which can be enforced through careful model choice and research design." — Outcome: Output Quality (validity of predictive conclusions from text) · Direction: negative · Confidence: high · Details: 0.12
  • "For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates." — Outcome: Output Quality (consistency and precision of downstream estimates derived from LLM-measured variables) · Direction: positive · Confidence: high · Details: 0.12
  • "Absent a validation sample, researchers cannot assess possible errors in LLM outputs, and consequently seemingly innocuous choices (which model, which prompt) can produce dramatically different parameter estimates." — Outcome: Error Rate (robustness/error in parameter estimates derived from LLM outputs) · Direction: negative · Confidence: high · Details: 0.12
  • "When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics." — Outcome: Research Productivity (expansion of empirical economics research capabilities) · Direction: positive · Confidence: high · Details: 0.02
