Large language models can unlock large-scale text-based economic research, but only if researchers guard against training-data leakage in prediction tasks and correct LLM measurement error with an independent validation sample in estimation tasks; otherwise they risk biased and imprecise estimates.
Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost. Researchers can now revisit old questions and tackle novel ones with rich data. We provide an econometric framework for realizing this potential in two empirical uses. For prediction problems—forecasting outcomes from text—valid conclusions require “no training leakage” between the LLM's training data and the researcher's sample, which can be enforced through careful model choice and research design. For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates. Absent a validation sample, researchers cannot assess possible errors in LLM outputs, and consequently seemingly innocuous choices (which model, which prompt) can produce dramatically different parameter estimates. When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.
Summary
Main Finding
The paper develops an applied-econometric framework for using large language models (LLMs) in two distinct empirical roles — prediction and estimation — and gives clear, practical requirements for valid inference. For prediction tasks, valid out-of-sample performance requires enforcing “no training leakage” (no overlap between the LLM’s training data and the researcher’s evaluation sample). For estimation tasks (using LLMs to measure concepts from text for downstream analysis), valid downstream inference requires coupling LLM outputs with a (typically small) validation sample and using debiasing corrections; without validation data, researchers cannot assess or correct LLM errors and downstream parameter estimates can vary dramatically with innocuous choices (model, prompt).
Key Points
- Two empirical uses of LLMs:
- Prediction: use text (via LLM predictions or embeddings) to forecast outcomes (e.g., returns).
- Estimation: use LLMs to measure economic concepts from text (labels) for downstream causal/statistical analysis.
- Central requirements:
- Prediction → enforce no training leakage between the LLM’s training corpus and the researcher’s evaluation data.
- Estimation → collect a validation (labeled) sample and debias LLM outputs; this preserves consistency and asymptotic normality for downstream estimators and can improve precision beyond using the validation set alone.
- LLMs are treated as black boxes: define an LLM as a mapping from training dataset t to a text generator cm(·; t). This captures prompt engineering, sampling parameters, reasoning-time computation, RLHF, etc.
- Benchmarks are of limited use for task-specific inference because LLM performance is brittle (the “jagged frontier”): small changes in inputs, formatting, or prompt can cause large, unpredictable changes in outputs. Empirical and theoretical evidence suggests LLMs often do not learn reliable, generalizable “world models.”
- Absent validation, downstream estimates are not identifiable as a function of LLM outputs: seemingly minor choices (which model, which prompt) can change the magnitude, sign, and significance of coefficients in applied settings (finance, political economy).
- Practical design choices to avoid leakage: use models with published/fixed weights or documented training cutoffs; evaluate on documents published after the model cutoff; prefer open-source/time-stamped models when feasible.
- The framework generalizes to other uses: hypothesis generation is a prediction problem (so needs leakage checks); simulating survey/experimental subjects is an estimation problem (so requires validation).
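The "no training leakage" design rule above can be sketched as a simple sample filter: restrict the evaluation sample to documents published after the model's documented training cutoff, and refuse to proceed when no cutoff is documented. The model name and cutoff date below are illustrative assumptions, not a maintained registry.

```python
from datetime import date

# Illustrative (hypothetical) registry of documented training cutoffs.
MODEL_CUTOFFS = {
    "example-open-model-v1": date(2023, 9, 30),
}

def post_cutoff_sample(documents, model_name, cutoffs=MODEL_CUTOFFS):
    """Keep only documents published strictly after the model's training cutoff.

    Each document is a dict with 'published' (a date) and 'text'.
    An undocumented cutoff means leakage cannot be ruled out, so we raise
    rather than silently return the full sample.
    """
    if model_name not in cutoffs:
        raise ValueError(
            f"No documented training cutoff for {model_name}; cannot rule out leakage."
        )
    cutoff = cutoffs[model_name]
    return [doc for doc in documents if doc["published"] > cutoff]

docs = [
    {"published": date(2023, 6, 1), "text": "earnings call A"},   # pre-cutoff: dropped
    {"published": date(2024, 1, 15), "text": "earnings call B"},  # post-cutoff: kept
]
clean = post_cutoff_sample(docs, "example-open-model-v1")
```

Raising on an unknown cutoff mirrors the paper's stance that commercial models without documented provenance threaten the validity of prediction tasks.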
Data & Methods
- Formal setup:
- Universe of strings Σ*, researcher-relevant subset R; researcher dataset indicated by d; model training dataset indicated by t.
- Each text piece r ∈ R links to observable variables (Yr, Wr) and an ideal (gold-standard) measurement Vr = f*(r) that a human/expert would produce if scalable.
- LLM abstraction: the training algorithm maps t → text generator cm(·; t). Researchers interact with cm via prompts.
- Prediction analysis:
- Valid prediction evaluation requires no overlap between t and the researcher’s evaluation sample. The paper discusses design rules to ensure this (time-based splits, model selection).
- Estimation analysis:
- Show how to combine imperfect LLM labels V̂r = cm(r; t) with a labeled validation sample to construct debiased estimators in standard downstream procedures (illustrated in linear regression). Debiasing builds on classical measurement-error and semiparametric correction literature (Lee & Sepanski 1995; Chen, Hong & Tamer 2005; Schennach 2016) and modern ML/causal-debiasing work (e.g., Wang et al. 2020; Angelopoulos et al. 2023; Egami et al. 2024).
- Debiased LLM-augmented estimators can be consistent, asymptotically normal, and often more efficient than using the validation sample alone.
- Empirical illustrations:
- The paper demonstrates empirically that without validation samples LLM-based labeling decisions (model, prompt) produce widely varying parameter estimates in applications to finance and political economy (coefficient magnitude, sign, significance).
- Methodological stance:
- Black-box approach: do not require modeling internal LLM mechanisms; require checkable conditions (leakage, validation) for applied work.
- The analysis treats text generators as deterministic and extends to stochastic generators (at the cost of more involved notation).
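The debiasing logic described above can be sketched for the simplest downstream estimand, a mean, in the spirit of the prediction-powered corrections cited (e.g., Angelopoulos et al. 2023): use cheap LLM labels on the full sample, then correct their bias with the small validation sample where gold-standard labels are observed. All data here are simulated with a deliberately biased LLM labeler.

```python
import numpy as np

rng = np.random.default_rng(0)

n_full, n_val = 10_000, 200
# Unobserved gold-standard labels V_r for the full sample
V_true_full = rng.normal(1.0, 1.0, n_full)
# LLM labels carry a systematic +0.3 bias plus noise
Vhat_full = V_true_full + 0.3 + rng.normal(0.0, 0.2, n_full)

# Validation subsample: both LLM labels and gold labels are observed
idx = rng.choice(n_full, n_val, replace=False)
V_val, Vhat_val = V_true_full[idx], Vhat_full[idx]

# Naive estimator ignores measurement error in the LLM labels
naive = Vhat_full.mean()
# Debiased estimator: full-sample LLM mean plus the validation-sample
# estimate of the LLM's bias (gold mean minus LLM mean on validation data)
debiased = Vhat_full.mean() + (V_val.mean() - Vhat_val.mean())

print(naive, debiased)  # naive is off by roughly the +0.3 bias; debiased is not
```

Because the correction term only has to estimate the bias, its sampling noise is driven by the (small) labeling error rather than the full variance of V, which is why the debiased estimator can be more precise than using the validation sample alone.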
Implications for AI Economics
- For applied researchers:
- Always check and enforce no training leakage when using LLMs for prediction tasks (document model training cutoff or use models with fixed/public weights; construct evaluation samples post-cutoff).
- When using LLMs to measure text-based concepts for downstream inference, collect and report a labeled validation sample and apply debiasing corrections — LLMs can amplify validation data but cannot substitute for it.
- Report model identity, training cutoff (if known), prompt text, generation parameters, and validation procedures to ensure reproducibility and enable assessment of leakage/robustness.
- For study design and policy:
- Prefer open-source or time-stamped models where provenance is known; lack of transparency in commercial models can threaten validity of prediction tasks.
- Pre-registration and explicit reporting of validation/sample-splitting rules become more important given LLM brittleness.
- For future research in AI economics:
- Develop practical leakage-detection tools and standards for certifying model training cutoffs.
- Extend debiasing and validation methods to non-linear downstream estimands, complex causal designs, and high-dimensional settings.
- Study cost–precision tradeoffs: optimal allocation between validation labeling effort and reliance on LLM outputs.
- Investigate protocol standards (benchmarks, validation sets) tailored to economic text tasks to improve comparability across studies.
- Broader takeaway: LLMs can substantially expand empirical work by scaling text processing, but valid inference requires simple, enforceable design rules (no leakage for prediction; validation+debiasing for estimation). When used appropriately, LLMs act as amplifiers of human validation, not as drop-in replacements.
Assessment
Claims (7)
| Claim | Category | Direction | Confidence | Outcome | Score |
|---|---|---|---|---|---|
| Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost. | Research Productivity | positive | high | ability to analyze text at scale and cost | 0.12 |
| Researchers can now revisit old questions and tackle novel ones with rich data using LLMs. | Research Productivity | positive | high | ability to (re)address research questions using textual data | 0.12 |
| The paper provides an econometric framework for realizing the potential of LLMs in two empirical uses: prediction problems and estimation problems. | Research Productivity | positive | high | methodological framework for empirical use of LLMs | 0.12 |
| For prediction problems—forecasting outcomes from text—valid conclusions require "no training leakage" between the LLM's training data and the researcher's sample, which can be enforced through careful model choice and research design. | Output Quality | negative | high | validity of predictive conclusions from text | 0.12 |
| For estimation problems—automating the measurement of economic concepts for downstream analysis—valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates. | Output Quality | positive | high | consistency and precision of downstream estimates derived from LLM-measured variables | 0.12 |
| Absent a validation sample, researchers cannot assess possible errors in LLM outputs, and consequently seemingly innocuous choices (which model, which prompt) can produce dramatically different parameter estimates. | Error Rate | negative | high | robustness/error in parameter estimates derived from LLM outputs | 0.12 |
| When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics. | Research Productivity | positive | high | expansion of empirical economics research capabilities | 0.02 |