Large language models can stand in for human subjects in low-stakes pilots and robustness checks, but their behavioral fidelity is inconsistent and error-prone; researchers should treat synthetic participants as heuristic supplements—not substitutes—and adopt standardized evaluation and ethical safeguards.
Abstract
In recent years, the prospect of using Large Language Models (LLMs) to simulate participants in various research and data-collection methods has been interrogated extensively. Proponents cite aspirational promises, including high flexibility, adaptability, better representation, and reduced research costs, all by leveraging the encoded wisdom of the internet crowd. Empirical studies paint a more nuanced but fragmented picture, with mixed results, heterogeneous methods, and a saturation of different perspectives. In this systematic literature review, we delineate a clear and comprehensive conceptual understanding of LLM-generated participants and their comparative relationship to human samples. We synthesize the findings of 182 studies, obtained through a hybrid database and reference search followed by rigorous quality curation. Grounded in generalizable indicators, we present a standardized categorization of four fundamental issues that affect synthetic participants across diverse types of simulations: cognitive misalignments, distortions, misleading believability, and overfitting/contamination. Although the survey reveals integrations of different LLMs, prompt engineering techniques, and participant or environment modeling methods, the fidelity improvements they demonstrate remain modest. At their most representative, LLMs may stochastically parrot the data they were pre-trained on or fine-tuned with. To set appropriate expectations, explain their limitations, and inform future applications, we propose framing synthetic participants as heuristic-like. Additionally, we discuss evaluation measures, specific supplemental roles for which synthetic participants can be valid, the underexplored potential of augmentative approaches, and a critical professional, social, and ethical consideration of simulated insights.
Summary
Main Finding
LLM-generated synthetic participants show promise as a low-cost, flexible adjunct for research and data-collection tasks, but across 182 reviewed studies their fidelity to human participants is modest and inconsistent. Major failure modes—cognitive misalignments, distortions, misleading believability, and overfitting/contamination—limit their suitability as direct substitutes for human samples. The authors recommend treating synthetic participants as heuristic-like tools (useful for specific supplemental roles) rather than as full replacements, and call for standardized evaluation, careful application, and ethical safeguards.
Key Points
- Scope: Systematic review of 182 studies comparing LLM-generated participants to human samples across diverse simulation types, models, and prompting/method variants.
- Four standardized failure categories affecting synthetic participants:
  - Cognitive misalignments — differences in reasoning, goals, and bounded rationality compared with humans.
  - Distortions — systematic biases in outputs relative to target human distributions (a minimal measurement sketch follows this list).
  - Misleading believability — outputs that look plausible but are incorrect or unrepresentative.
  - Overfitting/contamination — reproduction of pre-training or fine-tuning data (stochastic parroting) and leakage from training sets.
- Heterogeneous literature: mixed results, varied methods, and fragmented evaluation metrics impede general conclusions about when LLMs reliably mimic humans.
- Fidelity gains from prompt engineering, model selection, or participant/environment modeling have been limited and context-dependent.
- Recommendation: conceptualize synthetic participants as heuristics—useful to explore ideas, stress-test designs, augment limited human data, or run low-stakes pilots, but not to replace empirical human samples without rigorous validation.
- Additional topics covered: evaluation metrics, promising augmentative (hybrid) approaches, and professional, social, and ethical concerns (e.g., transparency, consent, contamination of human-data pools).
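To make one of these categories concrete, below is a minimal sketch of how a distortion could be quantified: comparing the distribution of synthetic Likert responses against a human reference distribution. The response counts, the metric choices (total variation distance and a chi-square goodness-of-fit test), and all names are illustrative assumptions, not methods taken from the reviewed studies.

```python
# Hypothetical sketch: quantifying "distortion" as the distance between
# an LLM-generated response distribution and a human reference.
import numpy as np
from scipy.stats import chisquare

# Illustrative 5-point Likert response counts for the same survey item.
human_counts = np.array([12, 30, 41, 25, 10])  # human reference sample
llm_counts = np.array([3, 18, 70, 22, 5])      # synthetic sample

p = human_counts / human_counts.sum()  # empirical human distribution
q = llm_counts / llm_counts.sum()      # empirical synthetic distribution

# Total variation distance: 0 means identical, 1 means disjoint support.
tvd = 0.5 * np.abs(p - q).sum()

# Chi-square goodness-of-fit of the synthetic counts against expected
# counts under the human distribution (rescaled to the synthetic n).
stat, pval = chisquare(llm_counts, f_exp=p * llm_counts.sum())

print(f"TVD = {tvd:.3f}, chi2 = {stat:.1f}, p = {pval:.4f}")
```

Note that a small distributional gap is necessary but not sufficient evidence of fidelity; misleading believability, for instance, concerns individually plausible but unrepresentative outputs that aggregate checks like this one can miss.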
Data & Methods
- Literature assembled via a hybrid search strategy combining electronic database queries and reference/back-citation searches to identify empirical and methodological studies involving LLM-generated participants.
- Screening and curation: rigorous quality filters were applied to select 182 studies for synthesis (the abstract does not detail the inclusion/exclusion criteria or quality metrics).
- Synthesis approach: studies were coded against generalizable indicators and organized into a standardized taxonomy centered on the four failure categories listed above (an illustrative coding sketch follows this list).
- Scope of methods in reviewed studies: wide variation — different LLM families and sizes, prompt engineering techniques, participant persona modeling, environment simulations, and evaluation protocols. This heterogeneity motivated the need for a unified conceptual framing rather than a single performance estimate.
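The following is a hypothetical sketch of what coding studies against the four-category taxonomy could look like; the record fields, category keys, and example study are illustrative assumptions, not the authors' actual coding scheme.

```python
# Hypothetical study-coding record for the four-category taxonomy.
from dataclasses import dataclass, field

CATEGORIES = (
    "cognitive_misalignment",
    "distortion",
    "misleading_believability",
    "overfitting_contamination",
)

@dataclass
class StudyRecord:
    study_id: str
    llm_family: str        # model family/size as reported by the study
    simulation_type: str   # e.g. survey, game, persona dialogue
    failure_codes: dict[str, bool] = field(
        default_factory=lambda: {c: False for c in CATEGORIES}
    )

# Example: a fictional survey-simulation study exhibiting distortions.
rec = StudyRecord(study_id="S042", llm_family="gpt-family",
                  simulation_type="survey")
rec.failure_codes["distortion"] = True
```

Representing codes this way makes the cross-study heterogeneity noted above explicit: each record carries its model family and simulation type, so fidelity comparisons can be stratified rather than pooled.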
Implications for AI Economics
- Use cases where LLM participants can add value:
  - Pilot testing and rapid prototyping of surveys, experiments, games, and mechanism designs to identify gross design flaws before costly human trials.
  - Generating hypotheses or counterfactual scenarios for exploratory work and scenario analysis.
  - Stress-testing economic models by enumerating extreme or rare behaviors, or creating catalogues of plausible agent responses for robustness checks.
  - Augmenting small human samples in hybrid designs (LLM + human) with careful validation to increase variety at lower marginal cost.
  - Teaching, training, and constructing illustrative examples for pedagogy and communication.
- Risks and limitations for economic research:
  - External validity and causal inference risk: cognitive misalignments and distortions can bias estimated behaviors, preferences, or treatment effects.
  - Strategic and game-theoretic settings: LLMs may misrepresent incentives, dynamic strategic thinking, and bounded rationality, producing misleading equilibria or mechanistic responses.
  - Contamination/data leakage: stochastic parroting risks reusing training-set content, which can invalidate novelty claims, leak sensitive information, or bias empirical baselines.
  - Misleading believability can produce high-confidence but incorrect inferences if synthetic outputs are not rigorously calibrated against human data.
- Research design recommendations for economists:
  - Treat LLM participants as supplementary/heuristic tools: pre-register how synthetic data will be used, and avoid substituting synthetic for human data in confirmatory causal studies without validation.
  - Validate synthetic outputs against held-out human samples, using clear metrics of behavioral fidelity relevant to the economic question (e.g., revealed preferences, strategy-choice distributions, response latencies if relevant); a validation sketch follows this list.
  - Use hybrid designs with human-in-the-loop checks, adversarial probing, and sensitivity analyses to quantify how conclusions change with synthetic vs. human inputs.
  - Monitor and mitigate contamination: track model training sources when possible, avoid tasks that require detecting rare events or copyrighted/sensitive content, and disclose use of LLM-generated participants.
  - Invest in standardized benchmarks for economic behaviors (strategic interaction, intertemporal choice, risk, social preferences) to make cross-study comparisons possible.
- Policy and ethical considerations:
  - Transparency: disclose synthetic participant use in publications and reports.
  - Consent and representation: avoid presenting synthetic results as human behavior without clear labeling; consider effects on participant pools and labor markets for research subjects.
  - Reproducibility: document prompts, model versions, seeds, and any fine-tuning to allow audit and replication (the sketch after this list includes a minimal provenance log).
- Bottom line for economists: LLMs can reduce costs and accelerate exploratory work, but their current limitations mean they are best used to complement—not replace—human subjects in empirical economic research unless rigorous, domain-specific validation demonstrates equivalence for the question at hand.
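The block below is a minimal sketch of the validation and reproducibility practices just listed: it compares a synthetic sample against a held-out human sample on two distributional fidelity metrics, then emits a provenance log recording the prompt, model version, and seed. The outcome variable (a willingness-to-pay measure), the model identifier, the prompt file name, and the pass/fail threshold are all hypothetical assumptions, not prescriptions from the review.

```python
# Hypothetical validation-and-provenance sketch: fidelity metrics plus a
# replication log. All identifiers and thresholds below are assumptions.
import json
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(seed=7)  # fixed seed for reproducibility

# Stand-ins for real data: a held-out human sample and a synthetic one
# (e.g., willingness-to-pay in dollars elicited from LLM "participants").
human_holdout = rng.normal(loc=10.0, scale=3.0, size=200)
synthetic = rng.normal(loc=11.5, scale=2.0, size=200)

ks_stat, ks_p = ks_2samp(human_holdout, synthetic)   # shape difference
w1 = wasserstein_distance(human_holdout, synthetic)  # location/scale gap

run_log = {
    "model_version": "example-llm-2025-01",  # placeholder, not a real model
    "prompt_template": "persona_v3.txt",     # hypothetical prompt file
    "sampling_seed": 7,
    "fidelity": {
        "ks_stat": round(float(ks_stat), 3),
        "ks_p": round(float(ks_p), 4),
        "wasserstein": round(float(w1), 3),
    },
    # Threshold chosen here for illustration; pre-register the real one.
    "passes_preregistered_threshold": bool(w1 < 1.0),
}
print(json.dumps(run_log, indent=2))
```

Persisting such a log alongside each synthetic dataset makes the disclosure, audit, and replication practices above mechanical rather than aspirational.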
Assessment
Claims (14)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Across 182 reviewed studies, LLM-generated synthetic participants have modest and inconsistent fidelity to human participants. | Research Productivity | mixed | high | Fidelity of synthetic participants to human participants (behavioral/response similarity) | n=182; 0.24 |
| LLM-generated synthetic participants are a promising low-cost, flexible adjunct for research and data-collection tasks (useful for pilots, prototyping, hypothesis generation, stress-testing, and augmenting small human samples). | Research Productivity | positive | medium | Utility in research workflows (cost, speed, ability to detect gross design flaws, exploratory value) | n=182; 0.14 |
| Major failure modes limiting synthetic participants as direct substitutes for humans are: cognitive misalignments, distortions, misleading believability, and overfitting/contamination. | Research Productivity | negative | high | Types and frequency of fidelity failures (categorical classification of failure modes) | n=182; 0.24 |
| Cognitive misalignments: LLMs differ from humans in reasoning, goals, and bounded rationality, which can alter behavior in economic and strategic tasks. | Research Productivity | negative | high | Alignment of reasoning processes and goal-directed responses between LLMs and humans | n=182; 0.24 |
| Distortions: LLM outputs can exhibit systematic biases relative to target human distributions. | Research Productivity | negative | high | Distributional deviations between LLM-generated responses and human responses (biases) | n=182; 0.24 |
| Misleading believability: LLM outputs may look plausible but be incorrect or unrepresentative, risking overconfidence in synthetic data. | Research Productivity | negative | high | Rate of plausible-but-incorrect or unrepresentative outputs (perceived plausibility vs. ground-truth accuracy) | n=182; 0.24 |
| Overfitting/contamination: LLMs can reproduce pre-training or fine-tuning data (stochastic parroting) and leak training-set content into outputs. | Research Productivity | negative | high | Occurrence of memorized or training-set-specific content in generated outputs | n=182; 0.24 |
| The literature is heterogeneous (different LLM families/sizes, prompting techniques, participant persona modeling, environments, and evaluation protocols), which impedes general conclusions about when LLMs reliably mimic humans. | Research Productivity | null_result | high | Methodological heterogeneity across studies (variance in models, prompts, evaluation metrics) | n=182; 0.24 |
| Fidelity gains from prompt engineering, model selection, or participant/environment modeling have been limited and context-dependent. | Research Productivity | mixed | medium | Change in fidelity metrics following prompt engineering, model selection, or environment/participant modeling | n=182; 0.14 |
| LLM-generated participants are particularly risky in strategic and game-theoretic settings because they may misrepresent incentives, dynamic strategic thinking, and bounded rationality. | Research Productivity | negative | medium | Accuracy of strategic decisions, equilibrium behavior, and incentive-respecting responses compared to humans | n=182; 0.14 |
| Recommendation: treat synthetic participants as heuristic tools (supplemental roles) rather than replacements; use hybrid designs, validate against held-out human samples, pre-register synthetic-data usage, and adopt transparency and reproducibility practices (document prompts, model versions, seeds, fine-tuning). | Research Productivity | positive | high | Recommended research practices and safeguards (use-case guidelines, validation procedures, disclosure standards) | n=182; 0.24 |
| Standardized benchmarks for economic behaviors (e.g., strategic interaction, intertemporal choice, risk, social preferences) are needed to enable cross-study comparisons and rigorous validation of synthetic participants. | Research Productivity | positive | medium | Existence and adoption of standardized benchmarks for evaluating LLM behavioral fidelity in economic domains | n=182; 0.14 |
| Using LLM participants without rigorous validation can bias external validity and causal inference in economic research. | Research Productivity | negative | high | Bias in estimated behaviors, preferences, or causal effects when using synthetic participants unvalidated against humans | n=182; 0.24 |
| Ethical and policy considerations require disclosure of synthetic participant use, protection against contamination of human-data pools, and attention to consent and representation issues. | Governance And Regulation | positive | medium | Adoption of disclosure, consent, and data-pool protection practices in studies using synthetic participants | 0.14 |