Large language models can stand in for human subjects in low-stakes pilots and robustness checks, but their behavioral fidelity is inconsistent and error-prone; researchers should treat synthetic participants as heuristic supplements—not substitutes—and adopt standardized evaluation and ethical safeguards.
Abstract
In recent years, the prospect of using Large Language Models (LLMs) to simulate participants in various research and data-collection methods has been interrogated extensively. Proponents cite aspirational promises, including high flexibility, adaptability, better representation, and reduced research costs, all by leveraging the encoded wisdom of the internet crowd. Empirical studies paint a more nuanced but fragmented picture, with mixed results, heterogeneous methods, and a saturation of different perspectives. In this systematic literature review, we delineate a clear and comprehensive conceptual understanding of LLM-generated participants and their comparative relationship to human samples. We synthesize the findings of 182 studies, obtained through a hybrid database and reference search followed by rigorous quality curation. Grounded in generalizable indicators, we present a standardized categorization of four fundamental issues that affect synthetic participants across diverse types of simulations: cognitive misalignments, distortions, misleading believability, and overfitting/contamination. Although the survey reveals integrations of different LLMs, prompt engineering techniques, and participant or environment modeling methods, the fidelity improvements they demonstrate remain modest. At their most representative, LLMs may stochastically parrot the data they were pre-trained on or fine-tuned with. To set appropriate expectations, explain their limitations, and inform future applications, we propose framing synthetic participants as heuristic-like. Additionally, we discuss evaluation measures, specific supplemental roles for which synthetic participants can be valid, the underexplored potential of augmentative approaches, and a critical professional, social, and ethical consideration of simulated insights.
Summary
Main Finding
LLM-generated synthetic participants show promise as a low-cost, flexible adjunct for research and data-collection tasks, but across 182 reviewed studies their fidelity to human participants is modest and inconsistent. Major failure modes—cognitive misalignments, distortions, misleading believability, and overfitting/contamination—limit their suitability as direct substitutes for human samples. The authors recommend treating synthetic participants as heuristic-like tools (useful for specific supplemental roles) rather than as full replacements, and call for standardized evaluation, careful application, and ethical safeguards.
Key Points
- Scope: Systematic review of 182 studies comparing LLM-generated participants to human samples across diverse simulation types, models, and prompting/method variants.
- Four standardized failure categories affecting synthetic participants:
  - Cognitive misalignments — differences in reasoning, goals, and bounded rationality compared with humans.
  - Distortions — systematic biases in outputs relative to target human distributions (a minimal measurement sketch follows this list).
  - Misleading believability — outputs that look plausible but are incorrect or unrepresentative.
  - Overfitting/contamination — reproduction of pre-training or fine-tuning data (stochastic parroting) and leakage from training sets.
- Heterogeneous literature: mixed results, varied methods, and fragmented evaluation metrics impede general conclusions about when LLMs reliably mimic humans.
- Fidelity gains from prompt engineering, model selection, or participant/environment modeling have been limited and context-dependent.
- Recommendation: conceptualize synthetic participants as heuristics—useful to explore ideas, stress-test designs, augment limited human data, or run low-stakes pilots, but not to replace empirical human samples without rigorous validation.
- Additional topics covered: evaluation metrics, promising augmentative (hybrid) approaches, and professional, social, and ethical concerns (e.g., transparency, consent, contamination of human-data pools).
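To make one of these categories concrete, below is a minimal sketch of how a distortion could be quantified: comparing the distribution of synthetic Likert responses against a human reference distribution. The response counts, the metric choices (total variation distance and a chi-square goodness-of-fit test), and all names are illustrative assumptions, not methods taken from the reviewed studies.

```python
# Hypothetical sketch: quantifying "distortion" as the distance between
# an LLM-generated response distribution and a human reference.
import numpy as np
from scipy.stats import chisquare

# Illustrative 5-point Likert response counts for the same survey item.
human_counts = np.array([12, 30, 41, 25, 10])  # human reference sample
llm_counts = np.array([3, 18, 70, 22, 5])      # synthetic sample

p = human_counts / human_counts.sum()  # empirical human distribution
q = llm_counts / llm_counts.sum()      # empirical synthetic distribution

# Total variation distance: 0 means identical, 1 means disjoint support.
tvd = 0.5 * np.abs(p - q).sum()

# Chi-square goodness-of-fit of the synthetic counts against expected
# counts under the human distribution (rescaled to the synthetic n).
stat, pval = chisquare(llm_counts, f_exp=p * llm_counts.sum())

print(f"TVD = {tvd:.3f}, chi2 = {stat:.1f}, p = {pval:.4f}")
```

Note that a small distributional gap is necessary but not sufficient evidence of fidelity; misleading believability, for instance, concerns individually plausible but unrepresentative outputs that aggregate checks like this one can miss.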
Data & Methods
- Literature assembled via a hybrid search strategy combining electronic database queries and reference/back-citation searches to identify empirical and methodological studies involving LLM-generated participants.
- Screening and curation: rigorous quality filters were applied to select 182 studies for synthesis (the abstract does not detail the inclusion/exclusion criteria or quality metrics).
- Synthesis approach: studies were coded against generalizable indicators and organized into a standardized taxonomy centered on the four failure categories listed above (an illustrative coding sketch follows this list).
- Scope of methods in reviewed studies: wide variation — different LLM families and sizes, prompt engineering techniques, participant persona modeling, environment simulations, and evaluation protocols. This heterogeneity motivated the need for a unified conceptual framing rather than a single performance estimate.
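The following is a hypothetical sketch of what coding studies against the four-category taxonomy could look like; the record fields, category keys, and example study are illustrative assumptions, not the authors' actual coding scheme.

```python
# Hypothetical study-coding record for the four-category taxonomy.
from dataclasses import dataclass, field

CATEGORIES = (
    "cognitive_misalignment",
    "distortion",
    "misleading_believability",
    "overfitting_contamination",
)

@dataclass
class StudyRecord:
    study_id: str
    llm_family: str        # model family/size as reported by the study
    simulation_type: str   # e.g. survey, game, persona dialogue
    failure_codes: dict[str, bool] = field(
        default_factory=lambda: {c: False for c in CATEGORIES}
    )

# Example: a fictional survey-simulation study exhibiting distortions.
rec = StudyRecord(study_id="S042", llm_family="gpt-family",
                  simulation_type="survey")
rec.failure_codes["distortion"] = True
```

Representing codes this way makes the cross-study heterogeneity noted above explicit: each record carries its model family and simulation type, so fidelity comparisons can be stratified rather than pooled.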
Implications for AI Economics
- Use cases where LLM participants can add value:
  - Pilot testing and rapid prototyping of surveys, experiments, games, and mechanism designs to identify gross design flaws before costly human trials.
  - Generating hypotheses or counterfactual scenarios for exploratory work and scenario analysis.
  - Stress-testing economic models by enumerating extreme or rare behaviors, or creating catalogues of plausible agent responses for robustness checks.
  - Augmenting small human samples in hybrid designs (LLM + human) with careful validation to increase variety at lower marginal cost.
  - Teaching, training, and constructing illustrative examples for pedagogy and communication.
- Risks and limitations for economic research:
  - External validity and causal inference risk: cognitive misalignments and distortions can bias estimated behaviors, preferences, or treatment effects.
  - Strategic and game-theoretic settings: LLMs may misrepresent incentives, dynamic strategic thinking, and bounded rationality, producing misleading equilibria or mechanistic responses.
  - Contamination/data leakage: stochastic parroting risks reusing training-set content, which can invalidate novelty claims, leak sensitive information, or bias empirical baselines.
  - Misleading believability can produce high-confidence but incorrect inferences if synthetic outputs are not rigorously calibrated against human data.
- Research design recommendations for economists:
  - Treat LLM participants as supplementary/heuristic tools: pre-register how synthetic data will be used, and avoid substituting synthetic for human data in confirmatory causal studies without validation.
  - Validate synthetic outputs against held-out human samples, using clear metrics of behavioral fidelity relevant to the economic question (e.g., revealed preferences, strategy-choice distributions, response latencies if relevant); a validation sketch follows this list.
  - Use hybrid designs with human-in-the-loop checks, adversarial probing, and sensitivity analyses to quantify how conclusions change with synthetic vs. human inputs.
  - Monitor and mitigate contamination: track model training sources when possible, avoid tasks that require detecting rare events or copyrighted/sensitive content, and disclose use of LLM-generated participants.
  - Invest in standardized benchmarks for economic behaviors (strategic interaction, intertemporal choice, risk, social preferences) to make cross-study comparisons possible.
- Policy and ethical considerations:
  - Transparency: disclose synthetic participant use in publications and reports.
  - Consent and representation: avoid presenting synthetic results as human behavior without clear labeling; consider effects on participant pools and labor markets for research subjects.
  - Reproducibility: document prompts, model versions, seeds, and any fine-tuning to allow audit and replication (the sketch after this list includes a minimal provenance log).
- Bottom line for economists: LLMs can reduce costs and accelerate exploratory work, but their current limitations mean they are best used to complement—not replace—human subjects in empirical economic research unless rigorous, domain-specific validation demonstrates equivalence for the question at hand.
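The block below is a minimal sketch of the validation and reproducibility practices just listed: it compares a synthetic sample against a held-out human sample on two distributional fidelity metrics, then emits a provenance log recording the prompt, model version, and seed. The outcome variable (a willingness-to-pay measure), the model identifier, the prompt file name, and the pass/fail threshold are all hypothetical assumptions, not prescriptions from the review.

```python
# Hypothetical validation-and-provenance sketch: fidelity metrics plus a
# replication log. All identifiers and thresholds below are assumptions.
import json
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(seed=7)  # fixed seed for reproducibility

# Stand-ins for real data: a held-out human sample and a synthetic one
# (e.g., willingness-to-pay in dollars elicited from LLM "participants").
human_holdout = rng.normal(loc=10.0, scale=3.0, size=200)
synthetic = rng.normal(loc=11.5, scale=2.0, size=200)

ks_stat, ks_p = ks_2samp(human_holdout, synthetic)   # shape difference
w1 = wasserstein_distance(human_holdout, synthetic)  # location/scale gap

run_log = {
    "model_version": "example-llm-2025-01",  # placeholder, not a real model
    "prompt_template": "persona_v3.txt",     # hypothetical prompt file
    "sampling_seed": 7,
    "fidelity": {
        "ks_stat": round(float(ks_stat), 3),
        "ks_p": round(float(ks_p), 4),
        "wasserstein": round(float(w1), 3),
    },
    # Threshold chosen here for illustration; pre-register the real one.
    "passes_preregistered_threshold": bool(w1 < 1.0),
}
print(json.dumps(run_log, indent=2))
```

Persisting such a log alongside each synthetic dataset makes the disclosure, audit, and replication practices above mechanical rather than aspirational.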
Assessment
Claims (14)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Across 182 reviewed studies, LLM-generated synthetic participants have modest and inconsistent fidelity to human participants. | Research Productivity | mixed | high | Fidelity of synthetic participants to human participants (behavioral/response similarity) | n=182; 0.24 |
| LLM-generated synthetic participants are a promising low-cost, flexible adjunct for research and data-collection tasks (useful for pilots, prototyping, hypothesis generation, stress-testing, and augmenting small human samples). | Research Productivity | positive | medium | Utility in research workflows (cost, speed, ability to detect gross design flaws, exploratory value) | n=182; 0.14 |
| Major failure modes limiting synthetic participants as direct substitutes for humans are: cognitive misalignments, distortions, misleading believability, and overfitting/contamination. | Research Productivity | negative | high | Types and frequency of fidelity failures (categorical classification of failure modes) | n=182; 0.24 |
| Cognitive misalignments: LLMs differ from humans in reasoning, goals, and bounded rationality, which can alter behavior in economic and strategic tasks. | Research Productivity | negative | high | Alignment of reasoning processes and goal-directed responses between LLMs and humans | n=182; 0.24 |
| Distortions: LLM outputs can exhibit systematic biases relative to target human distributions. | Research Productivity | negative | high | Distributional deviations between LLM-generated responses and human responses (biases) | n=182; 0.24 |
| Misleading believability: LLM outputs may look plausible but be incorrect or unrepresentative, risking overconfidence in synthetic data. | Research Productivity | negative | high | Rate of plausible-but-incorrect or unrepresentative outputs (perceived plausibility vs. ground-truth accuracy) | n=182; 0.24 |
| Overfitting/contamination: LLMs can reproduce pre-training or fine-tuning data (stochastic parroting) and leak training-set content into outputs. | Research Productivity | negative | high | Occurrence of memorized or training-set-specific content in generated outputs | n=182; 0.24 |
| The literature is heterogeneous (different LLM families/sizes, prompting techniques, participant persona modeling, environments, and evaluation protocols), which impedes general conclusions about when LLMs reliably mimic humans. | Research Productivity | null_result | high | Methodological heterogeneity across studies (variance in models, prompts, evaluation metrics) | n=182; 0.24 |
| Fidelity gains from prompt engineering, model selection, or participant/environment modeling have been limited and context-dependent. | Research Productivity | mixed | medium | Change in fidelity metrics following prompt engineering, model selection, or environment/participant modeling | n=182; 0.14 |
| LLM-generated participants are particularly risky in strategic and game-theoretic settings because they may misrepresent incentives, dynamic strategic thinking, and bounded rationality. | Research Productivity | negative | medium | Accuracy of strategic decisions, equilibrium behavior, and incentive-respecting responses compared to humans | n=182; 0.14 |
| Recommendation: treat synthetic participants as heuristic tools (supplemental roles) rather than replacements; use hybrid designs, validate against held-out human samples, pre-register synthetic-data usage, and adopt transparency and reproducibility practices (document prompts, model versions, seeds, fine-tuning). | Research Productivity | positive | high | Recommended research practices and safeguards (use-case guidelines, validation procedures, disclosure standards) | n=182; 0.24 |
| Standardized benchmarks for economic behaviors (e.g., strategic interaction, intertemporal choice, risk, social preferences) are needed to enable cross-study comparisons and rigorous validation of synthetic participants. | Research Productivity | positive | medium | Existence and adoption of standardized benchmarks for evaluating LLM behavioral fidelity in economic domains | n=182; 0.14 |
| Using LLM participants without rigorous validation can bias external validity and causal inference in economic research. | Research Productivity | negative | high | Bias in estimated behaviors, preferences, or causal effects when using synthetic participants unvalidated against humans | n=182; 0.24 |
| Ethical and policy considerations require disclosure of synthetic participant use, protection against contamination of human-data pools, and attention to consent and representation issues. | Governance And Regulation | positive | medium | Adoption of disclosure, consent, and data-pool protection practices in studies using synthetic participants | 0.14 |