The Commonplace

LLM-based ‘digital twins’ built from routine panel data predict individual survey responses with up to 78.8% accuracy in the best configuration. Raw dialog-history embeddings and deeper information raise performance, but gains taper past the 75% entropy quartile.

Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
Leonard Kinzinger, Jochen Hartmann · June 03, 2026
arXiv · descriptive · medium evidence · relevance 7/10
Detailed individual digital twins built from existing panel data predict held-out survey responses well (best-cell accuracy 78.8%, best-cell Fisher-z correlation r = 0.590). Performance rises with information depth but shows diminishing returns past the 75% entropy quartile. Raw dialog-history embeddings raise accuracy, and an explicit reasoning mode raises rank-order correlation.

LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.

Summary

Main Finding

LLM-based individual-level digital twins can be constructed from pre-existing heterogeneous panel data (the SOEP) and perform competitively with twins built on purpose-collected instruments. Twin quality improves as person-level information is added (items ranked by normalized Shannon entropy) but shows diminishing returns past the 75% entropy quartile, which serves as a practical Pareto point. Embedding past responses as a raw dialog history (vs. a narrative persona summary) raises hold-out accuracy in every model-by-reasoning cell at the 100% depth, and prompting the model to "think" (explicit reasoning) improves rank-order correlation without moving accuracy. Best results on the SOEP held-out set: 78.8% accuracy (best single cell) and a Fisher-z rank-order correlation of r = 0.590.
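
As a rough illustration of how items might be ranked by normalized Shannon entropy and cut at a given depth, consider the sketch below. It is one possible reading, not the authors' procedure: the function names are hypothetical, and both the choice to rank highest entropy first and the choice to cut by share of items (rather than share of total entropy) are assumptions. This summary does not list the five depth levels, so the share is left as a parameter.

```python
import math
from collections import Counter


def normalized_entropy(responses, n_categories):
    """Shannon entropy of the observed answers divided by log(n_categories).

    The result lies in [0, 1]: 0 when everyone gives the same answer,
    1 when answers are spread evenly over all categories.
    """
    if n_categories < 2:
        return 0.0
    counts = Counter(responses)
    total = len(responses)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_categories)


def items_at_depth(item_entropy, share):
    """Return the top `share` of items, highest entropy first.

    ASSUMPTION: depth is a share of the item count; the paper may
    define it differently.
    """
    ranked = sorted(item_entropy, key=item_entropy.get, reverse=True)
    k = math.ceil(share * len(ranked))
    return ranked[:k]


# Toy entropies for four hypothetical items.
toy = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.5}
depth_75 = items_at_depth(toy, 0.75)   # three most informative items
depth_100 = items_at_depth(toy, 1.0)   # every item
```

Because each depth is a prefix of the same ranking, the depths are cumulative: every item present at a shallower depth is also present at a deeper one.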

Key Points

  • Data substrate: Pre-existing heterogeneous panel data (SOEP) — similar in shape to CRM/loyalty/repeat-survey collections firms hold — is sufficient to build detailed individual twins.
  • Experimental grid: Authors evaluated a 3 × 5 × 2 × 2 construction-method space (three open-weight LLMs, five cumulative information depths by normalized Shannon entropy, two embedding methods, two reasoning modes) → 60 cells.
  • Scale: >2.1 million twin responses scored across 500 participants and 183 held-out questions.
  • Information depth: Accuracy and rank-order correlation rise with included information on a concave curve; the 75% entropy quartile captures most signal with substantially lower prompt token cost than 100%.
  • Embeddings: Using a raw dialog history of an individual’s past responses outperforms a condensed narrative persona summary on hold-out accuracy (especially at 100% depth) across models and reasoning modes.
  • Reasoning mode: Explicit chain-of-thought / thinking prompts raised rank-order correlation (better at reproducing who ranks higher than whom) but did not materially change per-item accuracy.
  • Model differences: Among the open-weight models tested (including Qwen 3 and Gemma 4), Gemma 4 attained the single best accuracy cell (78.8%) while Qwen 3 achieved the highest rank-order correlation (r = 0.590); model choice materially affects metrics.
  • Failure modes and caveats: Results sit in the same order of magnitude as top twins trained on bespoke data, but documented twin biases remain relevant (shrinkage toward base-model priors, reduced response variance, potential demographic stereotyping and alignment-induced drift).
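
The grid and scale figures in the list above can be checked with a few lines of arithmetic. The sketch below enumerates the 3 × 5 × 2 × 2 grid and computes the response count a full crossing would give. The model and depth labels are placeholders, since this summary gives only the count per factor.

```python
from itertools import product

# Placeholder labels (hypothetical); only the factor sizes come from the summary.
models = ["model_a", "model_b", "model_c"]
depths = ["depth_1", "depth_2", "depth_3", "depth_4", "depth_5"]
embeddings = ["persona_summary", "dialog_history"]
reasoning_modes = ["direct", "thinking"]

# One cell per combination of the four construction choices.
cells = list(product(models, depths, embeddings, reasoning_modes))
n_cells = len(cells)  # 3 * 5 * 2 * 2 = 60

participants = 500
held_out_questions = 183

# Responses if every twin answered every held-out question in every cell.
upper_bound = n_cells * participants * held_out_questions  # 5,490,000
```

The reported total of more than 2.1 million scored responses is well below this full-crossing figure of 5.49 million. The summary does not say how the two reconcile; one plausible but unconfirmed reason is that panel respondents did not answer every held-out item, so not every participant-question pair can be scored.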

Data & Methods

  • Source data: German Socio-Economic Panel (SOEP) Core v40 EU Edition — long-running, representative longitudinal household panel with many repeat items per individual across commercially relevant domains.
  • Sample & evaluation: 500 participants sampled; 183 held-out questions used for evaluation; construction conditioned on the remaining items.
  • Construction grid:
    • Models: three open-weight LLMs; Qwen 3 and Gemma 4 are named as the top performers.
    • Information depth: five cumulative depths ranked by normalized Shannon entropy (cumulative inclusion of items ordered by informativeness); analysis highlights 75% and 100% points.
    • Embeddings: (1) narrative persona summary vs (2) raw dialog history of past responses.
    • Reasoning modes: (1) explicit thinking / chain-of-thought vs (2) standard direct-answer mode.
  • Scale of experiments: 3 × 5 × 2 × 2 = 60 cells; total >2.1M generated answers scored.
  • Evaluation metrics:
    • Accuracy: normalized closeness of twin answer to human answer on item’s natural scale (1 = perfect match).
    • Rank-order correlation: per-question cross-respondent correlation (how well twins reproduce who scores higher than whom).
    • Dispersion ratio: twin-to-human response variance ratio (measures over/under-dispersion of twin responses).
  • Implementation choices: Open-weight models used for reproducibility and deployability within regulated or license-bound settings; prompt token costs and context budgeting explicitly considered.
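
The three evaluation metrics listed above could be computed roughly as follows. This is a minimal sketch under stated assumptions, not the paper's scoring code: the range-normalized form of accuracy, the use of Spearman correlation for "rank-order", and averaging per-question correlations in Fisher-z space are all inferred from the short descriptions here, and every function name is hypothetical.

```python
import math
from statistics import mean, pvariance


def accuracy(twin, human, scale_min, scale_max):
    """ASSUMED form: 1 minus the absolute gap as a share of the scale range."""
    return 1.0 - abs(twin - human) / (scale_max - scale_min)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either side is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(twin, human):
    """Rank-order correlation across respondents for one question."""
    return pearson(average_ranks(twin), average_ranks(human))


def fisher_z_mean(correlations):
    """Average correlations in Fisher-z space, then map back to r."""
    # Clip so that atanh never receives exactly +1 or -1.
    z = [math.atanh(max(-0.999999, min(0.999999, r))) for r in correlations]
    return math.tanh(mean(z))


def dispersion_ratio(twin, human):
    """Twin-to-human variance ratio; values below 1 mean under-dispersion."""
    return pvariance(twin) / pvariance(human)
```

For example, on a 1 to 5 scale a twin answer of 4 against a human answer of 5 scores `accuracy(4, 5, 1, 5) == 0.75`, and twin answers `[3, 3, 4, 4]` against human answers `[1, 3, 5, 5]` give a dispersion ratio below 1.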

Implications for AI Economics

  • Operational feasibility: Firms can leverage existing panel/CRM/loyalty data to generate individual-level synthetic respondents without bespoke twin-focused data collection, lowering marginal research costs and enabling faster, scalable market research.
  • Cost–performance trade-off: The 75% entropy quartile is an actionable context-budgeting rule — most predictive signal comes before the final 25% of items, so practitioners can balance token/compute costs against marginal gains by stopping near that Pareto point.
  • Design levers matter: Embedding strategy (dialog history > persona summary) and reasoning prompting (better rank-order reproduction) are inexpensive, high-impact levers; model selection remains a primary driver of performance and should be treated as a procurement decision in research pipelines.
  • Aggregation and inference: Individual twins can be aggregated post hoc to arbitrary segments (an advantage over coarse persona bots), but researchers must be aware that individual-level distortions (e.g., shrinkage, reduced variance, stereotyping) will propagate to population-level inferences unless calibrated.
  • Calibration & validation: Regular held-out evaluation against human responses and calibration (e.g., reweighting or variance-correction) are necessary to mitigate biases and distributional mismatch before relying on synthetic panels for policy or economic inference.
  • Market and policy effects: Scalable synthetic respondents could reduce demand for costly large-sample surveys, change pricing and labor for market-research services, and raise regulatory questions about data provenance, respondent privacy, and contamination of human survey ecosystems.
  • Research agenda: Further work should map (i) how bias patterns vary across domains and demographics in real-world panels, (ii) methods to correct under-dispersion and stereotyping, and (iii) external validity of twin-aggregated causal estimates (treatment-effect replication).
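
One simple form the variance correction mentioned under calibration could take is a linear rescaling of twin answers around their own mean. The sketch below is purely illustrative and is not described in the paper: the function name, the use of a human calibration sample, and the clipping step are assumptions.

```python
from statistics import mean, pstdev


def variance_correct(twin, human_calibration, scale_min, scale_max):
    """Stretch twin answers so their spread matches a human calibration sample.

    ASSUMPTION: a plain linear rescale around the twin mean. The mean is
    left untouched, so any shift in level needs a separate correction.
    """
    t_mean, t_sd = mean(twin), pstdev(twin)
    if t_sd == 0:
        return list(twin)  # no spread to rescale
    factor = pstdev(human_calibration) / t_sd
    stretched = [t_mean + (x - t_mean) * factor for x in twin]
    # Clip to the item's scale; clipping can shrink the variance again.
    return [min(scale_max, max(scale_min, v)) for v in stretched]


# Toy data: twin answers bunch around 5, human answers spread more widely.
corrected = variance_correct([4, 5, 6], [2, 5, 8], 0, 10)
```

A rescale like this fixes only the overall spread. It does not address stereotyping or the ordering of individuals, so held-out validation against human responses is still needed after applying it.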

Limitations and cautions: The paper focuses on open-weight models and a German panel; results may vary with proprietary LLMs, other countries/datasets, or different item banks. Known twin failure modes persist and require mitigation before deployment in high-stakes economic decision-making.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper reports extensive out-of-sample evaluation (500 participants, 183 held-out questions, >2.1M responses) that supports claims about twin predictive performance, but it does not establish causal effects on economic outcomes (e.g., firm productivity, sales) and is limited to prediction accuracy on a single survey panel.

Methods Rigor: high. Systematic factorial evaluation across model, information-depth, embedding, and reasoning-mode dimensions; large number of response draws; held-out evaluation set and multiple metrics (accuracy, Fisher-z correlation); clear reporting of diminishing returns and Pareto points.

Sample: German Socio-Economic Panel (SOEP) data used to construct individual-level digital twins for 500 participants; evaluation on 183 held-out survey questions; experiments run across three open-weight LLMs, five cumulative information-depth levels (ranked by normalized Shannon entropy), two embedding methods (narrative persona summary vs. raw dialog history), and two reasoning modes, producing ~2.1 million twin responses.

Themes: productivity, human_ai_collab, adoption

Generalizability:

  • Single-country sample (Germany) from SOEP; may not generalize to other countries or cultural contexts.
  • Panel survey responses differ from real-world CRM/transactional behaviors that firms collect, limiting external validity for purchase modeling.
  • Relatively small participant set (500) for cross-population heterogeneity analyses.
  • Only three open-weight LLMs tested; results may change with other models (closed weights, larger/smaller architectures).
  • Evaluation focuses on survey question prediction, not downstream economic or managerial outcomes (e.g., purchase, retention, ROI).
  • Held-out questions are from the same instrument; limited evidence on temporal transfer or out-of-domain prompts.
  • Embedding and prompt engineering choices tested are not exhaustive, and performance may vary with prompt shifts.

Claims (11)

  • Most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts.
    • [Other] Direction: mixed · Confidence: high
    • Outcome: types of published digital twins (coarse persona bots vs. detailed individual-level twins)
    • Details: 0.18
  • Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys.
    • [Other] Direction: negative · Confidence: high
    • Outcome: applicability of existing twin construction approaches to pre-existing heterogeneous panel data
    • Details: 0.18
  • We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a 3 × 5 × 2 × 2 construction-method grid.
    • [Other] Direction: null_result · Confidence: high
    • Outcome: feasibility of constructing and evaluating detailed individual-level twins from SOEP
    • Details: n=500 · 0.3
  • The construction-method grid covers three open-weight LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes.
    • [Other] Direction: null_result · Confidence: high
    • Outcome: experimental factorization of model types, information depths, embedding methods, and reasoning modes
    • Details: 0.3
  • We scored over 2.1 million twin responses on 500 participants and 183 held-out questions.
    • [Other] Direction: null_result · Confidence: high
    • Outcome: number of evaluated twin responses / evaluation scale
    • Details: n=500 · 2.1 million twin responses; 183 held-out questions · 0.3
  • Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells.
    • [Output Quality] Direction: positive · Confidence: high
    • Outcome: twin quality (hold-out performance / accuracy) as a function of information depth
    • Details: n=500 · 0.3
  • Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth.
    • [Output Quality] Direction: positive · Confidence: high
    • Outcome: hold-out accuracy as a function of embedding method (narrative persona summary vs. raw dialog history)
    • Details: n=500 · 0.3
  • An explicit thinking mode raises rank-order correlation without moving accuracy.
    • [Output Quality] Direction: mixed · Confidence: high
    • Outcome: rank-order correlation (and accuracy) under explicit thinking mode vs. other reasoning modes
    • Details: n=500 · 0.3
  • Best-cell accuracy reaches 78.8% on the SOEP held-out evaluation set.
    • [Output Quality] Direction: positive · Confidence: high
    • Outcome: hold-out accuracy (best-performing cell)
    • Details: n=500 · 78.8% · 0.3
  • Best-cell Fisher-z rank-order correlation reaches r = 0.590 on the SOEP held-out evaluation set.
    • [Output Quality] Direction: positive · Confidence: high
    • Outcome: rank-order correlation (Fisher-z) of twin responses vs. held-out answers
    • Details: n=500 · r = 0.590 · 0.3
  • The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions.
    • [Adoption Rate] Direction: positive · Confidence: high
    • Outcome: primary constraints on successful twin-based market research (data design vs. item volume/model selection/construction choices)
    • Details: 0.18
