LLM-based ‘digital twins’ built from routine panel data can predict individual survey responses with up to 78.8% accuracy in the best configuration. Raw dialog-history embeddings and greater information depth raise performance, but gains taper past the 75% entropy quartile.
LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a $3 \times 5 \times 2 \times 2$ construction-method grid that covers three open-weights LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes, scoring over 2.1 million twin responses on 500 participants and 183 held-out questions. Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth, while an explicit thinking mode raises rank-order correlation without moving accuracy. Best-cell accuracy reaches 78.8 percent and Fisher-$z$ correlation reaches $r = 0.590$ on the SOEP held-out evaluation set. The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions that this paper now maps.
Summary
Main Finding
LLM-based individual-level digital twins can be constructed from pre-existing heterogeneous panel data (the SOEP) and perform competitively with twins built on purpose-collected instruments. Twin quality rises with added person-level information (measured by normalized Shannon entropy) but shows diminishing returns past the 75% entropy quartile, which serves as a practical Pareto point. Embedding past responses as a raw dialog history (vs. a narrative persona summary) raises hold-out accuracy in every model-by-reasoning cell at 100% depth, and prompting the model to "think" (explicit reasoning) improves rank-order correlation without changing accuracy. Best results: 78.8% accuracy and a Fisher-z rank-order correlation of r = 0.590 in the best-performing cells on the SOEP held-out set.
Key Points
- Data substrate: Pre-existing heterogeneous panel data (SOEP) — similar in shape to CRM/loyalty/repeat-survey collections firms hold — is sufficient to build detailed individual twins.
- Experimental grid: Authors evaluated a 3 × 5 × 2 × 2 construction-method space (three open-weight LLMs, five cumulative information depths by normalized Shannon entropy, two embedding methods, two reasoning modes) → 60 cells.
- Scale: >2.1 million twin responses scored across 500 participants and 183 held-out questions.
- Information depth: Accuracy and rank-order correlation rise with included information on a concave curve; the 75% entropy quartile captures most signal with substantially lower prompt token cost than 100%.
- Embeddings: Using a raw dialog history of an individual’s past responses outperforms a condensed narrative persona summary on hold-out accuracy (especially at 100% depth) across models and reasoning modes.
- Reasoning mode: Explicit chain-of-thought / thinking prompts raised rank-order correlation (better at reproducing who ranks higher than whom) but did not materially change per-item accuracy.
- Model differences: Among the open-weight models tested (including Qwen 3 and Gemma 4), Gemma 4 attained the single best accuracy cell (78.8%) while Qwen 3 achieved the highest rank-order correlation (r = 0.590); model choice materially affects metrics.
- Failure modes and caveats: Results sit in the same order of magnitude as top twins trained on bespoke data, but documented twin biases remain relevant (shrinkage toward base-model priors, reduced response variance, potential demographic stereotyping and alignment-induced drift).
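The entropy-ranked information depths mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes entropy is normalized by the log of the number of response options, that items are added in descending entropy order, and that a depth of 75% means "items covering 75% of total entropy". The function names and the toy item names are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(responses, n_options):
    """Shannon entropy of observed answers, scaled to [0, 1] by log(n_options).

    Assumption: normalization by the number of response options on the item.
    """
    if n_options < 2 or not responses:
        return 0.0
    total = len(responses)
    counts = Counter(responses)
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return abs(h) / math.log(n_options)


def items_for_depth(item_entropy, fraction):
    """Return items, most informative first, until `fraction` of total entropy is covered.

    Assumption: depth is defined on cumulative entropy share, not on item count.
    """
    ranked = sorted(item_entropy.items(), key=lambda kv: kv[1], reverse=True)
    target = fraction * sum(item_entropy.values())
    chosen, running = [], 0.0
    for name, h in ranked:
        if running >= target - 1e-12:
            break
        chosen.append(name)
        running += h
    return chosen


# Toy example with hypothetical item names.
entropy_by_item = {"q1": 1.0, "q2": 0.5, "q3": 0.5}
depth_75 = items_for_depth(entropy_by_item, 0.75)
depth_100 = items_for_depth(entropy_by_item, 1.0)
```

Under these assumptions, the 75% depth drops the least informative items first, which is where the prompt-token savings would come from.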
Data & Methods
- Source data: German Socio-Economic Panel (SOEP) Core v40 EU Edition — long-running, representative longitudinal household panel with many repeat items per individual across commercially relevant domains.
- Sample & evaluation: 500 participants sampled; 183 held-out questions used for evaluation; construction conditioned on the remaining items.
- Construction grid:
  - Models: three open-weight LLMs (explicitly named top performers: Qwen 3 and Gemma 4).
  - Information depth: five cumulative depths ranked by normalized Shannon entropy (cumulative inclusion of items ordered by informativeness); analysis highlights 75% and 100% points.
  - Embeddings: (1) narrative persona summary vs (2) raw dialog history of past responses.
  - Reasoning modes: (1) explicit thinking / chain-of-thought vs (2) standard direct-answer mode.
- Scale of experiments: 3 × 5 × 2 × 2 = 60 cells; total >2.1M generated answers scored.
- Evaluation metrics:
  - Accuracy: normalized closeness of twin answer to human answer on item’s natural scale (1 = perfect match).
  - Rank-order correlation: per-question cross-respondent correlation (how well twins reproduce who scores higher than whom).
  - Dispersion ratio: twin-to-human response variance ratio (measures over/under-dispersion of twin responses).
- Implementation choices: Open-weight models used for reproducibility and deployability within regulated or license-bound settings; prompt token costs and context budgeting explicitly considered.
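The three evaluation metrics listed above can be illustrated with a short sketch. The exact formulas are not given in this summary, so the following are assumptions: accuracy is taken as one minus the absolute error divided by the scale range, rank-order correlation as Spearman's rho per question, aggregation across questions as a Fisher-z mean, and dispersion as a ratio of population variances. All function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def accuracy(twin, human, lo, hi):
    """Assumed form: 1 - |twin - human| / scale range, so 1.0 is a perfect match."""
    return 1.0 - abs(twin - human) / (hi - lo)


def _ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = _ranks(x), _ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def fisher_z_mean(rs):
    """Average correlations in Fisher-z space; each r must lie strictly in (-1, 1)."""
    return math.tanh(mean(math.atanh(r) for r in rs))


def dispersion_ratio(twin, human):
    """Twin-to-human variance ratio; values below 1 indicate under-dispersion."""
    return pvariance(twin) / pvariance(human)


# Toy data for one question on a 1-5 scale (made-up values, not from the paper).
twin_answers = [2, 3, 3, 4]
human_answers = [1, 3, 3, 5]
ratio = dispersion_ratio(twin_answers, human_answers)
```

In the toy data the twins preserve the ordering of respondents but compress the spread, which is the reduced-variance pattern the summary lists among known twin biases.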
Implications for AI Economics
- Operational feasibility: Firms can leverage existing panel/CRM/loyalty data to generate individual-level synthetic respondents without bespoke twin-focused data collection, lowering marginal research costs and enabling faster, scalable market research.
- Cost–performance trade-off: The 75% entropy quartile is an actionable context-budgeting rule — most predictive signal comes before the final 25% of items, so practitioners can balance token/compute costs against marginal gains by stopping near that Pareto point.
- Design levers matter: Embedding strategy (dialog history > persona summary) and reasoning prompting (better rank-order reproduction) are inexpensive, high-impact levers; model selection remains a primary driver of performance and should be treated as a procurement decision in research pipelines.
- Aggregation and inference: Individual twins can be aggregated post hoc to arbitrary segments (an advantage over coarse persona bots), but researchers must be aware that individual-level distortions (e.g., shrinkage, reduced variance, stereotyping) will propagate to population-level inferences unless calibrated.
- Calibration & validation: Regular held-out evaluation against human responses and calibration (e.g., reweighting or variance-correction) are necessary to mitigate biases and distributional mismatch before relying on synthetic panels for policy or economic inference.
- Market and policy effects: Scalable synthetic respondents could reduce demand for costly large-sample surveys, change pricing and labor for market-research services, and raise regulatory questions about data provenance, respondent privacy, and contamination of human survey ecosystems.
- Research agenda: Further work should map (i) how bias patterns vary across domains and demographics in real-world panels, (ii) methods to correct under-dispersion and stereotyping, and (iii) external validity of twin-aggregated causal estimates (treatment-effect replication).
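As one illustration of the variance-correction idea raised in the calibration point above, the sketch below rescales twin responses around their mean so that their spread matches a target standard deviation. This is a generic linear rescaling, not a method described in the paper; the target would have to come from a human calibration sample, and `variance_correct` and its parameters are hypothetical names.

```python
from statistics import mean, pstdev


def variance_correct(twin, target_sd, lo, hi):
    """Stretch twin answers around their mean to reach target_sd, clipped to [lo, hi].

    Assumption: a simple linear rescaling is adequate; clipping at the scale
    bounds can leave the corrected spread slightly below the target.
    """
    m = mean(twin)
    sd = pstdev(twin)
    if sd == 0:
        # A constant response vector carries no spread to rescale.
        return list(twin)
    scale = target_sd / sd
    return [min(hi, max(lo, m + (x - m) * scale)) for x in twin]


# Toy example on a 1-5 scale: double the spread of under-dispersed twin answers.
corrected = variance_correct([2, 3, 3, 4], target_sd=2 ** 0.5, lo=1, hi=5)
```

A correction of this kind only addresses spread; it does not fix shrinkage toward base-model priors or stereotyping, which would need separate checks against held-out human responses.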
Limitations and cautions: The paper focuses on open-weight models and a German panel; results may vary with proprietary LLMs, other countries/datasets, or different item banks. Known twin failure modes persist and require mitigation before deployment in high-stakes economic decision-making.
Assessment
Claims (11)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Most published twins are either coarse persona bots conditioned on a few demographic questions or detailed individual-level twins built on purpose-collected surveys and interview transcripts. | Other | mixed | high | types of published digital twins (coarse persona bots vs. detailed individual-level twins) | 0.18 |
| Neither setup speaks to the operationally most relevant case for marketing practice: building detailed individual twins from the pre-existing heterogeneous panel data that firms already accumulate through CRM systems, loyalty programs, and repeat surveys. | Other | negative | high | applicability of existing twin construction approaches to pre-existing heterogeneous panel data | 0.18 |
| We construct detailed individual-level twins from the German Socio-Economic Panel (SOEP) and evaluate them across a 3 × 5 × 2 × 2 construction-method grid. | Other | null_result | high | feasibility of constructing and evaluating detailed individual-level twins from SOEP | n=500; 0.3 |
| The construction-method grid covers three open-weight LLMs, five cumulative information depths ranked by normalized Shannon entropy, two embedding methods, and two reasoning modes. | Other | null_result | high | experimental factorization of model types, information depths, embedding methods, and reasoning modes | 0.3 |
| We scored over 2.1 million twin responses on 500 participants and 183 held-out questions. | Other | null_result | high | number of evaluated twin responses / evaluation scale | n=500; 2.1 million twin responses; 183 held-out questions; 0.3 |
| Twin quality rises with information depth but with diminishing returns past the 75 percent entropy quartile, which acts as a cost-efficient Pareto point relative to the best-performing 100 percent cells. | Output Quality | positive | high | twin quality (hold-out performance / accuracy) as a function of information depth | n=500; 0.3 |
| Switching the embedding from a narrative persona summary to a raw dialog history of past responses raises hold-out accuracy in every model-by-reasoning cell at the 100 percent depth. | Output Quality | positive | high | hold-out accuracy as a function of embedding method (narrative persona summary vs. raw dialog history) | n=500; 0.3 |
| An explicit thinking mode raises rank-order correlation without moving accuracy. | Output Quality | mixed | high | rank-order correlation (and accuracy) under explicit thinking mode vs. other reasoning modes | n=500; 0.3 |
| Best-cell accuracy reaches 78.8% on the SOEP held-out evaluation set. | Output Quality | positive | high | hold-out accuracy (best-performing cell) | n=500; 78.8%; 0.3 |
| Best-cell Fisher-z rank-order correlation reaches r = 0.590 on the SOEP held-out evaluation set. | Output Quality | positive | high | rank-order correlation (Fisher-z) of twin responses vs. held-out answers | n=500; r = 0.590; 0.3 |
| The findings suggest that twin-based market research is no longer gated by data design, but by item volume, model selection, and a small set of construction-level decisions. | Adoption Rate | positive | high | primary constraints on successful twin-based market research (data design vs. item volume/model selection/construction choices) | 0.18 |