Detailed life-history prompts make LLM-based resident simulations noticeably more faithful, and curriculum-LoRA cuts personalization cost by about 10x while matching top-tier fidelity, bringing practical in-silico policy testing within reach for cash-strapped local governments.

Benchmarking LLMs for Community Governance Simulation with Life-history Narratives

Xu Chen, Yuanzi Li, Lei Wang, Nan Lu, Yang Wang, Anding Wang, Lei Shi, Xiaoxing Fu, Ji-Rong Wen · May 22, 2026

arXiv · descriptive · medium evidence · 7/10 relevance

Augmenting LLM prompts with rich, interview-derived life-history profiles substantially improves simulated resident fidelity, and curriculum-LoRA attains comparable fidelity to the best baselines at roughly one-tenth the per-call cost, enabling practical in-silico governance testing for resource-constrained localities.

Effective community governance hinges on understanding what specific residents think and need. Recent work has used large language models (LLMs) to simulate human respondents, offering a scalable, reproducible way to study human attitudes and behaviors at low cost. However, these studies typically prompt the model with just a few demographic variables (age, gender, income), simulating only general role types. This is insufficient for community governance, where decisions depend on the views of specific residents. We bridge this gap with an integrated research framework covering dataset, benchmark, algorithm, and system. The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains. The benchmark probes 18 mainstream LLMs across four prompting strategies and shows that adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline, but this gain comes with more input tokens per call from the longer prompts they require. The algorithm, curriculum-LoRA, is a parameter-efficient personalization framework that, by closing this fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost and Pareto-dominates every configuration tested. The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline. Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.

Summary

Main Finding

Conditioning LLMs on rich, interview-derived life-history narratives materially improves fidelity for individual-level resident simulation in community governance, but pure prompting creates an unfavorable accuracy–cost trade-off. Curriculum-LoRA—a parameter-efficient personalization method that combines LoRA fine-tuning with online random sampling of reference QA and a curriculum that gradually increases reference context—matches or exceeds top prompting fidelity while reducing per-call inference cost by roughly an order of magnitude, and Pareto-dominates tested configurations when jointly considering accuracy and monetary cost.

Key Points

Dataset scale and novelty
- 92 long-term residents interviewed ~2 hours each; ~1.2 million characters of first-person life-history narrative (mean ≈13k characters per resident).
- Structured 50-item policy-attitude instrument covering 9 governance domains (e.g., elevator funding, parking, civic engagement).
- For benchmarking, 60 residents with information-rich responses produced 2,291 valid (resident, question, answer) triples.
Benchmarking results
- Evaluated 18 mainstream LLMs (open-source families and proprietary frontier models) × 4 prompting strategies (No Prompt, Only Life-history, Only Few-shot, Life-history & Few-shot) and swept few-shot sizes (0–10).
- Conditioning on life-history raised accuracy by ~5.6 percentage points on average without few-shot and by ~3.1 pp with few-shot; the best prompting configuration plateaus at ~50% exact-match accuracy.
- In-context few-shot delivers diminishing returns: largest jump 0→2 shots; marginal gains shrink beyond that.
- Layer-wise probing (on an open-weight model) identifies an “attitudinal inference window” where life-history content is transformed into preference features (onset ~layer 13, peak ~layer 46).
- Pure prompting with long life-histories is substantially more expensive per call due to long prompts (token-driven cost).
Curriculum-LoRA algorithm
- Built on a 7B-parameter base model; uses LoRA-style parameter-efficient adapters plus:
  - Online random sampling of reference QA from the same resident to exploit cross-question consistency.
  - A curriculum that gradually increases the number of reference questions during training so the model first learns basic persona→answer mapping, then richer cross-question context.
- Results: matches or slightly exceeds the strongest prompting baseline's fidelity (51.6% vs GLM-5’s 49.7%) at a much lower per-call inference cost (CNY 0.41 vs CNY 6.54 for GLM-5, about a 16x reduction for this pair; larger gaps vs GPT models).
- Generalizes to unseen residents and unseen governance domains in tests reported.
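The two training-time ingredients listed above (online random sampling of same-resident reference QA, and a curriculum that grows the number of references) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear per-epoch schedule, the function names `refs_for_epoch` and `sample_training_example`, and the placeholder QA data are all assumptions, and the LoRA fine-tuning step itself is omitted.

```python
import random

def refs_for_epoch(epoch, total_epochs, max_refs):
    """Assumed linear curriculum: 0 references at the first epoch, max_refs at the last."""
    if total_epochs <= 1:
        return max_refs
    return (max_refs * epoch) // (total_epochs - 1)

def sample_training_example(resident_qa, target_idx, k, rng):
    """Draw k reference QA pairs from the same resident, never including the target itself."""
    pool = [qa for i, qa in enumerate(resident_qa) if i != target_idx]
    refs = rng.sample(pool, min(k, len(pool)))
    question, answer = resident_qa[target_idx]
    return {"references": refs, "question": question, "answer": answer}

# Placeholder data for one hypothetical resident (not from the dataset).
resident_qa = [(f"question {i}", f"answer {i}") for i in range(10)]
rng = random.Random(0)

for epoch in range(5):
    k = refs_for_epoch(epoch, total_epochs=5, max_refs=8)
    example = sample_training_example(resident_qa, target_idx=epoch, k=k, rng=rng)
    print(epoch, k, len(example["references"]))
```

Because references are resampled every time an example is drawn, the model sees a different slice of the resident's other answers on each pass, which is one plausible way to exploit cross-question consistency without fixing a single context per question.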
System integration
- The calibrated model is deployed in a closed-loop policy-attitude simulation pipeline enabling policy-probe design, resident-attitude simulation, result analysis, and iterative refinement—bringing in-silico pre-evaluation of community policies within reach for resource-constrained local administrations.

Data & Methods

Data collection
- Stratified sampling for demographic balance (age, gender, education); semi-structured interviews ~2 hours each with trained fieldworkers; transcription, de-identification, cleaning.
- Life-history profiles divided into four blocks: P1 (personal/growth), P2 (education/work/migration), P3 (family/care), P4 (community interaction/values).
- Policy-attitude responses captured via a 50-item instrument across 9 domains; items include factual and normative questions.
Benchmark construction
- From 92 participants, selected 60 with complete, unambiguous responses for quantitative benchmarking (2,291 valid triples).
- Held-out evaluation questions per resident; reference QA exposed to model as in-context conditioning; evaluation by exact-match scoring.
- Exhaustive subset sweeps over life-history blocks (16 subsets).
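The count of 16 follows from the four profile blocks: every block is either included or left out, giving 2^4 combinations. The short sketch below enumerates them, assuming the 16 subsets include the empty set (the no-profile condition); the name `block_subsets` is illustrative.

```python
from itertools import combinations

BLOCKS = ("P1", "P2", "P3", "P4")  # block labels from the data-collection description

def block_subsets(blocks=BLOCKS):
    """Return every subset of the blocks, from the empty set up to the full profile."""
    subsets = []
    for size in range(len(blocks) + 1):
        subsets.extend(combinations(blocks, size))
    return subsets

subsets = block_subsets()
print(len(subsets))  # 2**4 = 16
for subset in subsets:
    print(subset or "(no life-history)")
```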
Models and prompting
- 18 LLMs: open-source (e.g., Qwen, GLM, Kimi) and proprietary frontier models (GPT family, Gemini, Grok).
- Prompting strategies: No Prompt; Only Life-history (long persona text); Only Few-shot (reference QA examples); Life-history & Few-shot.
- Shot counts swept 0–10.
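The four strategies differ only in which pieces of context are placed in front of the target question. A minimal sketch of that assembly is shown below; the prompt wording, the `build_prompt` helper, and the sample questions and narrative are hypothetical placeholders, not the templates used in the benchmark.

```python
def build_prompt(question, life_history=None, reference_qa=(), n_shots=0):
    """Assemble a prompt; which optional parts are present determines the strategy."""
    parts = []
    if life_history:
        parts.append("Resident life history:\n" + life_history)
    for ref_q, ref_a in list(reference_qa)[:n_shots]:
        parts.append(f"Q: {ref_q}\nA: {ref_a}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Placeholder inputs for illustration only.
narrative = "(placeholder first-person narrative)"
refs = [("Should the building add an elevator?", "Yes"),
        ("Is parking adequate?", "No")]
target = "Would you join a residents' committee?"

p_none = build_prompt(target)                                   # No Prompt
p_life = build_prompt(target, life_history=narrative)           # Only Life-history
p_shot = build_prompt(target, reference_qa=refs, n_shots=2)     # Only Few-shot
p_both = build_prompt(target, narrative, refs, n_shots=2)       # Life-history & Few-shot
print(p_both)
```

Prompt length grows with both the narrative and the shot count, which is where the token-driven cost of the richer strategies comes from.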
Algorithmic innovation
- Curriculum-LoRA: parameter-efficient personalization leveraging LoRA adapters, online random sampling across a resident’s reference QA, and a curriculum schedule for increasing context size during fine-tuning.
- Evaluation metrics: exact-match accuracy for held-out questions; monetary per-call cost computed under token/inference pricing (reported in CNY).
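The two metrics and the Pareto comparison can be made concrete with a few lines of code. The sketch below uses the accuracy and cost figures quoted in this summary for curriculum-LoRA and GLM-5; the whitespace trimming in the exact-match check and the function names are assumptions.

```python
def exact_match_accuracy(predictions, gold):
    """Fraction of predictions identical to the gold answer after trimming whitespace."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return hits / len(gold)

def dominates(a, b):
    """a and b are (accuracy, cost_per_call). a Pareto-dominates b if it is
    no worse on both axes and strictly better on at least one."""
    acc_a, cost_a = a
    acc_b, cost_b = b
    no_worse = acc_a >= acc_b and cost_a <= cost_b
    strictly_better = acc_a > acc_b or cost_a < cost_b
    return no_worse and strictly_better

# (accuracy, CNY per call) as quoted in the summary.
curriculum_lora = (0.516, 0.41)
glm5_prompting = (0.497, 6.54)

cost_ratio = glm5_prompting[1] / curriculum_lora[1]  # about 16x for this pair
print(dominates(curriculum_lora, glm5_prompting), round(cost_ratio, 1))
```

Exact match gives no partial credit, so a prediction that is close in meaning but worded differently counts as a miss.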
Mechanistic probing
- Layer-wise representation probing via linear classifiers across transformer layers to locate where attitudinal signals emerge from life-history input.
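The idea behind layer-wise probing is to fit one simple classifier per layer on that layer's hidden representations and see at which depth held-out accuracy rises above chance. The sketch below illustrates this on toy two-dimensional features. It is an assumption-laden stand-in: it swaps in a nearest-class-mean probe (whose decision boundary is linear) for whatever linear classifier the paper trained, and the feature values are invented for illustration rather than extracted from any model.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit_probe(features, labels):
    """Nearest-class-mean probe for binary labels, expressed as weights w and bias b."""
    pos = centroid([x for x, y in zip(features, labels) if y == 1])
    neg = centroid([x for x, y in zip(features, labels) if y == 0])
    w = [p - n for p, n in zip(pos, neg)]
    mid = [(p + n) / 2 for p, n in zip(pos, neg)]
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, b

def probe_accuracy(probe, features, labels):
    """Held-out accuracy of the linear rule sign(w.x + b)."""
    w, b = probe
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
             for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def layerwise_accuracy(train, test):
    """train/test map layer index -> (features, labels); returns layer -> accuracy."""
    return {layer: probe_accuracy(fit_probe(*train[layer]), *test[layer])
            for layer in train}

# Toy data: layer 0 carries no label signal, layer 1 separates the classes.
train = {
    0: ([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0, 0, 1, 1]),
    1: ([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]], [0, 0, 1, 1]),
}
test = {
    0: ([[0.0, 1.0], [1.0, 0.0]], [0, 1]),
    1: ([[0.1, 0.0], [1.0, 0.9]], [0, 1]),
}
accuracy_by_layer = layerwise_accuracy(train, test)
print(accuracy_by_layer)
```

In a real probe the features would be hidden states captured at each transformer layer for the life-history prompt, and the labels would be the resident's answers; the layer where accuracy first rises and the layer where it peaks correspond to the onset and peak of the window described above.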

Implications for AI Economics

Accuracy–cost trade-offs shape adoption
- The study quantifies a practical trade-off: long life-history prompts improve fidelity but drive up per-call costs (token fees, latency). Parameter-efficient personalization (curriculum-LoRA) can shift this frontier, making individualized simulation economically feasible for low-budget public actors.
- Implication: pricing structures (per-token vs model-subscription) and availability of small, fine-tunable base models will materially affect which organizations can adopt individualized LLM simulations.
Market demand and model design
- Suggests demand for smaller, tunable models and parameter-efficient personalization tools, not only frontier models. Vendors may find market opportunities in providing:
  - affordable fine-tunable backbones, LoRA/adapter toolchains, or managed personalization-as-a-service targeted at governments and NGOs.
  - pricing tiers that reflect value of personalization (not just raw capability).
Public-sector cost savings and welfare
- In-silico policy pretesting reduces the need for expensive large-scale surveys and repeated field experiments—potentially lowering the marginal cost of policy iteration and increasing responsiveness in local governance. This can improve allocative efficiency of public services and reduce policy implementation failures.
- But welfare gains depend on fidelity and external validity; overconfidence in simulated residents risks misallocation if models propagate biases.
Externalities and distributional concerns
- If large models remain costly, wealthier municipalities or private actors could gain disproportionate ability to pre-evaluate and optimize policies, potentially widening governance quality gaps.
- Open-source, low-cost personalization could democratize access; policy/regulatory choices (subsidies, public-model provisioning) will affect distributional outcomes.
Value of data and privacy economics
- High-value life-history data are costly to collect and sensitive. The paper highlights that richer individual data improves fidelity—this raises trade-offs between data collection costs, privacy risks, and downstream utility. Economies of scale in collecting such data for many residents versus targeted sampling should be considered.
- Incentive design for truthful and consented data collection, and governance of model outputs (transparency, accountability), have economic implications for adoption.
Research and policy priorities
- Need to assess external validity and general equilibrium effects: how well these calibrated simulations predict behavior in different communities or under large-scale policy changes.
- Consider regulation on use of synthetic residents for policy decisions (auditability, recourse) and standards for public-sector procurement of model personalization services.

Caveats and limitations
- Benchmark limited geographically and socially to one urban community and to Chinese-language interviews/context; external validity across cultures and governance systems is untested.
- Evaluation uses exact-match scoring on structured QA, which may understate partial correctness or nuanced alignment.
- Ethical, privacy, and consent constraints around life-history data are substantial; deployment requires careful governance.
- Cost figures are specific to the paper’s pricing assumptions and local currency; relative savings are more informative than absolute numbers.

Suggested follow-ups for AI economics researchers
- Cost–benefit analyses comparing in-silico policy pretesting (with curriculum-LoRA) to conventional survey/field-experiment pipelines across different municipality sizes.
- Study of market impacts: pricing models for personalized LLM services and how subsidies/public-provisioning affect adoption and equity.
- External-validity tests across diverse communities and languages; welfare analyses accounting for model bias and decision risks.
- Work on governance frameworks and procurement models that balance utility, privacy, and accountability for public-sector adoption of personalized LLMs.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides systematic empirical evaluation (92 in-depth interviews; benchmark across 18 LLMs; ablations of prompting strategies) and a quantitative cost-fidelity comparison, so claims about model fidelity and cost-efficiency are directly supported by experiments. However, evidence is limited to a single urban community, relies on fidelity metrics (which may be subjective), and does not demonstrate that simulated responses produce accurate downstream policy impacts in the real world.

Methods Rigor: medium. Methodological strengths include a sizable qualitative dataset (two-hour semi-structured interviews per respondent), structured benchmarks across many models and prompting strategies, and a parameter-efficient personalization approach (curriculum-LoRA) with measured cost metrics; weaknesses include a relatively small and location-specific sample (92 residents), potential subjectivity in fidelity measurement, and limited external validation of simulated-to-real behavioral correspondence.

Sample: 92 residents of a single urban community, each interviewed for approximately two hours in semi-structured first-person narrative format (total ~1.2 million characters), organized around nine community-governance domains; used both as rich life-history profiles for prompting/personalization and as ground-truth references for fidelity evaluation.

Themes: governance, human_ai_collab

Generalizability
- Single urban community sample; may not represent other communities (rural, different cultures, countries).
- Small-N (92) limits demographic and behavioral heterogeneity coverage.
- Fidelity metrics and benchmarking may depend on annotator judgments and chosen domains.
- Results tied to contemporary LLMs and token/pricing regimes; model performance and costs will change over time.
- Simulated fidelity does not guarantee accurate prediction of real-world behavioral responses to policy interventions.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
The dataset comprises approximately 1.2 million characters of first-person narrative collected through two-hour semi-structured interviews with each of 92 residents in an urban community, organized around nine community-governance domains.	Research Productivity	positive	high	size and composition of dataset (characters of first-person narrative, number of interviews, interview duration, domain coverage)	n=92 1.2 million characters 0.3
The benchmark probes 18 mainstream LLMs across four prompting strategies.	Research Productivity	neutral	high	coverage of models and prompting strategies in benchmark (number of LLMs and prompting variants tested)	n=18 four prompting strategies 0.18
Adding rich life-history profiles meaningfully raises fidelity above the no-profile baseline.	Output Quality	positive	high	simulation fidelity (how well LLM outputs match expected resident responses)	0.18
The fidelity gain from richer profiles comes with more input tokens per call from the longer prompts they require (i.e., higher per-call input cost).	Organizational Efficiency	negative	high	per-call input token count (per-call cost proxy)	0.18
Curriculum-LoRA is a parameter-efficient personalization framework that, by closing the fidelity-cost gap, matches the strongest baseline's fidelity at roughly 10x lower per-call cost.	Organizational Efficiency	positive	high	tradeoff between simulation fidelity and per-call cost (input tokens / cost per call)	roughly 10x lower per-call cost 0.18
Curriculum-LoRA Pareto-dominates every configuration tested.	Organizational Efficiency	positive	high	Pareto frontier position with respect to fidelity and cost metrics	0.18
The system integrates curriculum-LoRA into a closed-loop policy-evaluation pipeline.	Decision Quality	positive	high	system integration of personalization algorithm into a policy-evaluation workflow	0.09
Together, these results bring individual-level LLM-based resident simulation within reach of resource-constrained local administrations, enabling community-governance decisions to be systematically pre-evaluated in silico before real-world deployment.	Decision Quality	positive	high	feasibility of in-silico pre-evaluation of community-governance decisions by resource-constrained local administrations	0.03