A multi‑agent LLM sandbox that learns cross‑category consumer preferences and uses mean‑field interactions improves simulated product choice and purchase forecasts, offering a promising platform for scalable market simulation—though its claims rest on simulated environments and depend heavily on the training transaction data.

MALLES: A Multi-agent LLMs-based Economic Sandbox with Consumer Preference Alignment

Yusen Wu, Yiran Liu, Xiaotie Deng · March 18, 2026

arXiv · descriptive · low evidence · 7/10 relevance

MALLES is a multi‑agent LLM economic sandbox that uses cross‑category preference learning, a mean‑field interaction mechanism, and a multi‑agent discussion architecture to improve simulated product choice accuracy, purchase quantity prediction, and stability relative to prior LLM simulation baselines.

In the real economy, modern decision-making is fundamentally challenged by high-dimensional, multimodal environments, which are further complicated by agent heterogeneity and combinatorial data sparsity. This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES), leveraging the inherent generalization capabilities of large-scale models to establish a unified simulation framework applicable to cross-domain and cross-category scenarios. Central to our approach is a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories. This methodology enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces. Furthermore, we propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information. This architecture distributes cognitive load to alleviate single-agent attention bottlenecks and captures critical decision factors through structured dialogue. Experiments demonstrate that our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines. Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and latter analysis in the real economy based on foundational database.

Summary

Main Finding

MALLES (Multi-Agent LLMs-based Economic Sandbox) shows that large language models, when economically aligned via post-training on heterogeneous transaction records and deployed inside a structured multi-agent architecture with a mean-field mechanism, can substantially mitigate category-level data sparsity and improve the fidelity of retail and wholesale decision simulations. The paper reports significant gains in product-selection accuracy, purchase-quantity prediction, and simulation stability over prior LLM-based economic simulators.

Key Points

Cross-category post-training: LLMs are post-trained on large, heterogeneous transaction logs so they internalize latent consumer preference patterns that transfer across product categories, which mitigates per-category data sparsity and improves out-of-distribution (OOD) generalization.
Multi-agent discussion: Long, high-dimensional product contexts are handled by a role-based multi-agent dialogue (e.g., dealer, service, manufacturer for wholesale). Agents distribute cognitive load, compress context into salient decision factors, and produce interpretable, structured decisions.
Retail vs wholesale flows: Retail agents emphasize need-satisfaction, price/discount sensitivity, and profile summaries; wholesale uses multi-round dialogues, numerical optimization, and symbolic regression to derive profit-oriented purchasing formulas.
Stabilization mechanisms:
- Mean-field alternation: a macro-level response variable $\mu_t$ is iteratively updated from agent outputs to stabilize population-level sampling and align simulated distributions with real ones.
- Consistency regularization and multi-sampling: generate multiple perturbed responses and penalize inconsistency ($L_{\text{cons}}$) to reduce output randomness.
- Attention control: an attention-matching loss ($L_{\text{attn}} = \mathbb{E}[\mathrm{KL}(A \,\|\, A^*)]$) enforces attention priors toward economic/numeric features.
- Calibration: post-process or reweight $P_{\text{sim}}$ to match $P_{\text{real}}$ by minimizing $D_{\mathrm{KL}}(P_{\text{real}} \,\|\, f(P_{\text{sim}}))$ or using Wasserstein distance; the mapping $f$ and weights $w(X,Z)$ are used to correct distributional gaps.
Interpretability: symbolic regression is embedded in the multi-agent loop to extract compact, human-readable decision formulas consistent with observed transaction data.
Hybrid cognitive components: optional cognitive core (e.g., ACT-R style) can generate candidate rule sets that the LLM evaluates, improving rule-based stability and interpretability.
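
The mean-field alternation listed under the stabilization mechanisms can be illustrated with a toy loop. This is a minimal sketch and not the paper's implementation: the logistic form of the agent response, the `coupling` strength, the damping factor `alpha`, and the names `agent_purchase_prob` and `mean_field_iterate` are all assumptions made for illustration. Each agent's purchase probability depends on its own preference and on the macro variable, and the macro variable is then updated from the average agent output until it stops changing.

```python
import math

def agent_purchase_prob(preference, mu, coupling=1.0):
    """Hypothetical agent response: logistic in own preference plus the macro signal mu."""
    return 1.0 / (1.0 + math.exp(-(preference + coupling * (mu - 0.5))))

def mean_field_iterate(preferences, mu0=0.5, alpha=0.5, tol=1e-8, max_iter=1000):
    """Alternate between agent responses (given mu) and a damped update of mu.

    Returns the final mu and the number of iterations performed.
    """
    mu = mu0
    for step in range(max_iter):
        # Micro step: every agent responds to the current macro variable.
        outputs = [agent_purchase_prob(p, mu) for p in preferences]
        mean_output = sum(outputs) / len(outputs)
        # Macro step: move mu part of the way toward the population average.
        new_mu = (1.0 - alpha) * mu + alpha * mean_output
        if abs(new_mu - mu) < tol:
            return new_mu, step + 1
        mu = new_mu
    return mu, max_iter

if __name__ == "__main__":
    prefs = [-1.0, -0.5, 0.0, 0.5, 1.0]  # toy population of five agents
    mu_star, steps = mean_field_iterate(prefs, mu0=0.0)
    print(f"mu settled at {mu_star:.4f} after {steps} iterations")
```

The damped update is one simple way to keep the alternation from oscillating; in this toy setting the response is a contraction in `mu`, so the loop settles on a single self-consistent value.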

Data & Methods

Data inputs:
- Transaction records (timestamped orders with customer IDs, product IDs, quantities, unit prices, discounts, channels, review scores).
- Product metadata and multimodal attributes (IDs, categories, base prices, attribute text, image embeddings).
- Dialogue logs (negotiations, agent roles, outcomes) for wholesale scenarios.
- Customer profiles (income brackets, buyer types, historical purchase summaries).
Problem setup and objective:
- True decision function $a_i = D(X^{\text{obs}}_i, X^{\text{hid}}_i, \rho_i)$; the simulator produces $\hat{a}_i = \hat{D}(X^{\text{obs}}_i, Z_i, \hat{\rho}_i; \theta)$, where $Z_i$ approximates hidden context and $\theta$ are model parameters.
- Training minimizes the expected loss $E = \mathbb{E}_i[\ell(a_i, \hat{a}_i)]$ (classification or continuous losses as appropriate).
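
The objective above can be made concrete with a small sketch of its empirical version. The details are assumptions, not taken from the paper: a decision is modeled as a (product, quantity) pair, the per-decision loss combines a 0-1 loss on the selected product with an absolute error on quantity, and the weight `lam` and the names `Decision`, `decision_loss`, and `empirical_loss` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    product_id: str   # which product was chosen
    quantity: float   # how many units were bought

def decision_loss(actual, predicted, lam=1.0):
    """Per-decision loss: 0-1 loss on selection plus lam times absolute quantity error."""
    selection_loss = 0.0 if actual.product_id == predicted.product_id else 1.0
    quantity_loss = abs(actual.quantity - predicted.quantity)
    return selection_loss + lam * quantity_loss

def empirical_loss(actuals, predictions, lam=1.0):
    """Average the per-decision loss over all observed decisions."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(decision_loss(a, p, lam) for a, p in zip(actuals, predictions))
    return total / len(actuals)

if __name__ == "__main__":
    observed = [Decision("A", 2), Decision("B", 1)]
    simulated = [Decision("A", 3), Decision("C", 1)]
    print(empirical_loss(observed, simulated))
```

Mixing a classification term and a continuous term in one loss mirrors the "classification or continuous losses as appropriate" wording; how the two are actually weighted in MALLES is not stated in this summary.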
Core methods:
- Post-training LLMs on pooled, cross-category transaction data to learn transferable preference representations and numerical sensitivity.
- A profile summarization module constructs $Z_i$, capturing long-term behavior, promotion sensitivity, brand affinity, etc.; $Z_i$ is used to approximate hidden variables.
- Attention priors and attention-matching regularizer to prioritize numeric/economic inputs.
- Multi-agent role-based dialogue with iterative rounds and final parsing functions $\phi_{\text{retail}}$ / $\phi_{\text{wholesale}}$ to extract structured decisions (selection $b_r$/$s_w$ and quantity $q_r$/$q_w$).
- Mean-field alternating updates to align micro decisions with macro distributions and reduce population-level drift.
- Consistency loss computed over small input perturbations to reduce stochasticity.
- Calibration step via mapping f and reweighting w(X,Z) using divergence metrics (KL, Wasserstein).
- Integration of symbolic regression for compact rule discovery and interpretability.
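
The calibration step in the list above can be sketched for the simplest case of a discrete distribution over product categories. This is an illustration under stated assumptions, not the paper's procedure: the distributions are estimated from counts with a small smoothing constant `eps`, the weights are plain density ratios that depend only on the category, and all function names are hypothetical. The mapping f and weights w(X,Z) in the paper may be considerably richer.

```python
import math
from collections import Counter

def empirical_dist(samples, support, eps=1e-9):
    """Smoothed empirical distribution over a fixed support."""
    counts = Counter(samples)
    total = len(samples) + eps * len(support)
    return {k: (counts.get(k, 0) + eps) / total for k in support}

def kl_divergence(p, q):
    """D_KL(p || q) for two distributions defined on the same support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def importance_weights(p_real, p_sim):
    """Density-ratio weights that pull simulated mass toward the real distribution."""
    return {k: p_real[k] / p_sim[k] for k in p_real}

def reweight(p_sim, weights):
    """Apply the weights to the simulated distribution and renormalize."""
    unnormalized = {k: p_sim[k] * weights[k] for k in p_sim}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}

if __name__ == "__main__":
    support = ["a", "b", "c"]
    real = ["a"] * 6 + ["b"] * 3 + ["c"] * 1   # toy observed purchases
    sim = ["a"] * 3 + ["b"] * 3 + ["c"] * 4    # toy simulated purchases
    p_real = empirical_dist(real, support)
    p_sim = empirical_dist(sim, support)
    calibrated = reweight(p_sim, importance_weights(p_real, p_sim))
    print("KL before:", kl_divergence(p_real, p_sim))
    print("KL after: ", kl_divergence(p_real, calibrated))
```

With exact density ratios the reweighted distribution reproduces the real one, so the divergence drops to roughly zero; in practice the ratios would be estimated on one split and checked on another.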
Baselines and evaluation:
- Compared against recent LLM-based economic simulators and financial LLM frameworks (e.g., EconAgent, ABIDES-Economist, FinCon and other cited LLM baselines).
- Metrics reported include product selection accuracy, purchase quantity prediction error, and simulation stability (population-level distributional alignment). The paper reports significant improvements; the exact datasets appear to be proprietary transaction data.
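
The three reported metric families can be sketched as follows. The concrete choices are assumptions: mean absolute error stands in for the quantity prediction error, and total variation distance between simulated and observed category shares stands in for population-level distributional alignment, since this summary does not give the paper's exact definitions. Function names are hypothetical.

```python
from collections import Counter

def selection_accuracy(actual_ids, predicted_ids):
    """Fraction of decisions where the simulated product matches the observed one."""
    matches = sum(a == p for a, p in zip(actual_ids, predicted_ids))
    return matches / len(actual_ids)

def quantity_mae(actual_qty, predicted_qty):
    """Mean absolute error between observed and simulated purchase quantities."""
    errors = [abs(a - p) for a, p in zip(actual_qty, predicted_qty)]
    return sum(errors) / len(errors)

def total_variation(actual_ids, predicted_ids):
    """Total variation distance between observed and simulated product shares."""
    real_counts, sim_counts = Counter(actual_ids), Counter(predicted_ids)
    n_real, n_sim = len(actual_ids), len(predicted_ids)
    support = set(real_counts) | set(sim_counts)
    return 0.5 * sum(
        abs(real_counts[k] / n_real - sim_counts[k] / n_sim) for k in support
    )

if __name__ == "__main__":
    actual = ["a", "a", "b", "c"]
    predicted = ["a", "b", "b", "c"]
    print(selection_accuracy(actual, predicted))
    print(quantity_mae([1, 2, 3, 4], [1, 3, 3, 2]))
    print(total_variation(actual, predicted))
```

Accuracy and MAE score individual decisions, while the distributional metric only compares population shares, so a simulator can score well on one and poorly on the other.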

Implications for AI Economics

Practical simulation for business decisions: MALLES provides a pathway to scalable, high-fidelity simulation of retail and wholesale decision-making that better handles long-tail categories and limited per-category data — useful for pricing, promotion design, inventory planning and procurement.
Better numerical sensitivity and multimodal alignment: the combination of post-training on transactional data plus attention priors addresses a common weakness of LLM agents (poor numeric sensitivity), making LLMs more actionable for economic policy and operational decisions.
Interpretability and theory discovery: embedding symbolic regression in multi-agent reasoning allows extraction of compact decision formulas, bridging black-box LLM outputs and economic theory / managerial rules.
Population-level consistency via mean-field methods: modeling interactions between micro agents and macro distributions is critical when using LLMs for population simulations; this approach reduces accumulation of micro errors into misleading macro patterns.
Research directions and needs:
- Benchmarking: standardized, open benchmarks are needed to validate OOD generalization and numerical sensitivity across realistic multimodal datasets.
- Data and reproducibility: the approach depends on rich transaction logs (often proprietary). Open datasets or synthetic population generators will be important for reproducible research.
- Computational and deployment trade-offs: multi-agent dialogues, multi-sampling, and mean-field iterations increase compute cost; research should explore efficient approximations and calibration schemes.
- Policy and fairness: simulation-driven decisions (pricing, inventory allocation) require careful auditing for fairness and economic externalities when models are trained on historical transactional data that may embed biases.

Overall, MALLES demonstrates a practical architecture for bringing LLMs closer to applied economic simulation by combining cross-category post-training, structured multi-agent reasoning, and stabilization/calibration layers to improve fidelity and interpretability in retail and wholesale settings.

Assessment

Paper Type: descriptive
Evidence Strength: low. Results are derived from simulation experiments using LLM-based agents and held-out transaction records rather than from real-world interventions or quasi-experimental variation; improvements are internal to the sandbox and do not establish causal effects in actual markets or firms.
Methods Rigor: medium. The paper proposes multiple concrete algorithmic components (post-training on heterogeneous transactions, a mean-field interaction mechanism, and a multi-agent discussion architecture) and reports comparative experiments against baselines, indicating technical sophistication; however, key methodological details that determine robustness (dataset size and representativeness, training/validation splits, hyperparameter choices, statistical significance, baseline implementations, and ablation studies) are not fully specified, and there is no external or field validation.
Sample: Experiments use a simulated economic sandbox populated by multiple LLM agents; models are post-trained on 'extensive, heterogeneous transaction records across diverse product categories' to learn transferable consumer preferences; comparisons are made to existing economic and financial LLM simulation baselines for metrics such as product selection accuracy, purchase quantity prediction, and simulation stability. Exact dataset sources, sizes, and population demographics are not specified in the abstract.
Themes: innovation adoption productivity human_ai_collab
Generalizability:
- Sandbox/simulation results may not reflect real consumer behavior in live markets (external validity).
- The representativeness and coverage of the transaction records used for post-training are unclear (data bias).
- LLM-internalized preferences may not transfer to contexts with different cultural or institutional norms.
- Performance may degrade for categories with extreme sparsity or nontransactional decision factors (e.g., durable goods, network effects).
- Scalability and computational cost limits for large real-world markets are not addressed.
- Regulatory and supply-side constraints and strategic firm behavior are likely under-modeled.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES) as a unified simulation framework applicable to cross-domain and cross-category scenarios.	Other	positive	high	existence and applicability of MALLES as a unified simulation framework	0.18
We introduce a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories.	Skill Acquisition	positive	high	ability of models to internalize consumer preferences via post-training	0.18
This preference-learning approach enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories.	Skill Acquisition	positive	medium	mitigation of data sparsity through cross-category preference transfer	0.11
To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces.	Organizational Efficiency	positive	high	simulation stability / stabilized sampling processes	0.18
We propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information, distributing cognitive load to alleviate single-agent attention bottlenecks and capturing critical decision factors through structured dialogue.	Task Allocation	positive	high	reduction of single-agent attention bottlenecks / distributed processing of product information	0.18
Experiments demonstrate that our framework achieves significant improvements in product selection accuracy compared to existing economic and financial LLM simulation baselines.	Decision Quality	positive	medium	product selection accuracy	0.11
Experiments demonstrate that our framework achieves significant improvements in purchase quantity prediction compared to existing economic and financial LLM simulation baselines.	Decision Quality	positive	medium	purchase quantity prediction accuracy	0.11
Experiments demonstrate that our framework achieves improved simulation stability compared to existing economic and financial LLM simulation baselines.	Organizational Efficiency	positive	medium	simulation stability	0.11
Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and latter analysis in the real economy based on foundational database.	Research Productivity	positive	medium	potential of LLMs for high-fidelity, scalable decision simulation	0.02