A multi‑agent LLM sandbox that learns cross‑category consumer preferences and uses mean‑field interactions improves simulated product choice and purchase forecasts, offering a promising platform for scalable market simulation—though its claims rest on simulated environments and depend heavily on the training transaction data.
In the real economy, modern decision-making is fundamentally challenged by high-dimensional, multimodal environments, which are further complicated by agent heterogeneity and combinatorial data sparsity. This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES), leveraging the inherent generalization capabilities of large-scale models to establish a unified simulation framework applicable to cross-domain and cross-category scenarios. Central to our approach is a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories. This methodology enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces. Furthermore, we propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information. This architecture distributes cognitive load to alleviate single-agent attention bottlenecks and captures critical decision factors through structured dialogue. Experiments demonstrate that our framework achieves significant improvements in product selection accuracy, purchase quantity prediction, and simulation stability compared to existing economic and financial LLM simulation baselines. Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and subsequent analysis in the real economy based on a foundational database.
Summary
Main Finding
MALLES (Multi-Agent Large Language Model-based Economic Sandbox) shows that large language models, when economically aligned via post-training on heterogeneous transaction records and deployed inside a structured multi-agent, mean-field architecture, can mitigate category-level data sparsity and improve the fidelity of retail and wholesale decision simulations. The paper reports significant gains in product-selection accuracy, purchase-quantity prediction, and simulation stability over prior LLM-based economic and financial simulators.
Key Points
- Cross-category post-training: LLMs are post-trained on large, heterogeneous transaction logs so they internalize latent consumer preference patterns that transfer across product categories — mitigating per-category data sparsity and improving OOD generalization.
- Multi-agent discussion: Long, high-dimensional product contexts are handled by a role-based multi-agent dialogue (e.g., dealer, service, manufacturer for wholesale). Agents distribute cognitive load, compress context into salient decision factors, and produce interpretable, structured decisions.
- Retail vs wholesale flows: Retail agents emphasize need-satisfaction, price/discount sensitivity, and profile summaries; wholesale uses multi-round dialogues, numerical optimization, and symbolic regression to derive profit-oriented purchasing formulas.
- Stabilization mechanisms:
  - Mean-field alternation: the macro-level response variable $\mu_t$ is iteratively updated from agent outputs to stabilize population-level sampling and align simulated distributions with real ones.
  - Consistency regularization and multi-sampling: generate multiple perturbed responses and penalize inconsistency ($\mathcal{L}_{\text{cons}}$) to reduce output randomness.
  - Attention control: an attention-matching loss, $\mathcal{L}_{\text{attn}} = \mathbb{E}[\mathrm{KL}(A \,\|\, A^*)]$, enforces attention priors toward economic/numeric features.
  - Calibration: post-process or reweight $P_{\text{sim}}$ to match $P_{\text{real}}$ by minimizing $D_{\mathrm{KL}}(P_{\text{real}} \,\|\, f(P_{\text{sim}}))$ or a Wasserstein distance; the mapping $f$ and weights $w(X, Z)$ correct distributional gaps.
- Interpretability: symbolic regression is embedded in the multi-agent loop to extract compact, human-readable decision formulas consistent with observed transaction data.
- Hybrid cognitive components: optional cognitive core (e.g., ACT-R style) can generate candidate rule sets that the LLM evaluates, improving rule-based stability and interpretability.
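The mean-field alternation listed among the stabilization mechanisms can be illustrated with a toy fixed-point loop. This is a minimal sketch and not the paper's implementation: the logistic `agent_purchase_prob` stands in for an LLM agent's decision, and the `coupling` and `damping` parameters, the function names, and the convergence test are all assumptions made for illustration.

```python
import math


def agent_purchase_prob(base_utility, mu, coupling=1.0):
    """Hypothetical stand-in for one agent: a logistic response to its own
    utility plus the macro-level state mu (e.g. population purchase share)."""
    return 1.0 / (1.0 + math.exp(-(base_utility + coupling * (mu - 0.5))))


def mean_field_iterate(utilities, mu0=0.5, damping=0.5, tol=1e-8, max_iter=200):
    """Alternate between micro decisions and a damped update of mu.

    Each round: (1) every agent responds to the current mu,
    (2) mu moves part of the way toward the population average response.
    Returns the final mu and the full trajectory of mu values.
    """
    mu = mu0
    history = [mu]
    for _ in range(max_iter):
        micro = [agent_purchase_prob(u, mu) for u in utilities]
        population_mean = sum(micro) / len(micro)
        mu_new = (1.0 - damping) * mu + damping * population_mean
        history.append(mu_new)
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu, history


if __name__ == "__main__":
    # Four heterogeneous agents with made-up base utilities.
    mu_star, trajectory = mean_field_iterate([-1.0, 0.0, 0.5, 2.0])
    print(f"mu converged to {mu_star:.4f} after {len(trajectory) - 1} updates")
```

The damping factor is one way to keep the macro variable from oscillating between rounds; in a real sandbox each call to the agent function would be an expensive LLM sample, so the number of alternation rounds becomes a direct cost driver.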
Data & Methods
- Data inputs:
  - Transaction records (timestamped orders with customer IDs, product IDs, quantities, unit prices, discounts, channels, review scores).
  - Product metadata and multimodal attributes (IDs, categories, base prices, attribute text, image embeddings).
  - Dialogue logs (negotiations, agent roles, outcomes) for wholesale scenarios.
  - Customer profiles (income brackets, buyer types, historical purchase summaries).
- Problem setup and objective:
  - True decision function $a_i = D(X^{\text{obs}}_i, X^{\text{hid}}_i, \rho_i)$; the simulator produces $\hat{a}_i = \hat{D}(X^{\text{obs}}_i, Z_i, \hat{\rho}_i; \theta)$, where $Z_i$ approximates hidden context and $\theta$ are model parameters.
  - Training minimizes the expected loss $\mathcal{E} = \mathbb{E}_i[\ell(a_i, \hat{a}_i)]$ (classification or continuous losses as appropriate).
- Core methods:
  - Post-training LLMs on pooled, cross-category transaction data to learn transferable preference representations and numerical sensitivity.
  - Profile summarization module constructs $Z_i$ capturing long-term behavior, promotion sensitivity, brand affinity, etc., used to approximate hidden variables.
  - Attention priors and attention-matching regularizer to prioritize numeric/economic inputs.
  - Multi-agent role-based dialogue with iterative rounds and final parsing functions $\phi_{\text{retail}}$ / $\phi_{\text{wholesale}}$ to extract structured decisions (selection $b_r$ / $s_w$ and quantity $q_r$ / $q_w$).
  - Mean-field alternating updates to align micro decisions with macro distributions and reduce population-level drift.
  - Consistency loss computed over small input perturbations to reduce stochasticity.
  - Calibration step via mapping $f$ and reweighting $w(X, Z)$ using divergence metrics (KL, Wasserstein).
  - Integration of symbolic regression for compact rule discovery and interpretability.
- Baselines and evaluation:
  - Compared against recent LLM-based economic simulators and financial LLM frameworks (e.g., EconAgent, ABIDES-Economist, FinCon, and other cited LLM baselines).
  - Reported metrics are product selection accuracy, purchase quantity prediction error, and simulation stability (population-level distributional alignment). The paper reports significant improvements on all three; the underlying datasets appear to be proprietary transaction records.
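The calibration step listed under the core methods can be sketched on a small categorical example. The snippet below is an illustrative assumption rather than the paper's procedure: it takes the mapping $f$ to be a one-parameter power (temperature-style) rescaling, picks the exponent by grid search to minimize the KL divergence from the real distribution to the mapped simulated one, and includes a simple 1-D Wasserstein distance for ordered supports such as purchase quantities. The function names, the grid, and the example distributions are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def power_map(p_sim, tau):
    """Assumed form of the mapping f: raise each probability to tau, renormalize."""
    powered = [pi ** tau for pi in p_sim]
    total = sum(powered)
    return [x / total for x in powered]


def fit_tau(p_real, p_sim, grid=None):
    """Grid-search the exponent that minimizes KL(p_real || f(p_sim))."""
    if grid is None:
        grid = [0.1 * k for k in range(1, 31)]  # 0.1 .. 3.0
    return min(grid, key=lambda tau: kl_divergence(p_real, power_map(p_sim, tau)))


def wasserstein_1d(p, q):
    """W1 distance between distributions on the same unit-spaced ordered support,
    computed as the summed absolute gap between the two CDFs."""
    cdf_gap = 0.0
    total = 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


if __name__ == "__main__":
    p_real = [0.5, 0.3, 0.2]   # made-up observed category shares
    p_sim = [0.7, 0.2, 0.1]    # made-up, overly peaked simulated shares
    tau = fit_tau(p_real, p_sim)
    calibrated = power_map(p_sim, tau)
    print("tau:", tau)
    print("KL before:", kl_divergence(p_real, p_sim))
    print("KL after: ", kl_divergence(p_real, calibrated))
    print("W1 after: ", wasserstein_1d(p_real, calibrated))
```

A single scalar exponent is the simplest possible choice of $f$; the weights $w(X, Z)$ mentioned in the summary would instead condition the correction on observed features and profile summaries, which this sketch does not attempt.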
Implications for AI Economics
- Practical simulation for business decisions: MALLES provides a pathway to scalable, high-fidelity simulation of retail and wholesale decision-making that better handles long-tail categories and limited per-category data — useful for pricing, promotion design, inventory planning and procurement.
- Better numerical sensitivity and multimodal alignment: the combination of post-training on transactional data plus attention priors addresses a common weakness of LLM agents (poor numeric sensitivity), making LLMs more actionable for economic policy and operational decisions.
- Interpretability and theory discovery: embedding symbolic regression in multi-agent reasoning allows extraction of compact decision formulas, bridging black-box LLM outputs and economic theory / managerial rules.
- Population-level consistency via mean-field methods: modeling interactions between micro agents and macro distributions is critical when using LLMs for population simulations; this approach reduces accumulation of micro errors into misleading macro patterns.
- Research directions and needs:
  - Benchmarking: standardized, open benchmarks are needed to validate OOD generalization and numerical sensitivity across realistic multimodal datasets.
  - Data and reproducibility: the approach depends on rich transaction logs (often proprietary). Open datasets or synthetic population generators will be important for reproducible research.
  - Computational and deployment trade-offs: multi-agent dialogues, multi-sampling, and mean-field iterations increase compute cost; research should explore efficient approximations and calibration schemes.
  - Policy and fairness: simulation-driven decisions (pricing, inventory allocation) require careful auditing for fairness and economic externalities when models are trained on historical transactional data that may embed biases.
Overall, MALLES demonstrates a practical architecture for bringing LLMs closer to applied economic simulation by combining cross-category post-training, structured multi-agent reasoning, and stabilization/calibration layers to improve fidelity and interpretability in retail and wholesale settings.
Assessment
Claims (9)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| This paper introduces a Multi-Agent Large Language Model-based Economic Sandbox (MALLES) as a unified simulation framework applicable to cross-domain and cross-category scenarios. | Other | positive | high | existence and applicability of MALLES as a unified simulation framework | 0.18 |
| We introduce a preference learning paradigm in which LLMs are economically aligned via post-training on extensive, heterogeneous transaction records across diverse product categories. | Skill Acquisition | positive | high | ability of models to internalize consumer preferences via post-training | 0.18 |
| This preference-learning approach enables the models to internalize and transfer latent consumer preference patterns, thereby mitigating the data sparsity issues prevalent in individual categories. | Skill Acquisition | positive | medium | mitigation of data sparsity through cross-category preference transfer | 0.11 |
| To enhance simulation stability, we implement a mean-field mechanism designed to model the dynamic interactions between the product environment and customer populations, effectively stabilizing sampling processes within high-dimensional decision spaces. | Organizational Efficiency | positive | high | simulation stability / stabilized sampling processes | 0.18 |
| We propose a multi-agent discussion framework wherein specialized agents collaboratively process extensive product information, distributing cognitive load to alleviate single-agent attention bottlenecks and capturing critical decision factors through structured dialogue. | Task Allocation | positive | high | reduction of single-agent attention bottlenecks / distributed processing of product information | 0.18 |
| Experiments demonstrate that our framework achieves significant improvements in product selection accuracy compared to existing economic and financial LLM simulation baselines. | Decision Quality | positive | medium | product selection accuracy | 0.11 |
| Experiments demonstrate that our framework achieves significant improvements in purchase quantity prediction compared to existing economic and financial LLM simulation baselines. | Decision Quality | positive | medium | purchase quantity prediction accuracy | 0.11 |
| Experiments demonstrate that our framework achieves improved simulation stability compared to existing economic and financial LLM simulation baselines. | Organizational Efficiency | positive | medium | simulation stability | 0.11 |
| Our results substantiate the potential of large language models as a foundational pillar for high-fidelity, scalable decision simulation and subsequent analysis in the real economy based on a foundational database. | Research Productivity | positive | medium | potential of LLMs for high-fidelity, scalable decision simulation | 0.02 |