A procedurally generated strategic benchmark reveals that leading LLMs, while similar in overall strength, differ sharply in the kinds of strategic competence they possess — and some top models are markedly more locally volatile. GENSTRAT's capability profiles and a new jaggedness measure expose deployment risks that simple leaderboard scores miss.
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.
Summary
Main Finding
GENSTRAT introduces a procedurally generated benchmark (generalized betting games, GBGs) and a multi-axis diagnostic methodology that together let evaluators (a) draw fresh, uncontaminated strategic environments on demand, (b) decompose LLM strategic competence across six interpretable axes, and (c) detect local performance volatility via a jaggedness metric. In a 9-model tournament (36,937 matches) GENSTRAT separates models on mean strategic strength but, importantly, reveals qualitatively different capability profiles and local volatility: models with near-identical overall strength (e.g., gpt-5 and gemini-3.1) can differ substantially in brittleness and jaggedness, which has direct deployment implications.
Key Points
- Procedural strategic environments:
  - GBGs are two-player zero-sum imperfect-information card games whose phase graph and rule components are randomized (deck, phases, observability, auctions, simultaneous moves, conditional branches, etc.). Every game is generated deterministically from an integer seed.
  - The generator produces a large raw pool, and Monte Carlo quality gates accept only playable, informative games. Because fresh games can be drawn on demand, the benchmark is resistant to corpus contamination.
- Six-axis capability profiling (Monte Carlo–measured):
  - State space: combinatorial size (log10 of the number of distinct observable information states).
  - Temporal depth: degree to which early moves affect later payoffs.
  - Information sensitivity: how much optimal moves depend on private information.
  - Opponent modeling: how much the best response shifts with the opponent's policy.
  - Risk: tradeoff between expected value (EV) and downside safety.
  - Brittleness: sensitivity of payoffs to small (3%) policy perturbations.
- Jaggedness metric:
  - Measures within-distribution smoothness by quantifying how much win-margin residuals jump between strategically similar games. High jaggedness signals local volatility: the model's performance can change sharply between near-neighbor games.
- Benchmark construction and tournament:
  - From 12,351 candidate seeds, 2,000 games passed the Monte Carlo quality checks; 50 games were then selected by farthest-point sampling in the six-axis embedding to ensure coverage.
  - Tournament: 9 frontier and open-weight models, 36,937 match slots. Each model plays each game against at least two opponents, with 40 matches per model-pair-game (20 per seat). Prompts require JSON-terminated actions; parsing is lenient, with rare random-action fallbacks.
  - Estimation: an additive paired-comparison model on signed chip margins, y_s = α_i − α_j + ε_s, with a sum-to-zero constraint on the strengths α; confidence intervals come from a paired-cluster bootstrap (B = 2000).
- Empirical outcomes:
  - Leaderboard (chips/game, interval in brackets): gpt-5-4-high +0.85 [0.74, 0.96], gemini-3.1-pro-preview +0.83 [0.76, 0.91], claude-sonnet-4-6-max +0.64 [0.52, 0.75]. GENSTRAT separates models across a range of roughly 3 chips/game.
  - Qualitative differences: gemini-3.1-pro-preview shows broad gains across axes and is locally smooth; gpt-5 is top by mean margin but among the most locally jagged; claude’s improvements concentrate on the brittleness axis.
  - Thinking-mode ablation: increasing the “reasoning” budget yields positive point estimates of chip-margin gain in every model family. The gains are statistically robust for two families and positive but underpowered for the others, suggesting that extra deliberation improves strategic returns.
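The farthest-point selection mentioned under benchmark construction can be illustrated with a small sketch. It is not the paper's implementation: the min-max normalization, the Euclidean distance, the fixed starting index, and the function names `normalize` and `farthest_point_sample` are all assumptions made for illustration.

```python
import math

def normalize(vectors):
    """Min-max scale each axis to [0, 1]. The scaling rule is an assumption."""
    dims = range(len(vectors[0]))
    lo = [min(v[d] for v in vectors) for d in dims]
    hi = [max(v[d] for v in vectors) for d in dims]
    return [
        [(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0 for d in dims]
        for v in vectors
    ]

def farthest_point_sample(points, k, start=0):
    """Greedy FPS: repeatedly add the point farthest from the chosen set."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda idx: min_dist[idx])
        if min_dist[nxt] == 0.0:
            break  # only duplicates of chosen points remain
        chosen.append(nxt)
        for idx, p in enumerate(points):
            min_dist[idx] = min(min_dist[idx], math.dist(p, points[nxt]))
    return chosen

# Toy pool of 1-D "axis vectors"; a real pool would hold six values per game.
pool = [[0.0], [1.0], [10.0], [11.0], [5.0]]
selected = farthest_point_sample(normalize(pool), k=3)
```

Each pick maximizes the distance to the nearest game already chosen, which is why the selection spreads across the embedding instead of clustering where the pool is dense.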
Data & Methods
- Game generation and quality gates:
  - Monte Carlo random-agent simulations (2,000 episodes per candidate) filter candidates on three criteria:
    - Average number of moves per player ≤ 10.
    - Phases should fire in at least 5% of episodes; at most 30% of a game's phases may fall below that rate.
    - For conditional branches, no more than 34% may remain dead across the Monte Carlo run.
  - 12,351 candidates → ~2,000 accepted pool → 50-game benchmark selected with farthest-point sampling (FPS) on normalized six-axis vectors.
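The three gates can be read as a simple predicate over per-game Monte Carlo statistics. The sketch below is one possible reading, not the paper's code: the `GameStats` container, its field names, and the interpretation of a "dead" branch as one that was never taken are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class GameStats:
    """Hypothetical summary of one candidate's random-agent simulations."""
    avg_moves_per_player: float
    phase_fire_rates: list = field(default_factory=list)    # fraction of episodes per phase
    branch_taken_counts: list = field(default_factory=list)  # times each branch was taken

def passes_quality_gates(stats, max_moves=10.0, min_fire_rate=0.05,
                         max_rare_phase_frac=0.30, max_dead_branch_frac=0.34):
    # Gate 1: games must stay short enough to play.
    if stats.avg_moves_per_player > max_moves:
        return False
    # Gate 2: at most 30% of phases may fire in fewer than 5% of episodes.
    if stats.phase_fire_rates:
        rare = sum(rate < min_fire_rate for rate in stats.phase_fire_rates)
        if rare / len(stats.phase_fire_rates) > max_rare_phase_frac:
            return False
    # Gate 3: at most 34% of conditional branches may be dead
    # (assumed here to mean "never taken in any episode").
    if stats.branch_taken_counts:
        dead = sum(count == 0 for count in stats.branch_taken_counts)
        if dead / len(stats.branch_taken_counts) > max_dead_branch_frac:
            return False
    return True

candidate = GameStats(6.5, [0.9, 0.4, 0.02, 0.3], [120, 0, 45])
accepted = passes_quality_gates(candidate)
```

Keeping the thresholds as keyword arguments makes it easy to see how sensitive the accepted pool is to each gate.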
- Axis measurements:
  - Derived from Monte Carlo traces; the formulas are given in the paper's appendices. The axes are mostly weakly correlated, with variance inflation factors (VIFs) below conservative thresholds.
- Tournament design:
  - Seat effects are controlled by running matches under both seat assignments with paired play seeds, so chance draws are shared across seat assignments.
  - Parser: strict JSON is requested, but a lenient fallback recovers from small format violations; if parsing still fails, a uniformly random legal action is chosen (fallback rates ≤ 0.5%).
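A parser of this kind might look like the sketch below. The response schema (`{"action": ...}`), the regular expression used for lenient recovery, and the function name `parse_action` are assumptions; the source does not specify the actual format.

```python
import json
import random
import re

def parse_action(reply, legal_actions, rng=random):
    """Return (action, used_fallback) for one model reply.

    Tries strict JSON first, then the last {...} span in the reply,
    and finally falls back to a uniformly random legal action.
    """
    candidates = [reply.strip()]
    spans = re.findall(r"\{[^{}]*\}", reply)  # lenient: flat JSON objects only
    if spans:
        candidates.append(spans[-1])
    for text in candidates:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        # "action" is a hypothetical key; illegal actions are treated as parse failures.
        if isinstance(obj, dict) and obj.get("action") in legal_actions:
            return obj["action"], False
    return rng.choice(legal_actions), True

legal = ["fold", "call", "raise"]
action, used_fallback = parse_action('I will raise. {"action": "raise"}', legal)
```

Returning the fallback flag alongside the action lets the harness track the fallback rate per model, which is the quantity the summary reports as at most 0.5%.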
- Strength estimation and inference:
  - Continuous-margin paired-comparison model (an analogue of Bradley–Terry that uses signed chip margins).
  - Bootstrap resamples are clustered by (game seed, unordered model pair, run id), so paired matches that share randomness are resampled together.
  - Bootstrapped 95% CIs are reported for model strengths.
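To make the estimation step concrete, here is a minimal sketch of a least-squares fit of y_s = α_i − α_j + ε_s under a sum-to-zero constraint, followed by a percentile cluster bootstrap. It assumes ordinary least squares, a percentile interval, and that disconnected resamples are skipped; none of these details are confirmed by the source, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def _solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_strengths(matches, n_models):
    """matches: (i, j, y) with y the signed chip margin of model i over model j."""
    A = [[0.0] * n_models for _ in range(n_models)]
    b = [0.0] * n_models
    for i, j, y in matches:  # accumulate the normal equations
        A[i][i] += 1.0; A[j][j] += 1.0
        A[i][j] -= 1.0; A[j][i] -= 1.0
        b[i] += y; b[j] -= y
    # The normal equations are rank-deficient by one; swap the last (redundant)
    # equation for the sum-to-zero constraint.
    A[-1] = [1.0] * n_models
    b[-1] = 0.0
    return _solve(A, b)

def is_connected(matches, n_models):
    adj = defaultdict(set)
    for i, j, _ in matches:
        adj[i].add(j); adj[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for nb in adj[stack.pop()]:
            if nb not in seen:
                seen.add(nb); stack.append(nb)
    return len(seen) == n_models

def cluster_bootstrap_ci(records, n_models, n_boot=2000, level=0.95, seed=0):
    """records: (cluster_key, i, j, y). Whole clusters are resampled together."""
    clusters = defaultdict(list)
    for key, i, j, y in records:
        clusters[key].append((i, j, y))
    keys, rng, draws = list(clusters), random.Random(seed), []
    for _ in range(n_boot):
        sample = [m for k in rng.choices(keys, k=len(keys)) for m in clusters[k]]
        if is_connected(sample, n_models):  # assumption: skip unidentifiable resamples
            draws.append(fit_strengths(sample, n_models))
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    cis = []
    for m in range(n_models):
        vals = sorted(d[m] for d in draws)
        cis.append((vals[int(lo_q * (len(vals) - 1))], vals[int(hi_q * (len(vals) - 1))]))
    return cis
```

In use, each record's cluster key would be the tuple (game seed, unordered model pair, run id), so that the two seat assignments of a paired match always enter or leave a resample together.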
- Jaggedness:
  - Constructed from residuals of the strength model across neighborhoods of similar games; higher jaggedness means larger local fluctuations in win-margin residuals between strategically similar games.
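The summary does not give the jaggedness formula, so the following is only one plausible reading of the description above: for each game, compare the model's residual with the residuals of its k nearest games in axis space, and average the absolute differences. The k-nearest-neighbor construction, the Euclidean distance, and the function name `jaggedness` are assumptions, not the paper's definition.

```python
import math

def jaggedness(axis_vectors, residuals, k=3):
    """Mean absolute residual gap between each game and its k nearest games.

    axis_vectors: one normalized axis vector per game.
    residuals: the model's per-game residual from the strength model.
    Requires at least two games.
    """
    n = len(residuals)
    total = 0.0
    for g in range(n):
        # Rank the other games by distance to game g in axis space.
        neighbors = sorted(
            (h for h in range(n) if h != g),
            key=lambda h: math.dist(axis_vectors[g], axis_vectors[h]),
        )[:k]
        total += sum(abs(residuals[g] - residuals[h]) for h in neighbors) / len(neighbors)
    return total / n

# Toy example: six games on a line, one smooth and one alternating residual pattern.
games = [[float(x)] for x in range(6)]
smooth = jaggedness(games, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], k=1)
jagged = jaggedness(games, [1.0, -1.0, 1.0, -1.0, 1.0, -1.0], k=1)
```

Under this reading, two models with the same mean residual can score very differently: the alternating pattern yields a much larger value than the smooth one even though both average out across the benchmark.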
Implications for AI Economics
- Better deployment diagnostics than single-number benchmarks:
  - An average win rate or a single leaderboard position masks important heterogeneity (risk preference, opponent adaptivity, brittleness, local volatility). GENSTRAT’s per-axis profiles plus jaggedness help identify models suited to particular economic roles (e.g., price-setting agents, bidding agents, market-makers).
- Avoiding contamination and enabling stress-testing:
  - Procedural generation allows fresh, uncontaminated game draws after any amount of offline tuning. For economic deployments exposed to adversarial or distribution-shifted counterparties, periodic fresh-draw evaluations provide a robust safety check.
- Managing local volatility risk:
  - Two models with similar mean performance can exhibit very different jaggedness: a locally jagged model may perform well on average but fail badly on specific, strategically nearby environments. In markets and auctions this can create catastrophic edge-case outcomes (unexpected collusion or anti-competitive behavior, failed negotiation responses, or large downside exposure). Deployers should prefer smoother models in high-stakes contexts, or at least quantify and hedge jaggedness.
- Tailoring model choice to task-specific axes:
  - Use the six axes to match model strengths to application needs:
    - Opponent-rich/strategic adaptation tasks → prioritize models scoring high on the opponent-modeling axis.
    - High-stakes downside sensitivity (e.g., regulatory compliance, auctions with severe penalties) → prefer low-risk and low-brittleness profiles.
    - Long-horizon planning (multi-stage markets) → prioritize temporal-depth competence.
- Operational recommendations:
  - Incorporate procedural strategic evaluations into pre-deployment testing, with per-axis acceptance thresholds and a jaggedness tolerance.
  - Run thinking-mode/compute-budget ablations: extra reasoning budget can improve strategic returns, so calibrate reasoning and configuration knobs where allowed.
  - Monitor post-deployment for performance drift across axes and for emergent jaggedness, and re-evaluate periodically with fresh procedural draws.
  - For regulators and auditors: require multi-axis strategic evaluation and local-volatility metrics (not only mean performance) for models intended as economic agents.
Overall, GENSTRAT supplies a reproducible, scalable framework to move from “how strong is the model on canonical games?” to “how will this model behave across a distribution of economically relevant strategic environments, and how brittle or locally volatile is that behavior?” This matters for safe, predictable deployment of LLMs as economic agents.
Assessment
Claims (12)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. [Adoption Rate] | positive | high | deployment of LLMs as economic agents | 0.09 |
| Existing strategic-reasoning benchmarks evaluate models on fixed canonical games and may saturate as the frontier improves and fail to generalize to varied real-world strategic environments. [Other] | negative | high | benchmark generalizability / benchmark saturation | 0.03 |
| We introduce GENSTRAT, which uses procedurally generated strategic environments to address the limitations of fixed benchmarks. [Other] | positive | high | availability of procedurally generated strategic environments for evaluation | 0.18 |
| GENSTRAT generates a distribution of two-player zero-sum imperfect-information card games. [Other] | neutral | high | game distribution (two-player zero-sum imperfect-information card games) | n=2000; 0.18 |
| The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. [Other] | positive | high | freshness/resistance-to-contamination of benchmarks | 0.18 |
| We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). [Other] | positive | high | decomposed capability profile across six axes | 0.18 |
| We introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. [Output Quality] | positive | high | within-distribution smoothness / local volatility (jaggedness) | 0.18 |
| We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. [Other] | null_result | high | evaluation sample size / tournament scale (matches run) | n=36000; 0.3 |
| Newer frontier-tier models score higher on average. [Output Quality] | positive | high | average model score / overall strength | n=9; 0.18 |
| Models with near-identical overall strength show qualitatively different capability profiles. [Output Quality] | mixed | high | differences in capability-profile axes (state space, temporal depth, information sensitivity, opponent modeling, risk, brittleness) | n=9; 0.18 |
| Two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. [Output Quality] | negative | high | local volatility / jaggedness | n=3; 0.18 |
| Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide. [Organizational Efficiency] | positive | high | diagnostic usefulness for deployment decisions | 0.18 |