A procedurally generated strategic benchmark reveals that leading LLMs, while similar in overall strength, differ sharply in the kinds of strategic competence they possess — and some top models are markedly more locally volatile. GENSTRAT's capability profiles and a new jaggedness measure expose deployment risks that simple leaderboard scores miss.

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala · May 22, 2026

arXiv · descriptive · medium evidence · relevance 7/10

GENSTRAT, a procedurally generated benchmark of imperfect-information games plus a capability-profile and jaggedness metric, shows frontier LLMs differ not only in average strategic strength but also in qualitative capability trade-offs and local volatility that aggregate rankings hide.

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

Summary

Main Finding

GENSTRAT introduces a procedurally generated benchmark (generalized betting games, GBGs) and a multi-axis diagnostic methodology that together let evaluators (a) draw fresh, uncontaminated strategic environments on demand, (b) decompose LLM strategic competence across six interpretable axes, and (c) detect local performance volatility via a jaggedness metric. In a 9-model tournament (36,937 matches) GENSTRAT separates models on mean strategic strength but, importantly, reveals qualitatively different capability profiles and local volatility: models with near-identical overall strength (e.g., gpt-5 and gemini-3.1) can differ substantially in brittleness and jaggedness, which has direct deployment implications.

Key Points

Procedural strategic environments:
- GBGs are two-player zero-sum imperfect-information card games whose phase graph and rule components are randomized (deck, phases, observability, auctions, simultaneous moves, conditional branches, etc.). Every game is deterministic from an integer seed.
- The generator produces a large raw pool, and Monte Carlo quality gates accept only playable, informative games. Because fresh games can be drawn on demand, the benchmark resists corpus contamination.
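The seed-determinism property can be illustrated with a toy generator. This is only a sketch: the function name `generate_game`, the field names, and the value ranges are hypothetical and are not taken from the paper. The point it shows is that all randomness flows from a private RNG built from the integer seed, so the same seed always rebuilds the same game.

```python
import random

def generate_game(seed: int) -> dict:
    """Toy stand-in for a seeded game generator (all fields are hypothetical)."""
    rng = random.Random(seed)  # private RNG: the seed is the only source of randomness
    n_phases = rng.randint(2, 5)
    return {
        "seed": seed,
        "deck_size": rng.choice([12, 24, 52]),
        "phases": [rng.choice(["deal", "bet", "auction", "reveal"])
                   for _ in range(n_phases)],
        "simultaneous_moves": rng.random() < 0.5,
    }

# Re-running with the same seed reproduces the game exactly.
game_a = generate_game(7)
game_b = generate_game(7)
```

Drawing a fresh evaluation set then amounts to picking seeds that have not been used before.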
Six-axis capability profiling (Monte Carlo–measured):
- State space: combinatorial size (log10 of distinct observable info states).
- Temporal depth: degree to which early moves affect later payoffs.
- Information sensitivity: how much optimal moves depend on private information.
- Opponent modeling: how much the best response shifts with opponent policy.
- Risk: trade-off between EV and downside safety.
- Brittleness: sensitivity of payoffs to small (3%) policy perturbations.
Jaggedness metric:
- Measures within-distribution smoothness by quantifying how win-margin residuals jump between strategically similar games; signals local volatility where a model’s performance can change sharply for near-neighbor games.
Benchmark construction and tournament:
- From 12,351 candidate seeds, 2,000 games passed Monte Carlo quality checks; 50 games were selected via farthest-point sampling across the 6-axis embedding to ensure coverage.
- Tournament: 9 frontier/open models and 36,937 match slots. Each model plays each game against at least two opponents, with 40 matches per model-pair-game (20 per seat). Prompting requires JSON-terminated actions, with lenient parsing and rare random-action fallbacks.
- Estimation: additive paired-comparison model on signed chip margins (y_s = α_i − α_j + ε_s) with a sum-to-zero constraint; paired-cluster bootstrap (B = 2,000) for CIs.
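The farthest-point selection of benchmark games mentioned above could be sketched as follows. This is a minimal greedy version under stated assumptions: min-max normalization of each axis, Euclidean distance, and an arbitrary starting index are choices made here for illustration, and the pool is random placeholder data rather than real axis measurements.

```python
import numpy as np

def farthest_point_sample(X, k, start=0):
    """Greedy farthest-point sampling: choose k rows of X that are spread out."""
    X = np.asarray(X, dtype=float)
    # Min-max normalize each axis so no single axis dominates the distance
    # (an assumption; the summary only says the vectors are normalized).
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    Z = (X - lo) / span
    chosen = [start]
    # d[i] = distance from point i to its nearest already-chosen point.
    d = np.linalg.norm(Z - Z[start], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(d))  # the point farthest from the current selection
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(Z - Z[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
pool = rng.random((2000, 6))          # placeholder 6-axis vectors for 2,000 games
benchmark = farthest_point_sample(pool, 50)
```

Compared with uniform random sampling, this greedy rule tends to include games at the extremes of each axis, which is what makes it useful for coverage.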
Empirical outcomes:
- Leaderboard (chips/game, with 95% CIs): gpt-5-4-high +0.85 [0.74, 0.96], gemini-3.1-pro-preview +0.83 [0.76, 0.91], claude-sonnet-4-6-max +0.64 [0.52, 0.75]. Model strengths span a range of roughly 3 chips/game.
- Qualitative differences: gemini-3.1-pro-preview shows broad gains across axes and is locally smooth; gpt-5 is top by mean margin but among the most locally jagged; claude’s improvements concentrate on the brittleness axis.
- Thinking-mode ablation: a larger “reasoning” budget yields positive point estimates of chip-margin gain across model families. The gain is statistically robust in two families and positive but underpowered in the others, suggesting that extra deliberation improves strategic returns.

Data & Methods

Game generation and quality gates:
- Monte Carlo random-agent simulations (2,000 episodes per candidate) are used to filter candidates on three criteria:
  - Game length: the average number of moves per player is ≤ 10.
  - Phase coverage: each phase should fire in ≥5% of episodes; at most 30% of phases may fall below that threshold.
  - Branch coverage: no more than 34% of conditional branches may remain dead across the MC run.
- Pipeline: 12,351 candidates → 2,000-game accepted pool → 50-game benchmark, selected with farthest-point sampling (FPS) on normalized 6-axis vectors.
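The three quality gates listed above can be read as a simple filter over per-candidate Monte Carlo statistics. The sketch below uses the thresholds from the list; the `MCStats` container, its field names, and the reading of a "dead" branch as one never taken during the run are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class MCStats:
    """Hypothetical summary of one candidate's random-agent simulations."""
    avg_moves_per_player: float
    phase_fire_rates: list[float]  # fraction of episodes in which each phase fired
    branch_taken: list[bool]       # assumed: True if the branch was ever taken

def passes_quality_gates(s: MCStats) -> bool:
    # Gate 1: average number of moves per player must not exceed 10.
    if s.avg_moves_per_player > 10:
        return False
    # Gate 2: at most 30% of phases may fire in fewer than 5% of episodes.
    rare = sum(rate < 0.05 for rate in s.phase_fire_rates)
    if s.phase_fire_rates and rare / len(s.phase_fire_rates) > 0.30:
        return False
    # Gate 3: at most 34% of conditional branches may remain dead.
    dead = sum(not taken for taken in s.branch_taken)
    if s.branch_taken and dead / len(s.branch_taken) > 0.34:
        return False
    return True

good = MCStats(6.2, [0.9, 0.6, 0.04, 0.3], [True, True, False])
too_long = MCStats(14.0, [0.9, 0.6], [True])
accepted = [g for g in (good, too_long) if passes_quality_gates(g)]
```

Note that a 34% ceiling admits a game in which one of three branches is dead, which a 33% ceiling would reject.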
Axis measurements:
- Derived from Monte Carlo traces; the formulas are given in the paper's appendices. Axes are mostly weakly correlated, with variance inflation factors (VIFs) below conservative thresholds.
Tournament design:
- Seat effects controlled by running matches with both seat assignments and paired play seeds so chance draws are shared across seat assignments.
- Parser: strict JSON required but lenient fallback recovers small violations; if parsing fails, a uniform random legal action is chosen (fallback rates ≤ 0.5%).
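A parse-then-fallback loop of the kind described above might look like the sketch below. The `"action"` key, the regular expression used for lenient recovery, and the function name are assumptions; the source only states that strict JSON is expected, that small violations are recovered, and that a uniform random legal action is used when parsing fails.

```python
import json
import random
import re

def parse_action(reply, legal_actions, rng):
    """Return (action, used_fallback) for a model reply."""
    candidates = [reply.strip()]                 # strict: the whole reply is JSON
    spans = re.findall(r"\{[^{}]*\}", reply)     # lenient: flat {...} spans in the text
    if spans:
        candidates.append(spans[-1])             # actions are expected at the end
    for text in candidates:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and obj.get("action") in legal_actions:
            return obj["action"], False
    # Nothing usable was found: pick a uniform random legal action.
    return rng.choice(legal_actions), True

rng = random.Random(0)
legal = ["fold", "call", "raise"]
strict = parse_action('{"action": "call"}', legal, rng)
lenient = parse_action('I think raising is best. {"action": "raise"}', legal, rng)
failed = parse_action("no idea", legal, rng)
```

Returning the fallback flag alongside the action makes it straightforward to report fallback rates per model.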
Strength estimation and inference:
- Continuous-margin paired-comparison model (analogue of Bradley–Terry using signed chip margins).
- Clustering for bootstrap resamples uses (game seed, unordered model pair, run id) so paired matches share randomness in resamples.
- Bootstrapped 95% CIs reported for model strengths.
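The estimation steps above can be sketched with ordinary least squares and a cluster-level resample. This is an illustration on synthetic data, not the paper's code: the function names are hypothetical, plain least squares is assumed as the fitting method, and clusters are represented by integer ids standing in for (game seed, unordered model pair, run id).

```python
import numpy as np

def fit_strengths(i_idx, j_idx, y, n_models):
    """Least-squares fit of y_s = alpha_i - alpha_j + eps_s with sum(alpha) = 0."""
    X = np.zeros((len(y), n_models))
    rows = np.arange(len(y))
    X[rows, i_idx] = 1.0
    X[rows, j_idx] = -1.0
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha - alpha.mean()  # enforce the sum-to-zero constraint explicitly

def cluster_bootstrap_ci(i_idx, j_idx, y, clusters, n_models, B=2000, seed=0):
    """Percentile 95% CIs, resampling whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    ids, inv = np.unique(clusters, return_inverse=True)
    members = [np.flatnonzero(inv == c) for c in range(len(ids))]
    draws = np.empty((B, n_models))
    for b in range(B):
        picked = rng.integers(0, len(ids), size=len(ids))
        rows = np.concatenate([members[c] for c in picked])
        draws[b] = fit_strengths(i_idx[rows], j_idx[rows], y[rows], n_models)
    return np.percentile(draws, [2.5, 97.5], axis=0)

# Synthetic tournament: three models with known strengths.
true_alpha = np.array([1.0, 0.0, -1.0])
rng = np.random.default_rng(1)
i_idx = rng.integers(0, 3, size=600)
j_idx = (i_idx + rng.integers(1, 3, size=600)) % 3   # opponent always differs
y = true_alpha[i_idx] - true_alpha[j_idx] + rng.normal(0, 0.5, size=600)
clusters = np.arange(600) // 2                       # pairs of matches share a cluster
alpha_hat = fit_strengths(i_idx, j_idx, y, 3)
ci_low, ci_high = cluster_bootstrap_ci(i_idx, j_idx, y, clusters, 3, B=200)
```

Resampling clusters rather than individual matches keeps paired matches together, so the intervals reflect the shared randomness within each pair.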
Jaggedness:
- Constructed from residuals of the strength model across similar-game neighborhoods; higher jaggedness = larger local fluctuations in win-margin residuals between strategically similar games.
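One possible way to turn this description into a number is shown below. The summary does not give the paper's formula, so this is only an assumed instantiation: for each game, take its k nearest neighbours in axis space and average the absolute gap between residuals. The function name, the choice of k, and the Euclidean metric are all hypothetical.

```python
import numpy as np

def jaggedness(residuals, game_axes, k=5):
    """Mean absolute residual gap between each game and its k nearest neighbours."""
    r = np.asarray(residuals, dtype=float)
    Z = np.asarray(game_axes, dtype=float)
    # Pairwise Euclidean distances between games in axis space.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)               # a game is not its own neighbour
    nbrs = np.argsort(D, axis=1)[:, :k]       # indices of the k nearest games
    return float(np.mean(np.abs(r[:, None] - r[nbrs])))

rng = np.random.default_rng(0)
axes = rng.random((200, 6))                   # placeholder 6-axis game embedding
smooth = axes.sum(axis=1)                     # residuals that vary smoothly with the axes
shuffled = rng.permutation(smooth)            # same values, no local structure
j_smooth = jaggedness(smooth, axes)
j_shuffled = jaggedness(shuffled, axes)
```

In this toy setup the shuffled residuals score higher than the smooth ones even though both have the same overall spread, which is the distinction between local volatility and global variance.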

Implications for AI Economics

Better deployment diagnostics than single-number benchmarks:
- Average win-rate or single leaderboard position masks important heterogeneity (risk preference, opponent adaptivity, brittleness, local volatility). GENSTRAT’s per-axis profiles plus jaggedness help identify models suitable for particular economic roles (e.g., price-setting agents, bidding agents, market-makers).
Avoiding contamination and enabling stress-testing:
- Procedural generation allows fresh, uncontaminated game draws after any amount of offline tuning. For economic deployments exposed to adversarial or distribution-shifted counterparts, periodic fresh-draw evaluations provide a robust safety check.
Managing local volatility risk:
- Two models with similar mean performance can exhibit very different jaggedness: a locally jagged model may perform well on average but fail badly on specific, strategically nearby environments. In markets/auctions this can create catastrophic edge-case outcomes (unexpected collusion/anti-competitive behavior, failed negotiation responses, or large downside exposure). Deployers should prefer smoother models in high-stakes contexts or at least quantify and hedge jaggedness.
Tailoring model choice to task-specific axes:
- Use the six axes to match model strengths to application needs:
  - Opponent-rich/strategic adaptation tasks → prioritize models scoring high on opponent modeling axis.
  - High-stakes downside sensitivity (e.g., regulatory compliance, auctions with severe penalties) → prefer low-risk and low-brittleness profiles.
  - Long-horizon planning (multi-stage markets) → prioritize temporal-depth competence.
Operational recommendations:
- Incorporate procedural strategic evaluations into pre-deployment testing, with per-axis acceptance thresholds and a jaggedness tolerance.
- Run thinking-mode/compute-budget ablations: extra reasoning budget can improve strategic returns; calibrate reasoning/config knobs where allowed.
- Monitor post-deployment for performance drift across axes and for emergent jaggedness—re-evaluate with fresh procedural draws periodically.
- For regulators and auditors: require multi-axis strategic evaluation and local-volatility metrics (not only mean performance) for models intended as economic agents.

Overall, GENSTRAT supplies a reproducible, scalable framework to move from “how strong is the model on canonical games?” to “how will this model behave across a distribution of economically relevant strategic environments, and how brittle or locally volatile is that behavior?” This matters for safe, predictable deployment of LLMs as economic agents.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides robust, large-scale empirical evaluation within a well-defined synthetic distribution (2,000 generated games, 50 sampled benchmarks, >36,000 matches), so its findings about relative model performance and within-distribution volatility are well-supported; however, the evidence is limited to procedurally generated two-player zero-sum card games and does not establish causal effects on real-world economic outcomes or on broader classes of strategic environments.
Methods Rigor: high. Methodology is systematic and reproducible: a configurable generator creates a large game pool, a clear capability profile decomposes competence along six axes, jaggedness quantifies local volatility, and extensive head-to-head play across nine models (tens of thousands of matches) gives statistical power; remaining concerns are sampling choices (50/2,000), sensitivity to prompt/temperature and model API settings, and the restriction to a particular game family.
Sample: Procedurally generated distribution of two-player zero-sum imperfect-information card games (2,000 games total), with 50 benchmark games sampled from that pool; nine frontier and open-weight LLMs evaluated in a head-to-head tournament totaling over 36,000 matches; capability profiles computed across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, brittleness).
Themes: adoption, human_ai_collab
Generalizability:
- Synthetic game family (imperfect-information card games) may not reflect strategic structure of real-world marketplaces, auctions, or multi-agent economic settings.
- Only two-player zero-sum interactions considered; cooperative, multi-agent, or non-zero-sum economic environments are not covered.
- Findings depend on prompt engineering, model hyperparameters, and API behaviour (temperature, sampling), which can vary across deployments.
- Sampled benchmark size (50 games) may not capture full diversity of the 2,000-game distribution or other plausible game generators.
- Only nine models tested; results may not generalize to other or future model architectures and fine-tuning regimes.
- Offline head-to-head matches omit dynamics of learning in live deployments, human-in-the-loop interactions, and market-level feedback effects.

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings.	Adoption Rate	positive	high	deployment of LLMs as economic agents	0.09
Existing strategic-reasoning benchmarks evaluate models on fixed canonical games and may saturate as the frontier improves and fail to generalize to varied real-world strategic environments.	Other	negative	high	benchmark generalizability / benchmark saturation	0.03
We introduce GENSTRAT, which uses procedurally generated strategic environments to address the limitations of fixed benchmarks.	Other	positive	high	availability of procedurally generated strategic environments for evaluation	0.18
GENSTRAT generates a distribution of two-player zero-sum imperfect-information card games.	Other	neutral	high	game distribution (two-player zero-sum imperfect-information card games)	n=2000 0.18
The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination.	Other	positive	high	freshness/resistance-to-contamination of benchmarks	0.18
We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness).	Other	positive	high	decomposed capability profile across six axes	0.18
We introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games.	Output Quality	positive	high	within-distribution smoothness / local volatility (jaggedness)	0.18
We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches.	Other	null_result	high	evaluation sample size / tournament scale (matches run)	n=36000 0.3
Newer frontier-tier models score higher on average.	Output Quality	positive	high	average model score / overall strength	n=9 0.18
Models with near-identical overall strength show qualitatively different capability profiles.	Output Quality	mixed	high	differences in capability-profile axes (state space, temporal depth, information sensitivity, opponent modeling, risk, brittleness)	n=9 0.18
Two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength.	Output Quality	negative	high	local volatility / jaggedness	n=3 0.18
Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.	Organizational Efficiency	positive	high	diagnostic usefulness for deployment decisions	0.18