
In a timed multi-stage Risk tournament, Gemini-3.1-pro-preview won 20 of 32 matches against three rivals, but when every planner's output was executed on a common scaffold, planner performance was statistically indistinguishable. The result implies that system-level execution, objective tracking, and runtime reliability, not planning ability alone, determine live-agent effectiveness.

Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

H. C. Ekne · May 21, 2026

arXiv · quasi_experimental · medium evidence · 7/10 relevance

In timed multi-phase Risk matches, gemini-3.1-pro-preview won 20/32 games against three competitors, but when execution was standardized the planners performed similarly—implying that end-to-end execution behavior (objective tracking, execution conversion, runtime reliability and cost) drives much of the observed provider spread.

Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p approx 1.5 x 10^-5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality (p approx 0.821), which indicates that much of the earlier provider spread came from end-to-end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.

Summary

Main Finding

When LLMs are evaluated as live strategic agents in a timed, multi-phase Risk environment, end-to-end system behavior (planning plus execution, runtime, cost, and failure modes) drives outcomes more than one-shot, benchmark-style capability does. In a replicated 32-game cross-provider championship under a frozen timed protocol, gemini-3.1-pro-preview dominated (20/32 wins; pooled equal-strength null p ≈ 1.5×10^-5). However, when execution was standardized on a cheap Gemini Flash scaffold (planner-only bakeoff), planner win rates compressed and were statistically consistent with near-equality (pooled 32-game omnibus p ≈ 0.821). Trace analyses suggest that Gemini’s advantage came from (a) much more explicit objective tracking in its plans and (b) converting more turns into deep conquest chains during execution. A hybrid architecture (strong planner plus cheap executor) kept the full stack's strength in the cost-gate runs while cutting estimated API cost by more than half.

Key Points

Core empirical outcomes
- Full-stack cross-provider championship (32 games, pooled from two independent 16-game blocks): Gemini 20, GPT-5.1 6, Claude 4, Kimi 2. Omnibus equal-winner test p ≈ 1.5×10^-5; pairwise two-sided p-values separating Gemini from GPT-5.1, Claude, and Kimi were ≈0.00936, ≈0.00154, and ≈0.000121, respectively.
- Planner-only bakeoff with execution fixed to Gemini 3 Flash (32 games): Claude+Flash 10 wins, Gemini+Flash 8, GPT‑5.5+Flash 8, Kimi+Flash 6 — pooled result consistent with near equality (p ≈ 0.821).
- Gemini execution cost gate (15 games): hybrid (Gemini 3.1 plan + Gemini 3 Flash exec) won 8/15; gemini-3.1-pro-full won 4/15; gemini-3-flash-full won 3/15. Estimated total cost: $6.28 (3.1-pro-full), $2.80 (hybrid), $2.08 (flash full). The hybrid cut cost by about 55% relative to the expensive full stack and won more games than either full stack in this small sample.
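The two omnibus comparisons above can be checked with an exact multinomial test against a uniform-winner null. This is a minimal sketch, not the paper's procedure: the summary says the authors used winner-label permutation and Monte Carlo checks, whereas this version enumerates every possible win vector and orders them by the sum of squared counts (equivalent to Pearson's chi-square ordering when all expected counts are equal). The function names are illustrative.

```python
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative integers that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def equal_strength_p_value(wins):
    """Exact p-value for a win vector under a uniform-winner null.

    Counts the equally likely winner sequences whose win vector is at least
    as uneven as the observed one, then divides by the number of sequences.
    """
    games, players = sum(wins), len(wins)
    observed = sum(w * w for w in wins)  # unevenness of the observed vector
    tail = 0
    for counts in compositions(games, players):
        if sum(c * c for c in counts) >= observed:
            ways = factorial(games)  # multinomial coefficient, built exactly
            for c in counts:
                ways //= factorial(c)
            tail += ways
    return tail / players ** games


championship_p = equal_strength_p_value([20, 6, 4, 2])  # full-stack championship
bakeoff_p = equal_strength_p_value([10, 8, 8, 6])       # planner-only bakeoff
print(f"championship p = {championship_p:.2e}, bakeoff p = {bakeoff_p:.3f}")
```

Under these assumptions the bakeoff value comes out near the reported 0.821, and the championship value is on the order of 10^-5. Small differences from the reported figures are expected because the testing procedure is not identical.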
Trace/mechanism findings
- Goal-tracking language: Gemini used explicit endgame-goal language in 58.5% of saved plans (quantified goal language in 54.5%); Claude ≈3%, Kimi ≈1–1.4%, GPT‑5.1 ≈0.4–0.8%. Gemini’s goal references increased as it neared victory (≈39.8% at 0–9 territories, 69.9% at 10–19, 100% at 20+).
- Turn conversion / execution depth: Gemini produced deep conquest chains (6+ successful conquests) on 38.7% of turns; Claude 28.9%, Kimi 26.7%, GPT‑5.1 23.4%. Midgame territory gains per turn (when starting with 10–19 territories): Gemini ≈5.36, Kimi ≈4.50, GPT‑5.1 ≈4.15, Claude ≈4.07.
- Turns that included explicit endgame goals averaged ~5.19 territories gained versus ~4.08 for turns without them. This is a correlation, not proof of causation.
Operational observations
- Live loops penalize overlong planning, format brittleness, timeouts, and weak plan→execution conversion. These operational factors explain much of the provider spread in full‑stack play.
- Decomposed architectures (strong/expensive planner + cheap/fast executor) are cost‑effective in repeated-action workflows.
Reproducibility
- Runs, traces, scripts, and artifacts are available in the repository and public Dropbox links (paper includes exact paths and scripts).

Data & Methods

Domain and harness
- Environment: multi‑player Risk engine with standardized legality assistance and output grammar.
- Timers: 90s planning, 90s execution turn, 15s placement. Victory target used in main runs: 65% territory control (also tracked final territory totals).
- Turn loop: pre‑turn planning, card trade, troop placement, repeated attacks, fortify. Seat rotation enabled.
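The timer regime above can be pictured as a wrapper that gives each model call a fixed budget and substitutes a fallback when the budget runs out. The sketch below is an assumption about how such a harness might look, not the paper's code: `run_phase`, `play_turn`, the fallback strings, and the planner and executor callables are all hypothetical, and only the planning and execution phases are wired up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as PhaseTimeout

# Budgets in seconds, taken from the timers listed above.
PHASE_BUDGETS = {"planning": 90.0, "placement": 15.0, "execution": 90.0}


def run_phase(phase, model_call, fallback, budgets=PHASE_BUDGETS):
    """Run one model call under its phase budget.

    Returns (result, used_fallback). On timeout the fallback is returned
    and the late answer is ignored.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(model_call)
    try:
        return future.result(timeout=budgets[phase]), False
    except PhaseTimeout:
        return fallback, True
    finally:
        pool.shutdown(wait=False)  # do not block the turn on a slow call


def play_turn(planner, executor, budgets=PHASE_BUDGETS):
    """Plan once, then execute against that plan, counting fallbacks."""
    plan, plan_fell_back = run_phase("planning", planner, "no plan", budgets)
    action, exec_fell_back = run_phase(
        "execution", lambda: executor(plan), "end turn", budgets
    )
    return {
        "plan": plan,
        "action": action,
        "fallbacks": int(plan_fell_back) + int(exec_fell_back),
    }


# Demo with tiny budgets: the planner overruns, so the executor sees the fallback.
demo_budgets = {"planning": 0.05, "execution": 1.0}
slow_planner = lambda: time.sleep(0.5) or "late plan"
fast_executor = lambda plan: f"attack using: {plan}"
turn = play_turn(slow_planner, fast_executor, demo_budgets)
print(turn)
```

Because the planner and executor are separate callables, the same loop can pair an expensive planner with a cheaper executor, which is the decomposition the hybrid experiments rely on.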
Experiments
- Core experiments: 32‑game cross‑provider championship (two 16‑game blocks), 32‑game planner bakeoff (execution standardized to Gemini Flash), 15‑game Gemini execution cost gate, 3×16‑game Kimi anchor comparisons, plus detailed trace analyses (946 saved turn summaries analyzed).
Endpoints and statistics
- Primary endpoint: wins under configured victory condition.
- Secondary endpoints: territory totals, successful/failed attacks, attack‑turn rate, fallback counts, invalid move rates, execution chain depth, trace rubric, and estimated API cost (when logging allowed).
- Tests: winner-label permutation tests and Monte Carlo checks for omnibus comparisons; exact binomial tests for pairwise comparisons; descriptive and trace statistics for mechanisms. The paper reports p-values and win vectors for the key comparisons.
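The pairwise exact binomial tests can be reproduced from the championship win counts alone. The sketch below assumes each pairwise test conditions on the games won by either of the two models and uses a fair-coin null; that reading is an inference from the reported numbers, not something the summary states. The function and dictionary names are illustrative.

```python
from math import comb


def two_sided_binomial_p(wins_a, wins_b):
    """Exact two-sided binomial p-value for a head-to-head win split at p = 0.5."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1))
    # The null is symmetric, so doubling the upper tail gives the two-sided value.
    return min(1.0, 2 * upper_tail / 2 ** n)


# Championship win counts: Gemini 20, GPT-5.1 6, Claude 4, Kimi 2.
pairwise_p = {
    "gemini_vs_gpt": two_sided_binomial_p(20, 6),
    "gemini_vs_claude": two_sided_binomial_p(20, 4),
    "gemini_vs_kimi": two_sided_binomial_p(20, 2),
}
for pair, p in pairwise_p.items():
    print(f"{pair}: p = {p:.3g}")
```

With these inputs the three values agree with the reported ≈0.00936, ≈0.00154, and ≈0.000121 to the precision given.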
Trace analysis
- Textual analysis of saved plan.text and turn summaries; counted explicit/quantified goal language, normalized by plan length and board state (territory counts), and aggregated execution chain depth and conversion metrics.
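A rough picture of the goal-language count, bucketed by board state, is sketched below. The paper's actual rubric is not given in this summary, so the regular expression, the bucket boundaries as code, and the sample plans are all hypothetical stand-ins chosen only to show the shape of the computation.

```python
import re
from collections import defaultdict

# Hypothetical pattern for explicit endgame-goal language; the paper's rubric may differ.
GOAL_PATTERN = re.compile(r"\b(victory|win condition|to win|target)\b", re.IGNORECASE)


def territory_bucket(territories_held):
    """Map a territory count to the buckets used in the reported breakdown."""
    if territories_held < 10:
        return "0-9"
    if territories_held < 20:
        return "10-19"
    return "20+"


def goal_rate_by_bucket(plans):
    """Share of plans with goal language per bucket.

    `plans` is an iterable of (territories_held, plan_text) pairs.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for territories_held, text in plans:
        bucket = territory_bucket(territories_held)
        totals[bucket] += 1
        if GOAL_PATTERN.search(text):
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up plans for illustration only; these are not traces from the paper.
sample_plans = [
    (5, "Reinforce Australia and hold."),
    (8, "Push toward the victory target."),
    (12, "Need 6 more territories to win."),
    (14, "Break the Europe bonus."),
    (22, "Take 3 territories to reach victory."),
]
rates = goal_rate_by_bucket(sample_plans)
print(rates)
```

A keyword pattern like this only sees visible plan text, which matches the limitation noted for the trace work: it measures what a model writes, not what it internally reasons about.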
Limitations (methodological)
- Single domain (Risk) and one main timer/regime; one legality assistance strategy and prompt grammar family.
- Observational trace work: only visible text analyzed, not internal latent reasoning.
- Generalization to other domains, timer regimes, and different prompt/assist strategies is limited.

Implications for AI Economics

Value is systemic, not just model-level. Benchmark rankings that treat models as isolated responders can misstate deployment value because they omit runtime, format adherence, failure rates, latency, and cost under repeated use. Procurement and investment decisions should evaluate models as components in workflows with budgeted repeated actions.
Cost-performance tradeoffs favor modular/hybrid architectures. A stronger planner paired with a cheaper executor preserved most win probability at materially lower per‑game API cost. For production systems that require repeated actions (high throughput or many turns), separating expensive reasoning from cheaper execution can substantially improve ROI.
Pricing and product design matter for real-world competitiveness. Models that are marginally better on static benchmarks may be less valuable if they are slower, more failure‑prone, or much costlier when used repeatedly in time‑bounded loops.
Benchmark design should include operational metrics. Evaluations for economic decisions should incorporate runtime reliability, format compliance, timeout/fallback rates, conversion efficiency (plan→action), and per‑task cost under realistic repeated use.
Competitive positioning and diffusion: apparent parity on static benchmarks can mask operational gaps; buyers and integrators should require live-workflow evaluations (timed, repeated phases, and execution scaffolds) when comparing providers for agentic or automation use cases.
Policy and procurement recommendations: include hybrid/architectural evaluations and cost-per‑unit‑of‑outcome metrics (e.g., cost per win / cost per successful action) in RFPs and model audits; require trace logs to assess objective‑tracking behavior and failure recovery.
Research priorities: develop standardized agentic benchmarks that measure not just final answers but plan‑to‑execution conversion, format robustness, and per‑action cost under time budgets; study causality between explicit goal tracking in plans and downstream conversion/utility.
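The cost-per-win metric suggested above can be worked through with the cost-gate figures from the Key Points. This sketch assumes the three estimated totals are on the same basis (totals over the 15-game gate); if they are per-game figures the absolute values scale, but the ratios between stacks do not change. The dictionary layout and function name are illustrative.

```python
# Wins and estimated cost (USD) per stack, from the 15-game Gemini cost gate.
COST_GATE = {
    "gemini-3.1-pro-full": {"wins": 4, "cost": 6.28},
    "hybrid": {"wins": 8, "cost": 2.80},
    "gemini-3-flash-full": {"wins": 3, "cost": 2.08},
}


def cost_per_win(results):
    """Estimated cost divided by wins for each stack."""
    return {name: r["cost"] / r["wins"] for name, r in results.items()}


per_win = cost_per_win(COST_GATE)
# Fractional cost reduction of the hybrid relative to the expensive full stack.
hybrid_saving = 1 - COST_GATE["hybrid"]["cost"] / COST_GATE["gemini-3.1-pro-full"]["cost"]

for name, value in per_win.items():
    print(f"{name}: ${value:.2f} per win")
print(f"hybrid cost reduction vs full stack: {hybrid_saving:.1%}")
```

On these numbers the hybrid costs about $0.35 per win against $1.57 for the full Gemini 3.1 stack and roughly $0.69 for the all-Flash stack. With only 15 games, these ratios are indicative rather than precise.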


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The paper uses a controlled, replicated experimental setup and a clear isolation exercise (standardized execution) that supports causal claims about planner versus execution contributions within the game environment; however, the total number of games is modest (32), potential non-independence of matches and provider runtime variability could bias results, and the findings are constrained to a synthetic, timed Risk environment and specific model versions, limiting external validity.

Methods Rigor: medium. The design includes replication, pre-specified frozen rules, hypothesis tests, and a targeted decomposition (planner vs execution), which are strengths; but the paper does not report (in the summary) details on randomization procedures, power calculations, controls for match-level dependence, robustness checks or corrections for multiple comparisons, and runtime/cost confounds that could affect fairness between providers.

Sample: Synthetic experimental dataset from a timed, multi-phase Risk-style environment: a replicated 32-game cross-provider championship among four model endpoints (gemini-3.1-pro-preview, gpt-5.1, claude-opus-4-7, kimi-k2.6) under frozen rules; an additional 32-game planner bakeoff where planning outputs from each provider were executed on a common, cheaper Gemini Flash scaffold; saved planning and execution traces for trace-level mechanistic analysis.

Themes: human_ai_collab, productivity

Identification: Controlled, replicated gameplay comparisons across providers with frozen rules and time limits; statistical test of the pooled winner distribution against an equal-strength null for overall differences; a secondary "planner bakeoff" that standardizes execution on a single (Gemini Flash) scaffold to isolate planning ability from end-to-end execution; mechanistic inference from saved planning and execution traces (frequency of objective mentions, conversion of turns into conquest chains) to explain observed performance differences.

Generalizability
- Findings are specific to a synthetic, timed Risk environment and may not generalize to real-world tasks or other task genres.
- Only four provider model versions were evaluated; results may change with model updates or other architectures.
- Small number of games (32) limits statistical power for granular comparisons and robustness to stochasticity.
- Standardizing execution on a Gemini Flash scaffold may itself favor or penalize certain planners, limiting neutrality of the isolation.
- Runtime reliability, cost structure, and API behavior differ across real production deployments and over time, reducing temporal and ecological validity.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Static benchmarks capture only part of how large language models behave in practice.	Other	negative	high	coverage_of_model_behavior_by_static_benchmarks	0.48
We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles.	Other	null_result	high	experimental_environment_description	0.8
In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p approx 1.5 x 10^-5).	Other	positive	high	wins / pooled winner distribution	n=32 20 of 32 games (p approx 1.5 x 10^-5) 0.8
When execution is standardized on a cheaper Gemini Flash scaffold (separating planning from execution), a pooled 32-game planner bakeoff is consistent with near-equality (p approx 0.821).	Other	null_result	high	planner performance equality (pooled test)	n=32 p approx 0.821 0.8
Much of the earlier provider spread came from end-to-end system behavior rather than planning alone.	Other	mixed	high	source_of_provider_performance_spread (end-to-end vs planning)	n=32 0.48
Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches.	Decision Quality	positive	medium	frequency_of_terminal_objective_references (objective tracking)	0.29
Gemini converts more turns into deep conquest chains, even though it is not the cleanest runtime.	Output Quality	positive	medium	conversion_rate_of_turns_into_deep_conquest_chains	0.29
Live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, supporting evaluation of LLMs as components in bounded workflows rather than as isolated benchmark respondents.	Other	positive	high	factors_affecting_live-agent_performance	0.48