In a timed multi-stage Risk tournament, gemini-3.1-pro-preview won 20 of 32 matches against three rivals, but when every planner's output was executed on a common scaffold, planner performance was statistically indistinguishable. The result implies that system-level execution, objective tracking, and runtime reliability, not planning ability alone, determine live-agent effectiveness.
Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p approx 1.5 x 10^-5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality (p approx 0.821), which indicates that much of the earlier provider spread came from end-to-end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.
Summary
Main Finding
When LLMs are evaluated as live strategic agents in a timed, multi-phase Risk environment, end-to-end system behavior (planning plus execution, runtime, cost, and failure modes) drives outcomes more than one-shot, benchmark-style capability. In a replicated 32-game cross‑provider championship under a frozen timed protocol, gemini-3.1-pro-preview dominated (20/32 wins; pooled equal‑strength null p ≈ 1.5×10^-5). However, when execution was standardized on a cheap Gemini Flash scaffold (planner-only bakeoff), planner win rates compressed and were statistically consistent with near‑equality (pooled 32‑game omnibus p ≈ 0.821). Trace analyses indicate that Gemini’s advantage came from (a) far more explicit objective tracking in its plans and (b) converting more turns into deep conquest chains during execution. A hybrid architecture (strong planner plus cheap executor) kept most of the strength while cutting API cost by more than half.
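As a rough check on the two omnibus results quoted above, the equal-strength null can be evaluated by brute-force enumeration. The sketch below is an exact multinomial test, used here as a stand-in for the permutation and Monte Carlo checks the paper describes; it is not the authors' script. The function names and the squared-deviation statistic are assumptions made for illustration, and the resulting p-values need not match the reported ones digit for digit. The win vectors are the ones given in this summary.

```python
from math import factorial


def compositions(total, parts):
    """Yield every tuple of `parts` non-negative ints that sums to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest


def spread(counts):
    """Imbalance statistic, proportional to Pearson chi-square against equal
    expected counts. Kept in integers so ties are compared exactly."""
    n, k = sum(counts), len(counts)
    return sum((k * c - n) ** 2 for c in counts)


def exact_equal_strength_p(wins):
    """P(spread >= observed) when each game's winner is uniform over players."""
    n, k = sum(wins), len(wins)
    observed = spread(wins)
    extreme_ways = 0
    for counts in compositions(n, k):
        if spread(counts) >= observed:
            # Multinomial coefficient: number of game sequences with these counts.
            ways = factorial(n)
            for c in counts:
                ways //= factorial(c)
            extreme_ways += ways
    return extreme_ways / k ** n


championship = (20, 6, 4, 2)   # Gemini, GPT-5.1, Claude, Kimi
planner_bakeoff = (10, 8, 8, 6)  # Claude+Flash, Gemini+Flash, GPT+Flash, Kimi+Flash

p_championship = exact_equal_strength_p(championship)
p_bakeoff = exact_equal_strength_p(planner_bakeoff)
print(f"championship: p = {p_championship:.3g}")
print(f"planner bakeoff: p = {p_bakeoff:.3g}")
```

With four players and 32 games there are only 6,545 possible win vectors, so full enumeration is cheap and avoids simulation noise. A different test statistic would give somewhat different p-values.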
Key Points
- Core empirical outcomes
- Full-stack cross‑provider championship (32 games, pooled from two independent 16‑game blocks): Gemini 20, GPT‑5.1 6, Claude 4, Kimi 2. Omnibus equal‑winner test p ≈ 1.5×10^-5; pairwise two‑sided p-values separating Gemini from GPT‑5.1, Claude, and Kimi were ≈0.00936, ≈0.00154, and ≈0.000121, respectively.
- Planner-only bakeoff with execution fixed to Gemini 3 Flash (32 games): Claude+Flash 10 wins, Gemini+Flash 8, GPT‑5.5+Flash 8, Kimi+Flash 6 — pooled result consistent with near equality (p ≈ 0.821).
- Gemini execution cost gate (15 games): hybrid (Gemini 3.1 plan + Gemini 3 Flash exec) won 8/15; gemini-3.1-pro-full won 4/15; gemini-3-flash-full won 3/15. Estimated total cost: $6.28 (3.1-pro-full), $2.80 (hybrid), $2.08 (flash full). Hybrid cut cost by >50% relative to the expensive full stack while retaining most wins.
- Trace/mechanism findings
- Goal-tracking language: Gemini used explicit endgame-goal language in 58.5% of saved plans (quantified goal language in 54.5%); Claude ≈3%, Kimi ≈1–1.4%, GPT‑5.1 ≈0.4–0.8%. Gemini’s goal references increased as it neared victory (≈39.8% at 0–9 territories, 69.9% at 10–19, 100% at 20+).
- Turn conversion / execution depth: Gemini produced deep conquest chains (6+ successful conquests) on 38.7% of turns; Claude 28.9%, Kimi 26.7%, GPT‑5.1 23.4%. Midgame territory gains per turn (when starting with 10–19 territories): Gemini ≈5.36, Kimi ≈4.50, GPT‑5.1 ≈4.15, Claude ≈4.07.
- Turns that included explicit endgame goals averaged ~5.19 territories gained vs ~4.08 without — correlational, not causal proof.
- Operational observations
- Live loops penalize overlong planning, format brittleness, timeouts, and weak plan→execution conversion. These operational factors explain much of the provider spread in full‑stack play.
- Decomposed architectures (strong/expensive planner + cheap/fast executor) are cost‑effective in repeated-action workflows.
- Reproducibility
- Runs, traces, scripts, and artifacts are available in the repository and public Dropbox links (paper includes exact paths and scripts).
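The pairwise p-values listed under the core empirical outcomes can be checked with an exact binomial test. The sketch below assumes each pairwise comparison conditions on the games won by either of the two players and tests their split against a fair coin; that conditioning is an inference from the numbers, not something this summary states, and the function name is hypothetical. Under that assumption the three reported values are reproduced to three significant figures.

```python
from math import comb


def two_sided_binomial_p(wins_a, wins_b):
    """Exact two-sided p-value against a fair-coin null, conditioning on the
    games won by either A or B. The null is symmetric at p = 0.5, so the
    two-sided value is twice the upper tail, capped at 1."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Championship win counts as given in this summary.
championship = {"gemini": 20, "gpt-5.1": 6, "claude": 4, "kimi": 2}

for rival in ("gpt-5.1", "claude", "kimi"):
    p = two_sided_binomial_p(championship["gemini"], championship[rival])
    print(f"gemini vs {rival}: p = {p:.3g}")
# gemini vs gpt-5.1: p = 0.00936
# gemini vs claude: p = 0.00154
# gemini vs kimi: p = 0.000121
```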
Data & Methods
- Domain and harness
- Environment: multi‑player Risk engine with standardized legality assistance and output grammar.
- Timers: 90s planning, 90s execution turn, 15s placement. Victory target used in main runs: 65% territory control (also tracked final territory totals).
- Turn loop: pre‑turn planning, card trade, troop placement, repeated attacks, fortify. Seat rotation enabled.
- Experiments
- Core experiments: 32‑game cross‑provider championship (two 16‑game blocks), 32‑game planner bakeoff (execution standardized to Gemini Flash), 15‑game Gemini execution cost gate, 3×16‑game Kimi anchor comparisons, plus detailed trace analyses (946 saved turn summaries analyzed).
- Endpoints and statistics
- Primary endpoint: wins under configured victory condition.
- Secondary endpoints: territory totals, successful/failed attacks, attack‑turn rate, fallback counts, invalid move rates, execution chain depth, trace rubric, and estimated API cost (when logging allowed).
- Tests: winner‑label permutation tests and Monte Carlo checks for omnibus comparisons; exact binomial tests for pairwise comparisons; descriptive and trace statistics for mechanisms. The paper reports p-values and win vectors for key comparisons.
- Trace analysis
- Textual analysis of saved plan.text and turn summaries; counted explicit/quantified goal language, normalized by plan length and board state (territory counts), and aggregated execution chain depth and conversion metrics.
- Limitations (methodological)
- Single domain (Risk) and one main timer/regime; one legality assistance strategy and prompt grammar family.
- Observational trace work: only visible text analyzed, not internal latent reasoning.
- Generalization to other domains, timer regimes, and different prompt/assist strategies is limited.
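The goal-language counts described under trace analysis could be approximated with a keyword pass over saved plans, grouped by how many territories the planner held. The sketch below is illustrative only: the regular expressions, function names, and toy plan strings are hypothetical, since this summary does not give the paper's rubric or trace schema. The territory buckets follow the ranges quoted in the key points (0–9, 10–19, 20+).

```python
import re
from collections import defaultdict

# Hypothetical patterns standing in for the paper's unstated rubric.
GOAL_PATTERN = re.compile(
    r"\b(?:victory|win the game|to win|target)\b", re.IGNORECASE
)
QUANTIFIED_PATTERN = re.compile(
    r"\b\d+\s*(?:more\s+)?territor(?:y|ies)\b|\b\d+\s*%", re.IGNORECASE
)


def bucket(territories):
    """Map a territory count to the board-state ranges used in the summary."""
    if territories < 10:
        return "0-9"
    if territories < 20:
        return "10-19"
    return "20+"


def language_rates(plans, pattern):
    """plans: iterable of (plan_text, territories_held) pairs.
    Returns {bucket: share of plans in that bucket matching `pattern`}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for text, territories in plans:
        b = bucket(territories)
        totals[b] += 1
        if pattern.search(text):
            hits[b] += 1
    return {b: hits[b] / totals[b] for b in totals}


# Toy plans written for this example; they are not taken from the saved traces.
toy_plans = [
    ("Fortify Brazil and hold the border.", 6),
    ("Take Ukraine; that moves us toward victory.", 8),
    ("Need 4 more territories to win, attack Siam.", 15),
    ("Break Europe's bonus, then target the 65% threshold.", 22),
]

goal_rates = language_rates(toy_plans, GOAL_PATTERN)
quantified_rates = language_rates(toy_plans, QUANTIFIED_PATTERN)
print(goal_rates)
print(quantified_rates)
```

A keyword pass like this only sees visible plan text, which matches the limitation noted above: it says nothing about reasoning the model did not write down.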
Implications for AI Economics
- Value is systemic, not just model-level. Benchmark rankings that treat models as isolated responders can misstate deployment value because they omit runtime, format adherence, failure rates, latency, and cost under repeated use. Procurement and investment decisions should evaluate models as components in workflows with budgeted repeated actions.
- Cost-performance tradeoffs favor modular/hybrid architectures. A stronger planner paired with a cheaper executor preserved most win probability at materially lower per‑game API cost. For production systems that require repeated actions (high throughput or many turns), separating expensive reasoning from cheaper execution can substantially improve ROI.
- Pricing and product design matter for real-world competitiveness. Models that are marginally better on static benchmarks may be less valuable if they are slower, more failure‑prone, or much costlier when used repeatedly in time‑bounded loops.
- Benchmark design should include operational metrics. Evaluations for economic decisions should incorporate runtime reliability, format compliance, timeout/fallback rates, conversion efficiency (plan→action), and per‑task cost under realistic repeated use.
- Competitive positioning and diffusion: apparent parity on static benchmarks can mask operational gaps; buyers and integrators should require live-workflow evaluations (timed, repeated phases, and execution scaffolds) when comparing providers for agentic or automation use cases.
- Policy and procurement recommendations: include hybrid/architectural evaluations and cost-per‑unit‑of‑outcome metrics (e.g., cost per win / cost per successful action) in RFPs and model audits; require trace logs to assess objective‑tracking behavior and failure recovery.
- Research priorities: develop standardized agentic benchmarks that measure not just final answers but plan‑to‑execution conversion, format robustness, and per‑action cost under time budgets; study causality between explicit goal tracking in plans and downstream conversion/utility.
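The cost-per-win metric suggested above can be worked through with the Gemini execution cost gate figures from the key points. This is a minimal sketch: the `Arm` class and its field names are hypothetical, and it assumes each estimated total cost covers that arm's play across the same 15-game gate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    name: str
    wins: int
    total_cost_usd: float

    @property
    def cost_per_win(self):
        """Estimated dollars spent per game won; infinite if the arm never won."""
        return self.total_cost_usd / self.wins if self.wins else float("inf")


# Wins (out of 15 games) and estimated costs as reported in this summary.
arms = [
    Arm("gemini-3.1-pro-full", wins=4, total_cost_usd=6.28),
    Arm("hybrid (3.1 plan + 3 Flash exec)", wins=8, total_cost_usd=2.80),
    Arm("gemini-3-flash-full", wins=3, total_cost_usd=2.08),
]

for arm in arms:
    print(f"{arm.name}: ${arm.cost_per_win:.2f} per win")

# Relative saving of the hybrid against the expensive full stack.
hybrid_saving = 1 - arms[1].total_cost_usd / arms[0].total_cost_usd
print(f"hybrid saving vs full stack: {hybrid_saving:.1%}")
```

On these figures the hybrid comes out at about $0.35 per win against about $1.57 for the full expensive stack and about $0.69 for the all-Flash stack, though 15 games is a small sample for any per-win estimate.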
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Static benchmarks capture only part of how large language models behave in practice. [Other] | negative | high | coverage_of_model_behavior_by_static_benchmarks | 0.48 |
| We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. [Other] | null_result | high | experimental_environment_description | 0.8 |
| In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p approx 1.5 x 10^-5). [Other] | positive | high | wins / pooled winner distribution | n=32; 20 of 32 games (p approx 1.5 x 10^-5); 0.8 |
| When execution is standardized on a cheaper Gemini Flash scaffold (separating planning from execution), a pooled 32-game planner bakeoff is consistent with near-equality (p approx 0.821). [Other] | null_result | high | planner performance equality (pooled test) | n=32; p approx 0.821; 0.8 |
| Much of the earlier provider spread came from end-to-end system behavior rather than planning alone. [Other] | mixed | high | source_of_provider_performance_spread (end-to-end vs planning) | n=32; 0.48 |
| Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. [Decision Quality] | positive | medium | frequency_of_terminal_objective_references (objective tracking) | 0.29 |
| Gemini converts more turns into deep conquest chains, even though it is not the cleanest runtime. [Output Quality] | positive | medium | conversion_rate_of_turns_into_deep_conquest_chains | 0.29 |
| Live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, supporting evaluation of LLMs as components in bounded workflows rather than as isolated benchmark respondents. [Other] | positive | high | factors_affecting_live-agent_performance | 0.48 |