A new Pokemon-based benchmark exposes large capability gaps on multi-agent, partial-observability, and long-horizon planning tasks: specialist RL systems and human experts outperform generalist LLMs. A 20M+ trajectory dataset supports reproducible evaluation, and a NeurIPS 2025 competition confirms strong community interest.

The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin · March 16, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: n/a · Relevance: 7/10

PokeAgent is a large, Pokémon-based multi-agent benchmark (20M+ battle trajectories plus speedrun scenarios and orchestration tools) that reveals substantial gaps between generalist LLMs, specialist RL agents, and elite humans on partial-observability, game-theoretic, and long-horizon planning tasks and is validated through a live leaderboard and a NeurIPS competition.

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.

Summary

Main Finding

PokeAgent Challenge introduces a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously. The benchmark—split into a Battling Track (20M+ battle trajectories) and a Speedrunning Track (RPG long-horizon tasks and a multi-agent orchestration harness)—reveals substantial performance gaps between generalist LLMs, specialist RL agents, and elite humans, and measures capabilities not captured by standard LLM suites. The project is live (leaderboard + self-contained eval) and validated by a NeurIPS 2025 competition with 100+ teams.

Key Points

Two complementary tracks:
- Battling Track: competitive, partial-observability, game-theoretic battles; includes a 20M+ trajectory dataset and baseline agents (heuristic, RL, LLM).
- Speedrunning Track: long-horizon RPG tasks requiring sequential planning; includes an open-source multi-agent orchestration system for harness-based LLM comparisons.
Baselines span heuristic, reinforcement-learning, and LLM-based approaches; top-performing community submissions still leave a gap to elite human play.
BenchPress evaluation shows that Pokemon battling measures capabilities largely orthogonal to those covered by common LLM benchmarks.
Transitioned to a living benchmark: Battling has a live leaderboard; Speedrunning uses self-contained evaluation to ensure reproducibility.
Community interest validated by NeurIPS 2025 competition (100+ teams) and documented winning solutions.
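
The orthogonality point above can be made concrete with a small sketch. The summary does not describe how BenchPress computes its matrix, so the example below swaps in a plain Pearson correlation between per-model scores as a stand-in; the function name `pearson` and every score are hypothetical and are not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-model scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same five models on three benchmarks.
# These numbers are invented for illustration only.
standard_benchmark = [0.2, 0.4, 0.6, 0.8, 1.0]
similar_benchmark = [0.25, 0.45, 0.55, 0.85, 0.95]
battling_score = [0.8, 0.2, 1.0, 0.2, 0.8]

r_similar = pearson(standard_benchmark, similar_benchmark)
r_battling = pearson(standard_benchmark, battling_score)

print(f"standard vs similar benchmark: r = {r_similar:.2f}")
print(f"standard vs battling:          r = {r_battling:.2f}")
```

Under this proxy, a correlation near zero means that ranking models by the standard benchmark says little about how they rank at battling, which is the sense in which a benchmark is described as orthogonal to existing suites.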

Data & Methods

Datasets and assets:
- Battling Track dataset: >20 million recorded battle trajectories covering strategic interactions and partial observability.
- Speedrunning Track: standardized evaluation scenarios and tools for reproducible multi-agent orchestration.
Baselines:
- Heuristic rule-based agents for domain knowledge replication.
- RL agents trained for specialist play.
- LLM-based agents/harnesses for generalist approaches and modular orchestration.
Evaluation:
- Task-specific metrics: win rate (battling), completion time (speedruns), and strategic robustness.
- BenchPress matrix applied to quantify coverage relative to standard benchmarks, showing near-orthogonality for battling tasks.
- Competition-driven evaluation (NeurIPS 2025) to stress-test resource-constrained and novel approaches; results published with analyses of submissions.
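
The metrics listed under Evaluation are named here without formulas, so the following is only a minimal sketch of how win rate and completion time might be reported. The Wilson score interval, the helper names `win_rate_interval` and `speedrun_summary`, and all input numbers are assumptions made for illustration; they are not taken from the benchmark's code.

```python
from math import sqrt
from statistics import median


def win_rate_interval(wins, games, z=1.96):
    """Return (win rate, lower, upper) using a Wilson score interval.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return p, center - half, center + half


def speedrun_summary(times):
    """Summarize speedrun attempts.

    `times` holds a completion time in seconds per attempt,
    or None when the run did not finish.
    """
    finished = [t for t in times if t is not None]
    return {
        "completion_rate": len(finished) / len(times) if times else 0.0,
        "median_time": median(finished) if finished else None,
    }


# Hypothetical inputs: 60 wins in 100 battles, four speedrun attempts.
rate, low, high = win_rate_interval(wins=60, games=100)
summary = speedrun_summary([5400.0, None, 4800.0, 6100.0])

print(f"win rate {rate:.2f} (95% interval {low:.3f} to {high:.3f})")
print(summary)
```

Reporting an interval alongside the raw win rate makes it easier to judge whether a gap between two agents is larger than the sampling noise from a limited number of battles.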
Reproducibility:
- Open-source orchestration and evaluation harnesses for the Speedrunning Track.
- Live leaderboard and self-contained evaluation pipelines to maintain a living benchmark.

Implications for AI Economics

Incentives and investment allocation:
- Benchmarks like PokeAgent reallocate researcher and industry attention toward multi-agent, partial-observability, and long-horizon planning problems—likely increasing funding and compute investment in RL and hybrid LLM+RL methods.
- The clear performance gaps indicate high returns to specialized efforts (RL, domain-specific engineering) relative to generalist LLM-only approaches, shaping where teams invest labor and compute.
Market for talent and tools:
- Demand is likely to grow for engineers skilled in multi-agent RL, game-theoretic reasoning, simulator engineering, and LLM orchestration harnesses, raising wages and contract-work opportunities in these niches.
- Open-source orchestration lowers entry barriers, broadening participation and potentially compressing rents that would otherwise accrue to well-resourced incumbents.
Signaling and productization:
- Success on a high-visibility benchmark (and competition wins) becomes a stronger signal of practical multi-agent capability, affecting hiring, startup valuations, and academic prestige.
- Companies may productize benchmark innovations (simulators, evaluation suites, model stacks) creating new commercial tools/services.
Resource and externality considerations:
- Large-scale battlegrounds and competitions increase compute demand and associated costs, with implications for the budgets of teams and funders and for environmental externalities.
- Living benchmarks require ongoing maintenance; sustained funding models (grants, sponsorships, community contributions) will influence who controls the benchmark and which research directions are favored.
Specialization vs generalization economics:
- The observed orthogonality to standard LLM benchmarks suggests distinct returns to specialization: investing to close domain-specific gaps can be more commercially valuable than marginally improving generalist models for some applications.
- Conversely, modular LLM orchestration (if improved) could lower the marginal cost of entering complex decision-making tasks, shifting competitive dynamics toward platform and tooling advantages.
Policy and strategic research implications:
- The benchmark provides a testbed for studying strategic behavior, coordination failures, and market-like interactions among agents—useful for economic research on algorithmic markets and strategic automation.
- It can inform policy debates on automation risk in strategic tasks and on regulation for deployments of multi-agent systems in economic settings.


Assessment

Paper Type: descriptive
Evidence Strength: n/a. This work is a benchmark and empirical evaluation platform rather than a causal study; it documents capabilities and performance gaps but does not attempt causal identification of economic impacts.
Methods Rigor: high. Large-scale curated datasets (20M+ battle trajectories), multiple baseline classes (heuristic, RL, LLM), task-specific metrics, open-source orchestration/evaluation harnesses, a live leaderboard, and a NeurIPS 2025 competition with 100+ teams validate outcomes and improve reproducibility; the main limitations are domain-specificity and evolving "living" benchmark dynamics rather than shortcomings in experimental design.
Sample: Two-track dataset and evaluation suite: a Battling Track with >20 million recorded Pokemon battle trajectories capturing partial observability and strategic interactions; a Speedrunning Track with standardized long-horizon RPG scenarios and an open-source multi-agent orchestration harness. Baseline agents include heuristic rule-based controllers, specialist RL agents, and LLM-based orchestration agents. Results are further validated by a NeurIPS 2025 competition (100+ teams) and community submissions on a live leaderboard.
Themes: innovation, skills_training, labor_markets, adoption
Generalizability:
- Domain-specific to Pokemon game mechanics and simulator rules; may not map directly to real-world economic or strategic settings.
- Simulated, rule-based environment with constrained action/state spaces; real-world multi-agent environments may present different noise, stakes, and incentives.
- Performance gaps depend on available compute, simulator fidelity, and task design; well-resourced teams may exploit optimizations not feasible for all actors.
- Human baselines drawn from expert players of the game may not reflect broader workforce capabilities or incentives in economic contexts.
- As a living benchmark, task revisions and leaderboard dynamics can change difficulty and comparability over time.

Claims (16)

Claim	Category	Direction	Confidence	Outcome	Details
PokeAgent Challenge is a large, realistic multi-agent benchmark built on Pokemon that stresses partial observability, game-theoretic reasoning, and long-horizon planning simultaneously.	Other	positive	high	benchmark task characteristics (partial observability, game-theoretic complexity, horizon length)	0.03
The Battling Track dataset contains more than 20 million recorded battle trajectories.	Other	positive	high	number of recorded battle trajectories (>20,000,000)	n=20000000 0.03
The benchmark is split into two complementary tracks: a Battling Track (competitive, partial-observability battles) and a Speedrunning Track (long-horizon RPG tasks with a multi-agent orchestration harness).	Other	positive	high	benchmark partitioning (presence of Battling and Speedrunning tracks)	0.03
Baselines include heuristic rule-based agents, reinforcement-learning (RL) agents trained for specialist play, and LLM-based agents/harnesses for generalist approaches.	Other	positive	high	presence and types of baseline agents (heuristic, RL, LLM)	0.03
Top-performing community submissions (including baselines and competition entries) still leave a performance gap relative to elite human play on battling tasks.	Output Quality	negative	medium	performance gap measured primarily by win-rate (Battling) and strategic robustness metrics	0.02
BenchPress evaluation shows Pokemon battling evaluates capabilities largely orthogonal to common LLM benchmarks (i.e., it stresses different skill sets).	Other	mixed	medium	coverage/overlap metric from BenchPress matrix comparing PokeAgent Battling to standard LLM benchmarks	0.02
The project is a living benchmark: the Battling Track has a live leaderboard and the Speedrunning Track uses self-contained evaluation to ensure reproducibility.	Other	positive	high	presence of live leaderboard and self-contained evaluation pipelines	0.03
Community interest in the benchmark was validated by a NeurIPS 2025 competition with 100+ teams and published analyses of winning submissions.	Adoption Rate	positive	high	number of competing teams (100+), availability of competition analyses/winning solutions	n=100 0.03
Speedrunning Track includes an open-source multi-agent orchestration system and standardized evaluation scenarios for reproducible multi-agent comparisons.	Other	positive	high	availability of open-source orchestration code and standardized evaluation scenarios	0.03
Evaluation metrics for the benchmark include task-specific metrics such as win-rate for battling and completion time for speedruns, as well as strategic robustness measures.	Other	positive	high	evaluation metrics used (win-rate, completion time, strategic robustness)	0.03
Open-source orchestration and evaluation harnesses plus a self-contained evaluation pipeline improve reproducibility for the Speedrunning Track.	Research Productivity	positive	medium	reproducibility capability via released code and self-contained pipelines	0.02
Benchmarks like PokeAgent will reallocate researcher and industry attention toward multi-agent, partial-observability, and long-horizon planning problems, likely increasing funding and compute investment in RL and hybrid LLM+RL methods.	Innovation Output	positive	speculative	predicted shifts in researcher/industry attention and investment (qualitative forecast)	0.0
The clear performance gaps indicate high returns to specialized efforts (RL, domain-specific engineering) relative to generalist LLM-only approaches, shaping where teams invest labor and compute.	Adoption Rate	positive	speculative	economic return on investment inference based on performance differences between specialist methods and LLM-only approaches	0.0
Open-source orchestration lowers entry barriers, broadening participation and potentially compressing rents that would otherwise accrue to well-resourced incumbents.	Market Structure	positive	speculative	predicted change in barrier-to-entry and market rents (qualitative)	0.0
Large-scale battlegrounds and competitions increase compute demand and associated costs, with implications for budgets and environmental externalities.	Fiscal And Macroeconomic	negative	speculative	predicted increase in compute demand and related costs/externalities (qualitative)	0.0
The benchmark provides a testbed useful for studying strategic behavior, coordination failures, and market-like interactions among agents, which can inform economic research and policy.	Research Productivity	positive	speculative	utility of benchmark as a research/testbed for studying strategic/multi-agent phenomena (qualitative)	0.0