Endpoint-level differences — not just model families — drive large swings in accuracy, latency and cost: the same model served from different endpoints can vary by over 10 percentage points in accuracy, an order of magnitude in tail latency, and multiple-fold in energy and dollars per correct answer, changing which endpoints look best once realistic workloads are considered.

Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

Yuxuan Gao, Megan Wang, Yi Ling Yu · May 01, 2026

arXiv · descriptive · medium evidence · relevance 8/10

TokenArena is a released benchmarking framework that measures inference performance and economics at endpoint granularity and shows large endpoint-to-endpoint variation in accuracy, latency, price and modeled energy that materially reshuffles cost-performance leaderboards under different workloads.

Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework's novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. We further show that workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price. We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.

Summary

Main Finding

Token Arena introduces an endpoint-level, workload-aware, energy-inclusive continuous benchmark for inference that reveals large, economically meaningful variation hidden by model- or provider-level leaderboards. Measuring at the (provider, model, SKU, precision, decoding, region) endpoint granularity and combining five operational and quality axes into composites (joules per correct answer, dollars per correct answer, and endpoint fidelity) materially changes which deployments are optimal under real workloads and detects silent deployment differences (e.g., undisclosed quantization).

Key Points

Unit of analysis: the endpoint e = (provider, model, sku, precision, decoding, region). The same model name across endpoints can differ substantially in speed, cost, quality, latency tails, effective context, and energy.
Token thesis: joint economic and cognitive accounting at the token level. Headline metrics:
- JCA(e) = j_e · T_e / A_e (joules per correct answer)
- CCA(e) = p_e · T_e / A_e (dollars per correct answer)
where j_e = joules per output token, p_e = blended dollar price per token, T_e = tokens-to-solution, A_e = accuracy.
Five core measured factors per endpoint: output speed, time-to-first-token (TTFT / TTFV), workload-blended price, effective context, and live-endpoint quality (multi-task eval suite).
Endpoint fidelity: fingerprinting by comparing token-level output distributions on a fixed reference set to a first‑party reference using symmetrized KL; flags for faithful / drifted / quantized or modified.
Workload presets: multiple presets (chat, RAG, reasoning, voice, coding, batch, etc.) each recompute blended prices and composite weights; rankings change substantially by workload.
Modeled energy: per-endpoint J/token modeled from public TDP, utilization (vendor-disclosed or conservative default), PUE, sparsity, and regional grid intensity (ElectricityMaps).
Findings demonstrated empirically across v1.0 registry: 78 endpoints, 12 model families, 33 providers.
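
The two headline composites defined in the Key Points reduce to simple arithmetic. The sketch below is illustrative only: the `EndpointStats` container, the function names, and every numeric value are hypothetical and are not taken from the TokenArena harness or leaderboard. It shows how an endpoint that looks cheaper per token can still cost more per correct answer once accuracy is factored in.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    """Inputs to the headline composites for one endpoint (hypothetical values)."""
    joules_per_token: float    # modeled joules per output token (j in JCA)
    price_per_token: float     # blended dollar price per token (p in CCA)
    tokens_to_solution: float  # tokens-to-solution (T)
    accuracy: float            # fraction of tasks answered correctly (A), in (0, 1]


def _check_accuracy(accuracy: float) -> None:
    # Both composites divide by accuracy, so zero or out-of-range values are rejected.
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be in (0, 1]")


def joules_per_correct_answer(s: EndpointStats) -> float:
    """JCA = joules per token * tokens-to-solution / accuracy."""
    _check_accuracy(s.accuracy)
    return s.joules_per_token * s.tokens_to_solution / s.accuracy


def dollars_per_correct_answer(s: EndpointStats) -> float:
    """CCA = blended price per token * tokens-to-solution / accuracy."""
    _check_accuracy(s.accuracy)
    return s.price_per_token * s.tokens_to_solution / s.accuracy


# Two made-up endpoints serving the same model at different precision.
bf16_like = EndpointStats(0.4, 6e-7, 2000, 0.8)
fp8_like = EndpointStats(0.3, 4e-7, 2000, 0.5)

for name, stats in (("bf16_like", bf16_like), ("fp8_like", fp8_like)):
    print(name,
          round(joules_per_correct_answer(stats), 1), "J/correct",
          round(dollars_per_correct_answer(stats), 6), "$/correct")
```

With these made-up numbers the lower-precision endpoint uses fewer joules and fewer dollars per token, yet its lower accuracy makes it worse on both per-correct-answer metrics.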

Selected empirical highlights (v1.0 snapshot, example gpt-oss-120B, n=19):
- Output speed variation up to 12×; blended price 3.3×; modeled J/token 3.4×.
- J/correct answer varied by a factor of 6.2; $/correct answer by 5.0×.
- Accuracy gaps up to ~12.5 percentage points on hard reasoning/code (AIME 2025) driven by SKU differences (FP8 Turbo vs BF16).
- Fingerprint separated FP8/Turbo SKUs from BF16 at ~92 F vs ~99.7 ref; math and code drops of 4–9 points observed despite MMLU-style tests showing little difference.
- Top-10 leaderboard overlap across workload presets often only 30–40%; workload choice materially reorders providers.
- Sensitivity: ±10 percentage point perturbations in factor weights produce small rank shifts (leader invariant); ablations show price and quality are the dominant influences on top-of-leaderboard.
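
The fingerprint scores quoted above rest on a symmetrized KL divergence between an endpoint's token-level output distributions and those of a first-party reference. The following is a minimal sketch of that divergence only. It assumes the symmetrization is the average of the two KL directions and uses a small smoothing constant; both choices, along with all function names and the toy distributions, are assumptions of this example. The mapping from divergence to the 0–100 fidelity score is not given in this summary and is deliberately left out.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    eps keeps the ratio finite when q assigns zero probability to a token
    (the smoothing constant is an assumption of this sketch).
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def symmetrized_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Average of KL(p || q) and KL(q || p), one common symmetrization."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def mean_symmetrized_kl(reference: Sequence[Sequence[float]],
                        candidate: Sequence[Sequence[float]]) -> float:
    """Mean symmetrized KL over aligned token positions."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("need the same non-zero number of positions")
    total = sum(symmetrized_kl(r, c) for r, c in zip(reference, candidate))
    return total / len(reference)


# Hypothetical next-token distributions over a 3-token vocabulary at two positions.
first_party = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
faithful = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
drifted = [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]]

print("faithful:", mean_symmetrized_kl(first_party, faithful))
print("drifted: ", mean_symmetrized_kl(first_party, drifted))
```

An endpoint that reproduces the reference distributions scores zero divergence, while any shift in probability mass, such as one caused by quantization, yields a strictly positive value.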

Data & Methods

Registry and coverage: 78 live endpoints across 33 providers and 12 model families; intentionally oversampled gpt-oss-120B (19 endpoints) and Llama 3.3 70B (16) to enable within-model analyses.
Three independent measurement loops:
- Probe (continuous, every 5 min): TTFT, per-token timing, throughput at multiple input lengths (1K/10K/100K), concurrency {1,10,100}, regions (US‑East, EU‑Central, APAC‑Singapore); response hashes to detect drift/errors.
- Eval (daily/weekly): daily compact high‑signal subsets (e.g., GSM8K-1k, HumanEval+, MATH-100); weekly full suite including MMLU-Pro, GPQA-Diamond, MATH-500, AIME2025, LiveCodeBench, IFBench, AA-LCR, τ2-Bench Telecom; captures accuracy, tokens-to-solution, cost, and fingerprint distributions.
- Energy/Pricing (daily): collects published list prices (input/output/cached), regional grid intensity, and modeled J/token.
Composite scoring: cohort-normalized (min–max within model class) five factors combined by preset-specific weights into TAπ(e) (speed, TTFT, price, quality, reliability). Normalization prevents small, fast models from dominating frontier-model leaderboards on raw speed.
Endpoint fidelity: compute token-level output distributions at temperature 0 for a fixed 1,024‑prompt reference set, then map the symmetrized KL divergence from the first‑party reference to a fidelity score F (0–100), with thresholds for faithful/drifted/modified.
Modeled energy formula (summary): j_e = TDP_e × u_e × PUE_e × (1 − σ_e) / tokens_per_sec_e; kWh and gCO2 per 1M tokens computed using ElectricityMaps 30‑day average grid intensity; conservative bias toward higher energy where uncertain.
Reproducibility: all schema, probe/eval harness, modeled energy table, and v1.0 leaderboard snapshot released under CC BY 4.0; full provenance and limitations documented.
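
The modeled energy formula summarized in this section can be checked with a short calculation. The sketch below is not the released energy table or harness: the function names and all input values are hypothetical, and it assumes that the TDP and the throughput refer to the same hardware unit, so that watts divided by tokens per second yields joules per token.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def joules_per_token(tdp_watts: float, utilization: float, pue: float,
                     sparsity: float, tokens_per_sec: float) -> float:
    """Modeled J/token = TDP * u * PUE * (1 - sigma) / tokens_per_sec."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tdp_watts * utilization * pue * (1.0 - sparsity) / tokens_per_sec


def kwh_per_million_tokens(j_per_token: float) -> float:
    """Convert joules per token into kWh per one million tokens."""
    return j_per_token * 1_000_000 / JOULES_PER_KWH


def gco2_per_million_tokens(j_per_token: float, grid_gco2_per_kwh: float) -> float:
    """Scale energy per million tokens by regional grid intensity (gCO2/kWh)."""
    return kwh_per_million_tokens(j_per_token) * grid_gco2_per_kwh


# Hypothetical endpoint: 700 W TDP, 60% utilization, PUE 1.2, no sparsity
# discount, 1000 tokens/s, on a grid averaging 400 gCO2/kWh.
j = joules_per_token(700.0, 0.6, 1.2, 0.0, 1000.0)
print(round(j, 3), "J/token")
print(round(kwh_per_million_tokens(j), 3), "kWh per 1M tokens")
print(round(gco2_per_million_tokens(j, 400.0), 1), "gCO2 per 1M tokens")
```

Because every factor multiplies the result, an optimistic utilization or PUE default lowers the estimate proportionally, which is why a conservative bias toward higher values matters when inputs are uncertain.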

Implications for AI Economics

Procurement & supplier selection: procurement decisions based solely on model name or provider-level leaderboards risk large cost/quality/energy errors. Buyers should evaluate endpoints under their actual workload presets (input:output ratios) and include JCA and CCA (joules and dollars per correct answer), not just $/M tokens.
Cost modelling and pricing strategies: workload-aware blended pricing can reorder economic rankings (e.g., RAG workflows with high input ratios benefit providers with cheaper input pricing). Providers can optimize SKU and pricing to win particular workload niches.
Energy accounting and carbon policy: JCA provides an operational energy efficiency metric linked to task success — useful for corporate carbon accounting and for regulators seeking production-relevant footprints (but note Token Arena uses modeled, not metered, energy).
Market differentiation and competition: custom-silicon deployments (Cerebras, Groq) and certain serverless SKUs can be competitively advantaged on TTFT, price, and energy, while closed frontier models can dominate reasoning workloads where quality per correct answer matters more than raw price.
Transparency and market trust: fingerprinting highlights the economic value of SKU and precision transparency. Undisclosed quantization can materially degrade task outcomes and thus economic metrics — suggesting a role for disclosure standards, procurement clauses, or third‑party verification.
Risk management: tail latency, reliability, and silent fidelity drift are economically relevant (affecting user experience and cost per correct answer). Contracts and SLAs should include endpoint-level metrics (TTFT/TTFV, P99s, fidelity) rather than provider/model averages.
Benchmarking practice: public inference benchmarks should shift from model/provider aggregates to endpoint- and workload-specific measures and include energy-per-correct-answer to align benchmarking with deployment economics.
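
To see why workload-aware pricing can reorder rankings, consider a minimal sketch of a blended price. The exact blending formula is not given in this summary, so the code assumes a token-weighted average of input and output list prices using each preset's input:output ratio. The ratios (3:1 chat, 20:1 retrieval-augmented, 1:5 reasoning) come from the abstract; the endpoint names and prices are made up.

```python
from typing import Dict, List, Tuple

# input:output token ratios per workload preset, as quoted in the abstract
PRESETS: Dict[str, Tuple[int, int]] = {
    "chat": (3, 1),
    "rag": (20, 1),
    "reasoning": (1, 5),
}

# Hypothetical list prices in dollars per 1M tokens: (input, output)
ENDPOINTS: Dict[str, Tuple[float, float]] = {
    "endpoint_a": (0.10, 0.90),  # cheap input, expensive output
    "endpoint_b": (0.20, 0.50),  # pricier input, cheaper output
}


def blended_price(input_price: float, output_price: float, preset: str) -> float:
    """Token-weighted average price (assumed formula, not confirmed by the source)."""
    w_in, w_out = PRESETS[preset]
    return (w_in * input_price + w_out * output_price) / (w_in + w_out)


def rank_by_price(endpoints: Dict[str, Tuple[float, float]], preset: str) -> List[str]:
    """Endpoint names sorted from cheapest to most expensive under a preset."""
    return sorted(endpoints, key=lambda name: blended_price(*endpoints[name], preset))


for preset in PRESETS:
    print(preset, rank_by_price(ENDPOINTS, preset))
```

Under these made-up prices `endpoint_b` is cheapest for chat and reasoning, while `endpoint_a` wins the input-heavy retrieval-augmented preset, mirroring the point that cheaper input pricing favors RAG workloads.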

Limitations (not exhaustive): energy is modeled, not metered; probes originate from limited regions and fixed concurrency profiles; fidelity relies on agreed first‑party references (some open-weight cases use highest-fidelity third-party reference); the framework is a methodology rather than an immutable ranking — results depend on preset choices and probe conditions.

Overall, Token Arena supplies a practical, repeatable measurement stack for connecting inference operational choices (SKU, quantization, region, serving stack) to economic and environmental metrics that matter to buyers and policymakers.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides systematic, reproducible measurements across 78 real-world endpoints and multiple tasks, supporting descriptive claims about heterogeneity in latency, accuracy, price and modeled energy; however, several metrics (energy, blended price) are modeled rather than directly measured, the endpoint sample is not exhaustive, and results are sensitive to workload presets and rapidly changing commercial configurations.
Methods Rigor: high. The authors develop a multi-axis benchmarking framework, measure a large set of endpoints and model families, publish the probe/eval harness and provenance, and report multiple complementary metrics and presets, indicating careful design and transparency, though some components rely on assumptions (energy and blended-pricing models) that introduce uncertainty.
Sample: Empirical benchmark over 78 deployed endpoints spanning 12 model families; evaluation uses probes on math and code benchmarks (among other tasks), measures five core axes (output speed, time-to-first-token, workload-blended price, effective context, and live-endpoint quality), and synthesizes those with a modeled energy estimate into composites (joules per correct answer, dollars per correct answer, endpoint fidelity); includes three workload presets (chat, retrieval-augmented, reasoning) and publishes a v1.0 leaderboard and full provenance.
Themes: adoption, productivity
Generalizability:
- Endpoints sampled (78) and 12 model families may not represent the full commercial landscape; many providers, regions, or SKU configurations are absent.
- Workload presets (input:output ratios, prompt styles) may not match particular real-world applications, altering leaderboard rank order.
- Modeled energy and blended pricing rely on assumptions and public pricing, which can differ from negotiated enterprise contracts and actual data-center efficiency.
- Results are time-sensitive: endpoint implementations, quantization, decoding defaults, and prices change frequently and can make snapshots quickly stale.
- Closed-source/first-party reference behavior and regional serving-stack differences limit direct reproducibility for some endpoints.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed.	Organizational Efficiency	null_result	high	granularity of benchmarking vs. deployment decision unit (endpoint = provider, model, SKU)	0.03
We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference).	Organizational Efficiency	positive	high	five core axes (output speed, time to first token, workload-blended price, effective context, quality on live endpoint) and composites (joules per correct answer, dollars per correct answer, endpoint fidelity)	0.3
Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code.	Output Quality	mixed	high	mean accuracy on math and code benchmarks	n=78 up to 12.5 points 0.18
The same model on different endpoints differs in fingerprint similarity to first party by up to 12 points.	AI Safety and Ethics	mixed	high	fingerprint similarity to first-party reference (endpoint fidelity)	n=78 up to 12 points 0.18
Across 78 endpoints, the same model on different endpoints differs in tail latency by an order of magnitude.	Task Completion Time	mixed	high	tail latency	n=78 an order of magnitude 0.18
Modeled joules per correct answer varies by a factor of 6.2 across endpoints.	Organizational Efficiency	mixed	high	joules per correct answer (modeled energy efficiency)	n=78 factor of 6.2 0.18
Workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1).	Adoption Rate	mixed	high	change in top-10 endpoint rankings between workload presets	n=10 7 of 10 top-ranked endpoints fall out of the top 10 0.18
The reasoning preset (1:5 input:output) elevates frontier closed models that the chat preset penalizes on price.	Adoption Rate	positive	medium	leaderboard ranking changes for frontier closed models under the reasoning preset versus chat preset	0.11
We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0.	Other	positive	high	availability of TokenArena artifacts and leaderboard under CC BY 4.0	0.3
TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.	Other	positive	high	positioning of TokenArena as a methodological framework with published provenance and limitations	0.3