Endpoint-level differences — not just model families — drive large swings in accuracy, latency and cost: the same model served from different endpoints can vary by over 10 percentage points in accuracy, an order of magnitude in tail latency, and multiple-fold in energy and dollars per correct answer, changing which endpoints look best once realistic workloads are considered.
Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). The framework's novelty is empirical and methodological. Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code, in fingerprint similarity to first party by up to 12 points, in tail latency by an order of magnitude, and in modeled joules per correct answer by a factor of 6.2. We further show that workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1), and the reasoning preset (1:5) elevates frontier closed models that the chat preset penalizes on price. We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication.
Summary
Main Finding
TokenArena introduces an endpoint-level, workload-aware, energy-inclusive continuous benchmark for inference that reveals large, economically meaningful variation hidden by model- or provider-level leaderboards. Measuring at the (provider, model, SKU, precision, decoding, region) endpoint granularity and combining five operational and quality axes into composites (joules per correct answer, dollars per correct answer, and endpoint fidelity) materially changes which deployments are optimal under real workloads and detects silent deployment differences (e.g., undisclosed quantization).
Key Points
- Unit of analysis: the endpoint e = (provider, model, sku, precision, decoding, region). The same model name across endpoints can differ substantially in speed, cost, quality, latency tails, effective context, and energy.
- Token thesis: joint economic and cognitive accounting at the token level. Headline metrics:
  - JCA(e) = j_e · T_e / A_e (joules per correct answer)
  - CCA(e) = p_e · T_e / A_e (dollars per correct answer)
  - where j_e = joules per output token, p_e = blended dollar price per token, T_e = tokens-to-solution, A_e = accuracy.
- Five core measured factors per endpoint: output speed, time-to-first-token (TTFT / TTFV), workload-blended price, effective context, and live-endpoint quality (multi-task eval suite).
- Endpoint fidelity: fingerprinting compares token-level output distributions on a fixed reference set against a first-party reference using symmetrized KL divergence; each endpoint is flagged as faithful, drifted, or quantized/modified.
- Workload presets: multiple presets (chat, RAG, reasoning, voice, coding, batch, etc.) each recompute blended prices and composite weights; rankings change substantially by workload.
- Modeled energy: per-endpoint J/token modeled from public TDP, utilization (vendor-disclosed or conservative default), PUE, sparsity, and regional grid intensity (ElectricityMaps).
- Findings demonstrated empirically across v1.0 registry: 78 endpoints, 12 model families, 33 providers.
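The two headline formulas in the list above can be sketched in a few lines. This is a minimal illustration, assuming accuracy is expressed as a fraction in (0, 1]; the function names and the input values are hypothetical and are not TokenArena measurements.

```python
def joules_per_correct_answer(j_per_token: float, tokens_to_solution: float, accuracy: float) -> float:
    """JCA(e) = j_e * T_e / A_e, with accuracy given as a fraction in (0, 1]."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in (0, 1]")
    return j_per_token * tokens_to_solution / accuracy


def dollars_per_correct_answer(price_per_token: float, tokens_to_solution: float, accuracy: float) -> float:
    """CCA(e) = p_e * T_e / A_e, with p_e the blended dollar price per token."""
    if not 0.0 < accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in (0, 1]")
    return price_per_token * tokens_to_solution / accuracy


# Hypothetical endpoint: 0.5 J/token, $2 per 1M tokens, 1,000 tokens per solution, 80% accuracy.
jca = joules_per_correct_answer(0.5, 1000, 0.8)
cca = dollars_per_correct_answer(2e-6, 1000, 0.8)
print(f"JCA = {jca:.1f} J, CCA = ${cca:.4f}")
```

Dividing by accuracy is what ties cost to task success: an endpoint that is cheaper per token but less accurate can still be more expensive per correct answer.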
Selected empirical highlights (v1.0 snapshot, example gpt-oss-120B, n=19):
- Output speed varied by up to 12×; blended price by 3.3×; modeled J/token by 3.4×.
- Joules per correct answer varied by a factor of 6.2; dollars per correct answer by 5.0×.
- Accuracy gaps of up to ~12.5 percentage points on hard reasoning/code (AIME 2025), driven by SKU differences (FP8 Turbo vs BF16).
- Fingerprinting separated FP8/Turbo SKUs from BF16 at fidelity F ≈ 92 vs ≈ 99.7 (ref); math and code drops of 4–9 points were observed even though MMLU-style tests showed little difference.
- Top-10 leaderboard overlap across workload presets is often only 30–40%; workload choice materially reorders providers.
- Sensitivity: ±10 percentage point perturbations in factor weights produce small rank shifts (the leader is invariant); ablations show price and quality are the dominant influences on the top of the leaderboard.
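To see why the workload preset can reorder endpoints, consider a minimal sketch of ratio-weighted blended pricing. The input:output ratios for chat (3:1), retrieval-augmented (20:1), and reasoning (1:5) come from the abstract; the weighted-average formula, the endpoint names, and the list prices are assumptions made for illustration, not TokenArena's published definition or data.

```python
# Input:output token ratios per workload preset (from the abstract).
PRESETS = {"chat": (3, 1), "rag": (20, 1), "reasoning": (1, 5)}

# Hypothetical list prices in dollars per 1M tokens: (input_price, output_price).
ENDPOINTS = {
    "endpoint_a": (0.10, 0.90),  # cheap input, expensive output
    "endpoint_b": (0.25, 0.40),  # pricier input, cheaper output
}


def blended_price(input_price: float, output_price: float, ratio: tuple) -> float:
    """Ratio-weighted average price per token (an assumed blending rule)."""
    r_in, r_out = ratio
    return (r_in * input_price + r_out * output_price) / (r_in + r_out)


def cheapest(preset: str) -> str:
    """Name of the endpoint with the lowest blended price under a preset."""
    ratio = PRESETS[preset]
    return min(ENDPOINTS, key=lambda name: blended_price(*ENDPOINTS[name], ratio))


for preset in PRESETS:
    print(preset, "->", cheapest(preset))
```

With these made-up prices, the input-heavy retrieval preset favors the endpoint with cheap input tokens, while the chat and reasoning presets favor the endpoint with cheaper output tokens.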
Data & Methods
- Registry and coverage: 78 live endpoints across 33 providers and 12 model families; intentionally oversampled gpt-oss-120B (19 endpoints) and Llama 3.3 70B (16) to enable within-model analyses.
- Three independent measurement loops:
  - Probe (continuous, every 5 min): TTFT, per-token timing, throughput at multiple input lengths (1K/10K/100K), concurrency {1,10,100}, regions (US-East, EU-Central, APAC-Singapore); response hashes to detect drift/errors.
  - Eval (daily/weekly): daily compact high-signal subsets (e.g., GSM8K-1k, HumanEval+, MATH-100); weekly full suite including MMLU-Pro, GPQA-Diamond, MATH-500, AIME2025, LiveCodeBench, IFBench, AA-LCR, τ2-Bench Telecom; captures accuracy, tokens-to-solution, cost, and fingerprint distributions.
  - Energy/Pricing (daily): collects published list prices (input/output/cached), regional grid intensity, and modeled J/token.
- Composite scoring: cohort-normalized (min–max within model class) five factors combined by preset-specific weights into TAπ(e) (speed, TTFT, price, quality, reliability). Normalization prevents small, fast models from dominating frontier-model leaderboards on raw speed.
- Endpoint fidelity: compute token-level output distributions at temperature 0 for a fixed 1,024‑prompt reference; symmetrized KL divergence to first‑party reference -> fidelity F (0–100) with thresholds for faithful/drifted/modified.
- Modeled energy formula (summary): j_e = TDP_e × u_e × PUE_e × (1 − σ_e) / tokens_per_sec_e, where j_e is joules per output token; kWh and gCO2 per 1M tokens are computed using the ElectricityMaps 30-day average grid intensity, with a conservative bias toward higher energy where inputs are uncertain.
- Reproducibility: all schema, probe/eval harness, modeled energy table, and v1.0 leaderboard snapshot released under CC BY 4.0; full provenance and limitations documented.
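The modeled energy formula in the list above can be sketched as follows. This is a minimal illustration, assuming TDP is in watts, that TDP and throughput refer to the same unit of hardware, and that grid intensity is in gCO2 per kWh; the function names and all input values are hypothetical.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


def joules_per_token(tdp_watts: float, utilization: float, pue: float,
                     sparsity: float, tokens_per_sec: float) -> float:
    """j_e = TDP_e * u_e * PUE_e * (1 - sigma_e) / tokens_per_sec_e."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tdp_watts * utilization * pue * (1.0 - sparsity) / tokens_per_sec


def per_million_tokens(j_per_token: float, grid_g_co2_per_kwh: float) -> tuple:
    """Return (kWh, gCO2) for 1M output tokens at the given grid intensity."""
    kwh = j_per_token * 1_000_000 / JOULES_PER_KWH
    return kwh, kwh * grid_g_co2_per_kwh


# Hypothetical inputs: 700 W TDP, 60% utilization, PUE 1.2, no sparsity,
# 1,000 tokens/s, and a grid intensity of 400 gCO2/kWh.
je = joules_per_token(700, 0.6, 1.2, 0.0, 1000)
kwh, g_co2 = per_million_tokens(je, 400)
print(f"{je:.3f} J/token, {kwh:.3f} kWh and {g_co2:.1f} gCO2 per 1M tokens")
```

Because throughput sits in the denominator, a faster endpoint on the same hardware has a lower modeled J/token, which is one reason speed and energy rankings tend to move together under this model.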
Implications for AI Economics
- Procurement & supplier selection: procurement decisions based solely on model name or provider-level leaderboards risk large cost/quality/energy errors. Buyers should evaluate endpoints under their actual workload presets (input:output ratios) and include JCA and CCA (joules and dollars per correct answer), not just $/M tokens.
- Cost modelling and pricing strategies: workload-aware blended pricing can reorder economic rankings (e.g., RAG workflows with high input ratios benefit providers with cheaper input pricing). Providers can optimize SKU and pricing to win particular workload niches.
- Energy accounting and carbon policy: JCA provides an operational energy-efficiency metric linked to task success, which is useful for corporate carbon accounting and for regulators seeking production-relevant footprints (note, however, that TokenArena uses modeled, not metered, energy).
- Market differentiation and competition: custom-silicon deployments (Cerebras, Groq) and certain serverless SKUs can be competitively advantaged on TTFT, price, and energy, while closed frontier models can dominate reasoning workloads where quality per correct answer matters more than raw price.
- Transparency and market trust: fingerprinting highlights the economic value of SKU and precision transparency. Undisclosed quantization can materially degrade task outcomes and thus economic metrics — suggesting a role for disclosure standards, procurement clauses, or third‑party verification.
- Risk management: tail latency, reliability, and silent fidelity drift are economically relevant (affecting user experience and cost per correct answer). Contracts and SLAs should include endpoint-level metrics (TTFT/TTFV, P99s, fidelity) rather than provider/model averages.
- Benchmarking practice: public inference benchmarks should shift from model/provider aggregates to endpoint- and workload-specific measures and include energy-per-correct-answer to align benchmarking with deployment economics.
Limitations (not exhaustive): energy is modeled, not metered; probes originate from limited regions and fixed concurrency profiles; fidelity relies on agreed first-party references (some open-weight cases use the highest-fidelity third-party reference); the framework is a methodology rather than an immutable ranking, so results depend on preset choices and probe conditions.
Overall, TokenArena supplies a practical, repeatable measurement stack for connecting inference operational choices (SKU, quantization, region, serving stack) to economic and environmental metrics that matter to buyers and policymakers.
Assessment
Claims (10)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Public inference benchmarks compare AI systems at the model and provider level, but the unit at which deployment decisions are actually made is the endpoint: the (provider, model, stock-keeping-unit) tuple at which a specific quantization, decoding strategy, region, and serving stack is exposed. [Organizational Efficiency] | null_result | high | granularity of benchmarking vs. deployment decision unit (endpoint = provider, model, SKU) | 0.03 |
| We introduce TokenArena, a continuous benchmark that measures inference at endpoint granularity along five core axes (output speed, time to first token, workload-blended price, effective context, and quality on the live endpoint) and synthesizes them, together with a modeled energy estimate, into three headline composites: joules per correct answer, dollars per correct answer, and endpoint fidelity (output-distribution similarity to a first-party reference). [Organizational Efficiency] | positive | high | five core axes (output speed, time to first token, workload-blended price, effective context, quality on live endpoint) and composites (joules per correct answer, dollars per correct answer, endpoint fidelity) | 0.3 |
| Across 78 endpoints serving 12 model families, the same model on different endpoints differs in mean accuracy by up to 12.5 points on math and code. [Output Quality] | mixed | high | mean accuracy on math and code benchmarks | n=78; up to 12.5 points; 0.18 |
| The same model on different endpoints differs in fingerprint similarity to first party by up to 12 points. [AI Safety and Ethics] | mixed | high | fingerprint similarity to first-party reference (endpoint fidelity) | n=78; up to 12 points; 0.18 |
| Across 78 endpoints, the same model on different endpoints differs in tail latency by an order of magnitude. [Task Completion Time] | mixed | high | tail latency | n=78; an order of magnitude; 0.18 |
| Modeled joules per correct answer varies by a factor of 6.2 across endpoints. [Organizational Efficiency] | mixed | high | joules per correct answer (modeled energy efficiency) | n=78; factor of 6.2; 0.18 |
| Workload-aware blended pricing reorders the leaderboard substantially: 7 of 10 top-ranked endpoints under the chat preset (3:1 input:output) fall out of the top 10 under the retrieval-augmented preset (20:1). [Adoption Rate] | mixed | high | change in top-10 endpoint rankings between workload presets | n=10; 7 of 10 top-ranked endpoints fall out of the top 10; 0.18 |
| The reasoning preset (1:5 input:output) elevates frontier closed models that the chat preset penalizes on price. [Adoption Rate] | positive | medium | leaderboard ranking changes for frontier closed models under the reasoning preset versus chat preset | 0.11 |
| We release the framework, schema, probe and eval harness, and a v1.0 leaderboard snapshot under CC BY 4.0. [Other] | positive | high | availability of TokenArena artifacts and leaderboard under CC BY 4.0 | 0.3 |
| TokenArena is a methodology, not a single ranking; we publish full provenance and limitations and welcome external replication. [Other] | positive | high | positioning of TokenArena as a methodological framework with published provenance and limitations | 0.3 |