Per-token billing for commercial LLMs is vulnerable to large-scale overcharging: providers who conceal the model, tokenizer, or execution can inflate reported token usage by up to 1,469% on average, turning a $100 honest bill into roughly $1,569; honest pricing will require evidence not controlled by the provider (e.g., attestation, cryptographic proofs, or third-party re-execution).

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya · May 28, 2026

Per-token billing for commercial LLMs is hard to audit because providers can hide core artifacts (model, tokenizer, execution), which enables systematic, large over-reporting of billed tokens: simulations show average inflation as high as 1,469%, and tokenization ambiguity alone permits 50.85% over-reporting below the detection threshold.

Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.

Summary

Main Finding

Per-token billing for LLMs is economically vulnerable when providers control the artifacts auditors rely on. Across three recent auditing frameworks (CoIn, PALACE, and a martingale-based statistical auditor), a financially motivated provider can cheaply and reliably over-report billed token counts while passing audits. Attacks produced average over-reporting up to 1,469% in the CoIn setting, ~247% in PALACE-style settings that trust provider-supplied training data, and 50.85% even when users can see the full reasoning string. Small per-query inflations scale to large overcharges at current reasoning prices (example: a $100 honest bill → ≈ $1,569 with CoIn-style inflation).
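
The bill arithmetic behind these figures can be sketched in a few lines. This is a minimal illustration, assuming the bill scales linearly with reported tokens at a fixed per-token price and that the whole bill is subject to the inflation; the function name `inflated_bill` and the $100 baseline for the PALACE and visible-reasoning rows are illustrative choices, not figures from the paper.

```python
def inflated_bill(honest_bill: float, inflation_pct: float) -> float:
    """Bill after token counts are over-reported by inflation_pct percent.

    Assumes (for illustration) that cost is linear in the reported token count.
    """
    return honest_bill * (1 + inflation_pct / 100)


# Inflation percentages quoted in the summary above.
scenarios = [
    ("CoIn-style", 1469.0),
    ("PALACE-style", 247.0),
    ("visible reasoning string", 50.85),
]

for name, pct in scenarios:
    print(f"{name}: $100.00 honest -> ${inflated_bill(100.0, pct):,.2f} billed")
```

Under that linearity assumption, an inflation of p percent multiplies the bill by (1 + p/100), which is why 1,469% inflation maps $100 to roughly $1,569 rather than $1,469.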

Key Points

Trust paradox: audits require evidence, but commercial providers hide model, tokenizer, and execution for IP, safety, and privacy. Auditors therefore validate provider-supplied artifacts — exactly the data the provider can manipulate.
CoIn (semantic-similarity based): provider can (a) duplicate or perturb honest reasoning blocks, (b) select or synthesize high-scoring blocks, or (c) fabricate answer-conditioned blocks. Measured mean inflation by attack variant: RB 549.9%, DA 1469.0%, HSB 1004.0%, T5-generated 770.0%, hash-unique variants and others 536.8–984.8%. Attacks take seconds per sample and evade both rule-based and learned verifiers.
PALACE (predict hidden length from prompt+answer; auditor trained on provider-supplied auxiliary data):
- Inference-time steering: appending innocuous tokens (e.g., "boxed", "nil", "9") or rewriting the answer's style increases predicted token counts. For example, appending "boxed" added 82.4 predicted tokens, and style rewrites increased predicted length by ~30.3% (long) and ~16.6% (verbose).
- Training-time poisoning: injecting poisoned provider-supplied auxiliary data can backdoor the auditor, producing targeted overestimates. The paper reports up to ~247% inflation in settings where the auditor trusts provider-supplied data.
Martingale/statistical auditor (user sees full reasoning): tokenization ambiguity and selective per-sample inflation still permit ~50.85% over-reporting below detection thresholds.
Defenses that only change or harden the learned checks (stronger embeddings, finer discriminators, canonicalizing answers, duplicate-hash rejection) are insufficient because the provider has white-box access to the checks and can adapt cheaply.
Remedies suggested: bind token reports to evidence outside provider control — e.g., trusted execution attestation (TEEs), cryptographic proofs of inference, or deterministic third-party re-execution.
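
To make the point about duplicate-hash rejection concrete, the sketch below shows why an exact-hash filter does not stop near-duplicate blocks. It is a toy illustration under stated assumptions: the SHA-256-over-raw-text filter, the helper names, and the trailing-whitespace perturbation are all hypothetical stand-ins, not the checks or perturbations used in the paper.

```python
import hashlib


def block_hash(block: str) -> str:
    """Hash of the raw block text (assumed form of a duplicate-hash check)."""
    return hashlib.sha256(block.encode("utf-8")).hexdigest()


def reject_exact_duplicates(blocks):
    """Keep only the first occurrence of each distinct hash."""
    seen, kept = set(), []
    for block in blocks:
        digest = block_hash(block)
        if digest not in seen:
            seen.add(digest)
            kept.append(block)
    return kept


honest = "First, factor the quadratic into two linear terms."

# Verbatim copies collapse to a single block under the hash filter.
exact_copies = [honest] * 4

# Hypothetical perturbation: append 0..3 trailing spaces so every hash differs
# while the visible content is unchanged.
perturbed = [honest + " " * i for i in range(4)]

print(len(reject_exact_duplicates(exact_copies)))  # 1
print(len(reject_exact_duplicates(perturbed)))     # 4
```

Normalizing whitespace before hashing would close this particular gap, but a provider who can see the check can move to the next cheapest perturbation, which is the adaptation problem the key points describe.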

Data & Methods

Frameworks evaluated: CoIn (Sun et al. 2025), PALACE (Wang et al. 2025), martingale-based statistical auditor (Velasco et al. 2025).
Datasets: Glaive reasoning-v1-20m (used in CoIn and PALACE evaluations).
Attack techniques:
- CoIn: block duplication, hash-unique duplicates (minor perturbations to alter hashes), highest-scoring-block selection, generative fabrication (fine-tuned T5, LoRA Qwen2.5-1.5B).
- PALACE inference attacks: append trigger tokens; rewrite answers using a separate model (Qwen3-14B) to change surface form without changing semantics.
- PALACE training attacks: inject poisoned auxiliary (P,A,length) pairs (e.g., replace low true counts with inflated counts for a fraction of training data) to create over-estimating/backdoored auditors.
- Martingale/statistical attacks: exploit tokenizer ambiguity and selective inflation across samples to remain under detection thresholds.
Operational details: attacks require only small compute (seconds per sample) and exploit the fact that auditors use provider-supplied embeddings, answers, or auxiliary data.
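
The tokenizer-ambiguity item in the attack list above rests on a simple fact: a visible string does not by itself determine a unique token count. The sketch below illustrates this with a hypothetical toy vocabulary and a small dynamic program that finds the shortest and longest valid segmentations; it is not the paper's attack or any real tokenizer, only a demonstration of why seeing the text leaves the count underdetermined.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; real tokenizers are far larger.
VOCAB = {"t", "o", "k", "e", "n", "s", "to", "ken", "token"}


def segmentation_lengths(text: str, vocab: set) -> tuple:
    """Return (min, max) token counts over all splits of text into vocab entries."""

    @lru_cache(maxsize=None)
    def best(i: int):
        # (fewest, most) tokens needed to cover text[i:].
        if i == len(text):
            return (0, 0)
        fewest, most = float("inf"), float("-inf")
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in vocab:
                sub_fewest, sub_most = best(j)
                fewest = min(fewest, 1 + sub_fewest)
                most = max(most, 1 + sub_most)
        return (fewest, most)

    fewest, most = best(0)
    if fewest == float("inf"):
        raise ValueError("text cannot be segmented with this vocabulary")
    return int(fewest), int(most)


print(segmentation_lengths("tokens", VOCAB))  # (2, 6)
```

In this toy case the same six characters can be billed as 2 tokens ("token" + "s") or as 6 single-character tokens. An auditor who sees only the string, and not the tokenizer's actual output, cannot tell which segmentation was produced.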

Implications for AI Economics

Billing integrity risk: per-token pricing — widely used on commercial LLM platforms — can be systematically abused, producing large, persistent rent extraction from users and enterprises at scale.
Misaligned incentives and market distortion: providers have incentives to over-report usage while simultaneously having legitimate reasons to keep internals opaque; this structural misalignment undermines market trust and may raise demand for more verifiable alternatives.
Audit economics: effective auditing that prevents these attacks will be costlier (TEEs, cryptographic proofs, or third-party re-execution). Those costs may be borne by providers, auditors, or passed to users, changing the unit economics of LLM services and raising barriers to entry for smaller providers.
Contracting and regulation: commercial contracts and procurement for AI services should account for verification risk. Regulators or large cloud customers may demand technical attestations or independent re-execution for high-stakes billing, or move away from per-token pricing toward pricing models less sensitive to hidden computation (flat fees, outcome-based).
Competitive and privacy trade-offs: solutions that enable verifiability (e.g., re-execution, TEEs) can weaken IP/safety/privacy protections or increase costs/latency. Policymakers and firms must balance verifiability against these concerns.
Research agenda: the paper highlights the need for practical, low-cost cryptographic or hardware-backed proofs of inference, or auditing designs that rely on evidence outside provider control. Without such mechanisms, per-token billing will remain vulnerable to manipulation.

Summary conclusion: auditing schemes that rely on provider-controlled artifacts are fundamentally fragile. Restoring honest per-token billing at scale requires verifiable evidence that the provider cannot cheaply fabricate or adapt — a change with material economic, technical, and policy consequences for the LLM service market.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper gives quantitative, reproducible demonstrations that show large possible over-reporting under realistic technical assumptions and across multiple auditing frameworks, but the results depend on the chosen threat model, simulated/controlled experiments rather than field data from real commercial providers, and on assumptions about what providers can hide or control.
Methods Rigor: medium. Analysis combines principled reasoning about tokenization and audit-design with targeted experiments across three frameworks and multiple visibility settings; however, it does not evaluate a wide cross-section of real-world providers, models, or contractual/auditing practices, and conclusions hinge on specific adversary capabilities and tokenizers used in the demonstrations.
Sample: Experiments on three recent token-auditing frameworks using controlled/simulated provider implementations that hide model, tokenizer, and execution; a set of representative prompts/queries including high-cost 'reasoning' workloads; evaluation under multiple auditor-visibility regimes (fully hidden, visible reasoning string, etc.); measurements of reported vs true token counts and resulting billed amounts (examples include up to 1,469% inflation and ~50.85% undetectable inflation from tokenization ambiguity).
Themes: governance, adoption
Identification: Construct adversarial provider strategies and apply them to three existing token-auditing frameworks under varied auditor visibility scenarios; measure discrepancies between true token usage (as controlled by the experimenter) and the provider-reported counts to demonstrate systematic over-reporting.
Generalizability:
- Findings assume providers can hide model/tokenizer/execution; they may not apply where hardware attestation or independent re-execution is enforced.
- Only three auditing frameworks were evaluated; other or future frameworks might mitigate the attacks.
- Simulated provider behavior may differ from incentives, legal constraints, or operational risks real commercial providers face.
- Results depend on specific tokenizers and prompt distributions used; different tokenization choices could change magnitudes.
- Does not quantify prevalence of such cheating in the wild or the detectability under advanced legal/contractual audits.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Per-token billing is now the standard pricing model for commercial large language models (LLMs).	Adoption Rate	positive	high	pricing model (per-token adoption)	0.18
Providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies.	Governance And Regulation	negative	high	auditability (availability of independent evidence)	0.18
The audit therefore reduces to a consistency check on the provider's own reports.	Governance And Regulation	negative	high	audit method (reliance on provider-supplied reports)	0.18
We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate.	Governance And Regulation	negative	high	trust dependencies in auditing frameworks	0.18
We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts.	Firm Revenue	positive	high	ability to inflate billed token counts (systematic over-reporting)	n=3 0.18
In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection.	Firm Revenue	positive	high	percent over-reporting of hidden reasoning token usage	1,469% on average 0.18
At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query.	Consumer Welfare	positive	high	billed dollar amount for same query	roughly a $1,569 bill on the same query (from a $100 honest bill) 0.18
Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold.	Consumer Welfare	positive	high	percent over-reporting of billed tokens due to tokenization ambiguity	50.85% over-reporting 0.18
These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party.	Governance And Regulation	negative	high	robustness of auditing approaches that rely solely on provider-supplied evidence	0.18
Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.	Governance And Regulation	positive	high	requirements for restoring honest billing (types of verification needed)	0.03