Can I Buy Your KV Cache? — The Commonplace

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

Summary

Main Finding

Loading a precomputed key–value (KV) cache for a token-identical document and continuing generation is token-exact (under greedy decoding) and far cheaper than re-prefilling the same document. Measured on Qwen3-4B (fp16, MPS), reuse is 8.6–49.7× cheaper in compute than a full prefill across contexts of 255–3774 tokens. A provider-hosted “prefill CDN” (publishers precompute and host KV; agents pay per cache-read) can therefore substantially reduce aggregate compute cost and support a profitable market layer.

Key Points

Correctness
- Loading a saved KV and continuing is token-exact vs. from-scratch prefill + greedy decode (24/24 token match in the experiment).
- Maximum absolute logit difference observed ≈ 0.02 (attributed to floating-point nondeterminism); the argmax matches at every step.
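The logits-level check described above can be sketched as a small comparison routine. This is a minimal illustration, not the study's code: the function name `compare_logits`, the `tolerance` of 0.05, and the sample vectors are all hypothetical and chosen only so the maximum difference lands near the reported 0.02.

```python
def compare_logits(reference, reused, tolerance=0.05):
    """Return (max_abs_diff, argmax_matches, within_tolerance) for two logit vectors."""
    if len(reference) != len(reused):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(abs(a - b) for a, b in zip(reference, reused))
    argmax_ref = max(range(len(reference)), key=reference.__getitem__)
    argmax_reused = max(range(len(reused)), key=reused.__getitem__)
    return max_abs_diff, argmax_ref == argmax_reused, max_abs_diff <= tolerance


# Illustrative values only, NOT measurements from the study.
prefill_logits = [1.20, 4.75, -0.30, 2.10]
reused_logits = [1.21, 4.73, -0.31, 2.10]

diff, same_argmax, within = compare_logits(prefill_logits, reused_logits)
print(f"max |diff| = {diff:.3f}, argmax match = {same_argmax}, within tolerance = {within}")
```

Under greedy decoding only the argmax decides the next token, so a small logit difference that leaves the argmax unchanged does not alter the generated text.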
Measured compute savings (Qwen3-4B, fp16, Apple M1 Pro / MPS; median of 5 runs)
- Context lengths and speedups (prefill time vs one reuse-step time):
  - 255 tokens: 0.67 s vs 0.078 s → 8.6×
  - 485 tokens: 1.29 s vs 0.093 s → 13.9×
  - 945 tokens: 2.65 s vs 0.122 s → 21.7×
  - 1888 tokens: 5.81 s vs 0.179 s → 32.4×
  - 3774 tokens: 14.71 s vs 0.296 s → 49.7×
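The speedups are simply the ratio of the two timings, and can be recomputed from the rounded values listed above. The sketch below does that; because the listed times are themselves rounded, a recomputed ratio can differ from the reported one in the last digit (the 1888-token row is one such case).

```python
# (context length in tokens, prefill seconds, reuse-step seconds, reported speedup)
MEASUREMENTS = [
    (255, 0.67, 0.078, 8.6),
    (485, 1.29, 0.093, 13.9),
    (945, 2.65, 0.122, 21.7),
    (1888, 5.81, 0.179, 32.4),
    (3774, 14.71, 0.296, 49.7),
]


def speedup(prefill_s, reuse_s):
    """Ratio of full-prefill time to one reuse-step time."""
    return prefill_s / reuse_s


ratios = [speedup(p, r) for _, p, r, _ in MEASUREMENTS]

for (length, p, r, reported), ratio in zip(MEASUREMENTS, ratios):
    print(f"{length:>5} tokens: {p:.2f} s / {r:.3f} s = {ratio:.2f}x (reported {reported}x)")
```

The ratio grows with every doubling of context length, consistent with the statement that the gap widens as documents get longer.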
Size and compression
- Analytic fp16 KV size ≈ 0.148 MB/token (36-layer model example); 3774 tokens → ~557 MB.
- Per-tensor symmetric int8 halves artifact size (e.g., 692-token example 102.1 → 51.1 MB) but breaks token-exactness (only 16/32 tokens match before divergence in that probe).
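The analytic size follows from counting one key and one value vector per layer and KV head. The sketch below reproduces the arithmetic under stated assumptions: the 36 layers come from the text, while 8 KV heads and a head dimension of 128 are assumed values for this model that are not given in this summary, and MB is taken as 10^6 bytes.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def artifact_mb(num_tokens, per_token_bytes):
    """Artifact size in decimal megabytes (10^6 bytes), ignoring metadata."""
    return num_tokens * per_token_bytes / 1e6


# 36 layers is stated in the text; 8 KV heads and head_dim=128 are ASSUMED.
FP16_PER_TOKEN = kv_bytes_per_token(36, 8, 128, bytes_per_value=2)
INT8_PER_TOKEN = kv_bytes_per_token(36, 8, 128, bytes_per_value=1)

print(f"fp16 per token: {FP16_PER_TOKEN / 1e6:.4f} MB")
print(f"3774 tokens, fp16: {artifact_mb(3774, FP16_PER_TOKEN):.1f} MB")
print(f"692 tokens, fp16: {artifact_mb(692, FP16_PER_TOKEN):.1f} MB")
print(f"692 tokens, int8: {artifact_mb(692, INT8_PER_TOKEN):.1f} MB")
```

Under these assumptions the results land within about 1 MB of the figures quoted above; the small remainder is plausibly rounding and serialization metadata, which this sketch does not model.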
Cost model and break-even
- From-scratch per-call cost = C_prefill.
- Reuse per-call cost = C_prefill / N + C_load; break-even reuse count N* = C_prefill / (C_prefill − C_load) ≈ 1 when C_load ≪ C_prefill → reuse pays off by the second read.
- Including storage (σ, per unit size per unit time), transfer/egress (β, per unit size), artifact size s, and hosting horizon T, the amortized per-load floor increases by βs + σsT/N; heavy reuse and compression lower these terms.
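The cost model above can be written out directly. The following is a sketch under stated assumptions: the function and parameter names are illustrative, costs are in arbitrary units, and the example plugs in the measured 3774-token timings (14.71 s prefill, 0.296 s reuse step) as stand-ins for C_prefill and C_load.

```python
def from_scratch_cost(c_prefill):
    """Per-call cost when every reader re-runs prefill."""
    return c_prefill


def reuse_cost(c_prefill, c_load, n, beta=0.0, sigma=0.0, size=0.0, horizon=0.0):
    """Per-call cost when one prefill is shared by n readers.

    c_prefill / n            one-time prefill amortized over n reads
    c_load                   per-read cost of loading and continuing
    beta * size              per-read egress (zero when hosted provider-side)
    sigma * size * horizon/n storage over the hosting horizon, amortized
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return c_prefill / n + c_load + beta * size + sigma * size * horizon / n


def break_even_reads(c_prefill, c_load):
    """N* = C_prefill / (C_prefill - C_load), ignoring storage and egress."""
    if c_load >= c_prefill:
        raise ValueError("reuse never breaks even when C_load >= C_prefill")
    return c_prefill / (c_prefill - c_load)


# Measured 3774-token timings used as illustrative cost units.
C_PREFILL, C_LOAD = 14.71, 0.296
n_star = break_even_reads(C_PREFILL, C_LOAD)
print(f"N* = {n_star:.3f}")
print(f"per-call cost at N=2: {reuse_cost(C_PREFILL, C_LOAD, 2):.3f} vs {from_scratch_cost(C_PREFILL):.2f}")
```

Because N* sits just above 1, a single reader gains nothing, but the second read already brings the per-call cost well below a from-scratch prefill.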
Economics & deployment
- Shipping artifacts to consumers fails economically under lossless transfer because egress costs (commodity $0.09/GB) can exceed saved prefill cost; hosting provider-side removes egress.
- Example (serving one hot 3774-token doc to N = 80M agents):
  - Re-prefill today: ~$1,509,600
  - Host + priced at a 0.1 × p_in cache-read tariff: $150,960 (10× user discount compared to re-prefill)
  - Host + physical (measured C_reuse/C_prefill = 1/49.7): $30,370 (49.7×)
  - Ship raw fp16 (egress): ~$3.9M (loses vs re-prefill)
  - Ship int8 (lossy): ~$2.0M (loses; inexact)
- The commonly seen 0.1 × p_in cache-read tariff (one tenth of the input price p_in) gives users a ~10× discount; measured compute ratios (8.6–49.7×) imply large provider margin potential on popular documents.
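The worked example above can be retraced from its own inputs. In the sketch below the re-prefill total, the 0.1 tariff fraction, the 49.7 ratio, the $0.09/GB egress price and the ~557 MB artifact size come from the text; treating 557 MB as 0.557 decimal GB and int8 as exactly half the fp16 size are assumptions, so the shipping totals are approximate.

```python
N_AGENTS = 80_000_000
REPREFILL_TOTAL = 1_509_600.0   # USD, re-prefill total from the example
TARIFF_FRACTION = 0.1           # cache-read price as a fraction of input price
MEASURED_RATIO = 49.7           # C_prefill / C_reuse at 3774 tokens
EGRESS_USD_PER_GB = 0.09        # commodity egress price from the text
ARTIFACT_GB_FP16 = 0.557        # ~557 MB; decimal GB is an ASSUMPTION

per_read_reprefill = REPREFILL_TOTAL / N_AGENTS
tariff_total = REPREFILL_TOTAL * TARIFF_FRACTION
physical_total = REPREFILL_TOTAL / MEASURED_RATIO
ship_fp16_total = N_AGENTS * ARTIFACT_GB_FP16 * EGRESS_USD_PER_GB
ship_int8_total = ship_fp16_total / 2  # ASSUMES int8 is exactly half of fp16

print(f"re-prefill per read:  ${per_read_reprefill:.5f}")
print(f"re-prefill total:     ${REPREFILL_TOTAL:,.0f}")
print(f"hosted, 0.1 tariff:   ${tariff_total:,.0f}")
print(f"hosted, physical:     ${physical_total:,.0f}")
print(f"ship fp16 (egress):   ${ship_fp16_total:,.0f}")
print(f"ship int8 (egress):   ${ship_int8_total:,.0f}")
```

The hosted totals match the listed figures to within rounding. The shipping totals come out in the same range as the ~$3.9M and ~$2.0M quoted; the exact values depend on unit and rounding conventions that this summary does not state. Either way, both shipping options exceed the re-prefill total.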
Scope and limitations
- Study restricted to exact reuse of a single shared prefix (no cross-fragment fusion).
- Exact reuse requires same model and dtype between publisher and consumer (model-bound).
- Measurements from one device/model; absolute numbers are device/model-dependent.
- Lossy compression can reduce size but forfeits exactness; lossless compression of KV is an open problem.

Data & Methods

Model and hardware
- Qwen3-4B, fp16; measurements on Apple M1 Pro (MPS).
Algorithms (implemented/open-sourced)
- SAVEKV: publisher runs one forward pass on context c = x1:L, serializes per-layer (K, V) with metadata (model id, dtype, L).
- GENERATEFROMKV: consumer loads artifact, asserts model/dtype match, assigns correct position ids and full attention mask, decodes continuations using the loaded KV (no re-prefill).
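The publish-then-load flow can be illustrated with a toy, single-layer attention model in pure Python. This is not the released SAVEKV/GENERATEFROMKV code: the weights, vocabulary, JSON artifact layout, and the names `save_kv`, `generate_from_kv` and `from_scratch` are all hypothetical, and the toy omits positional encoding and multiple layers. It only shows the shape of the idea: serialize K and V with metadata, check the metadata on load, then attend over the loaded cache instead of recomputing it.

```python
import json
import math
import random

D = 4  # toy hidden size
_rng = random.Random(0)


def _matrix(rows, cols):
    return [[_rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


W_Q, W_K, W_V = _matrix(D, D), _matrix(D, D), _matrix(D, D)
EMBED = _matrix(16, D)  # toy vocabulary of 16 token ids


def _project(x, w):
    return [sum(x[i] * w[i][j] for i in range(D)) for j in range(D)]


def _kv(tokens):
    keys = [_project(EMBED[t], W_K) for t in tokens]
    values = [_project(EMBED[t], W_V) for t in tokens]
    return keys, values


def _attend(query_token, keys, values):
    q = _project(EMBED[query_token], W_Q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(D)]


def save_kv(tokens, model_id="toy-model", dtype="float64"):
    """Publisher side: one pass over the context, serialized with metadata."""
    keys, values = _kv(tokens)
    return json.dumps({"model_id": model_id, "dtype": dtype,
                       "L": len(tokens), "K": keys, "V": values})


def generate_from_kv(artifact, next_token, model_id="toy-model", dtype="float64"):
    """Consumer side: validate metadata, then attend over the loaded cache."""
    art = json.loads(artifact)
    if art["model_id"] != model_id or art["dtype"] != dtype:
        raise ValueError("artifact is bound to a different model or dtype")
    if len(art["K"]) != art["L"] or len(art["V"]) != art["L"]:
        raise ValueError("artifact length does not match its metadata")
    k_new, v_new = _kv([next_token])  # only the new position is computed
    return _attend(next_token, art["K"] + k_new, art["V"] + v_new)


def from_scratch(tokens, next_token):
    """Baseline: recompute K and V for the whole context, then attend."""
    keys, values = _kv(tokens + [next_token])
    return _attend(next_token, keys, values)


document = [3, 7, 1, 12, 5]
artifact = save_kv(document)
print(generate_from_kv(artifact, 9) == from_scratch(document, 9))
```

In a real multi-layer causal model, the K and V entries for a prefix depend only on that prefix, which is why loading them for an identical prefix can reproduce the from-scratch computation.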
Experiments
- Correctness: greedy decoding token match (24/24) and logits-level check (max abs diff ≈ 0.02).
- Compute timing: median of 5 runs for full prefill vs single continuation step over resident KV for contexts 255–3774 tokens (disk I/O excluded, resident KV).
- Artifact size: analytic fp16 size and an empirical int8 probe for compression vs exactness trade-off.
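For the int8 probe, per-tensor symmetric quantization means one scale factor for the whole tensor with zero mapped to zero. The sketch below shows the round trip on a handful of made-up values; it is an assumption-laden illustration of the general technique, not the probe's code, and the summary does not specify the probe's tensor layout or rounding mode.

```python
def quantize_int8(values):
    """Per-tensor symmetric int8: a single scale, zero maps to zero."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map int8 codes back to floats; rounding error is not recoverable."""
    return [q * scale for q in quantized]


# Illustrative values standing in for one flattened K or V tensor.
tensor = [0.013, -0.42, 1.70, 0.0051, -2.54, 0.33]

quantized, scale = quantize_int8(tensor)
restored = dequantize_int8(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(tensor, restored))

print(f"scale = {scale:.4f}, codes = {quantized}")
print(f"max round-trip error = {max_error:.4f}")
```

Each value stores in one byte instead of two, which is where the halving comes from, but the restored tensor is no longer bit-identical to the original. Errors of this kind feed into every later attention step, which is one plausible route to the divergence reported in the probe.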
Cost modeling
- Simple per-call formulas and inclusion of storage (σ), egress (β), artifact size (s), hosting horizon T.
- Worked example computing costs for large-scale serving (80M consumers) combining measured compute ratios and typical egress prices.

Implications for AI Economics

New market layer: a “prefill CDN” where publishers publish KV artifacts and agents pay per cache-read (hosted by providers) is economically attractive in ecosystems that standardize on model + dtype.
Provider economics and margins
- Because C_reuse ≪ C_prefill, providers can price cache-reads below the user’s re-prefill cost but above their marginal C_reuse and still capture large margins on popular content.
- Existing cache-read tariffs (e.g., 0.1 × p_in) are within the physically measured envelope; the experimentally observed 8.6–49.7× compute savings justify at least a 10× user discount while leaving room for provider profit.
Incentives & market structure
- Publishers can monetize one-time prefill compute; providers can monetize hosting/fast cache-reads.
- Wide deployment requires standardized model identifiers, dtype agreements, provenance and invalidation mechanisms (when models update), and an agent-payment rail for cross-party settlement.
Open problems that matter economically
- Lossless or high-quality selective compression that preserves token-exactness (or guarantees bounded divergence) — enables shipping artifacts or reduces hosting storage/egress exposure.
- Cross-model / cross-dtype interoperability or migration strategies (how to handle model upgrades and invalidate or reissue artifacts).
- Pricing, billing, and provenance standards for cross-party KV sharing and payments.
- Fusion methods that recover much of the benefit for non-prefix reuse while bounding accuracy loss — expands the addressable reusable content beyond perfect prefixes.
Deployment guidance
- Host KV in provider datacenters (avoid per-load egress) and bill reads; heavy reuse and longer documents yield larger absolute savings.
- Expect immediate ROI for popular documents (N ≈ 2 suffices) and increasing returns as reuse scales.

Short summary: publishing and hosting precomputed KV caches (prefill CDN) is technically feasible, token-exact in the prefix case, and economically compelling for popular, long documents when the KV is hosted provider-side. Key engineering and market work remains around compression, standardization, model upgrade handling, and cross-party payments.

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides direct measurements (token-exactness checks and compute cost comparisons) on a real model (Qwen3-4B) and a concrete long-document example, giving credible engineering evidence for large compute savings; however, it does not empirically establish market-level economic outcomes, and it lacks broad cross-model/decoder robustness checks, real-world deployment results, and sensitivity analysis for pricing/egress regimes.
Methods Rigor: medium. Experiments include deterministic token/logit-level comparisons and measured compute costs, which are appropriate for the technical claim, but the evaluation appears limited to a single model and decoding regime (greedy/logit equality), without statistical reporting, robustness across model sizes/architectures, or deployment-scale traces, so the methodology is sound for a proof of concept but not exhaustive.
Sample: Technical experiments on Qwen3-4B showing token-exact equivalence (24/24 greedy tokens and logits) between prefill and loading a precomputed KV cache, measured reuse compute savings of 9–50x depending on sequence length, and an illustrative cost calculation for serving one 3774-token "hot" document to 80M agents (estimated $1.5M re-prefill vs $0.03M reuse).
Themes: adoption, innovation
Generalizability
- Evaluated on a single model (Qwen3-4B); results may differ for other architectures, sizes, or attention implementations.
- Token-exactness shown for greedy decoding; sampling/temperature/stochastic decoding may break exact equivalence.
- KV size, compressibility, and read costs depend on quantization, model implementation, and hardware; shipping vs host economics vary by provider and region.
- Assumes publishers are willing/able to host precomputed KVs and that providers will support cross-agent loading and payment flows; institutional and legal constraints (copyright, privacy) could limit adoption.
- Multi-turn, stateful interactions and dynamic prompts (user-specific context) reduce opportunities for reusable prefill.
- Latency, CDN design, and security (poisoned or private KV data) were not empirically explored.

Claims (8)

Claim	Direction	Outcome	Confidence & Evidence	Details
Right now, across the world, AI agents recompute a document's KV cache from scratch each time a document is read, so many agents redundantly re-run the compute-intensive prefill step on identical text. Organizational Efficiency	negative	redundant_compute_work (re-running prefill across agents)	Reading fidelity high Study strength low	0.09
Loading a publisher-precomputed KV cache and continuing is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. Output Quality	null_result	output_equivalence (token-level and logits-level match between reuse and full prefill)	Reading fidelity high Study strength medium	n=24 24/24 greedy tokens, and at the logits level 0.18
On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2). Organizational Efficiency	positive	compute_cost (reuse vs prefill)	Reading fidelity high Study strength medium	9-50x cheaper in compute than prefill 0.18
Shipping the KV (client-side delivery) fails because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Adoption Rate	negative	egress_cost_vs_prefill_savings	Reading fidelity medium Study strength medium	per-load egress costs more than the prefill it saves 0.11
Hosting the precomputed KV provider-side (removing egress) enables reuse without the egress cost, analogous to production prompt-caching. Adoption Rate	positive	egress_cost_elimination (by provider-side hosting)	Reading fidelity medium Study strength medium	removes egress entirely 0.11
Serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). Firm Revenue	positive	aggregate_cost_savings (prefill vs reuse) for a high-demand document	Reading fidelity high Study strength medium	n=80000000 $1.5M to re-prefill but only $0.03M of reuse compute (49.7x less) 0.18
APIs that charge a 0.1x cache-read tariff pass a 10x discount to users within this measured envelope; the measured ~50x compute saving exceeds that 10x, leaving provider margin measured in millions of dollars per popular document. Firm Revenue	mixed	user_price_discount and provider_margin (difference between compute saving and tariff pass-through)	Reading fidelity medium Study strength medium	0.1x cache-read tariff -> 10x discount; measured ~50x compute saving 0.11
Lossless KV compression and a cross-party payment layer remain open problems for implementing an agent-native prefill CDN. Adoption Rate	null_result	technical_feasibility (lossless compression) and payments_infrastructure	Reading fidelity high Study strength speculative	lossless KV compression and a cross-party payment layer are open problems 0.03

Precomputing and hosting LLM KV caches can eliminate repeated prefill work and cut per-document prefill costs by tens of times: on Qwen3-4B a popular 3,774-token document served to millions could cost millions less to deliver, creating a lucrative provider-hosted CDN opportunity.