The Commonplace
Direction, evidence grade, and study type are AI-generated labels (gpt-5-mini), not human-verified. Syntheses are LLM-written. "Tensions" are machine-detected candidates, not confirmed contradictions. A research-acceleration tool, not peer review.

Precomputing and hosting LLM KV caches can eliminate repeated prefill work and cut per-document prefill costs by tens of times: on Qwen3-4B a popular 3,774-token document served to millions could cost millions less to deliver, creating a lucrative provider-hosted CDN opportunity.

Can I Buy Your KV Cache?
Luoyuan Zhang · June 11, 2026
arXiv · descriptive · medium evidence · 8/10 relevance
Precomputing and provider-hosting KV caches for popular documents reproduces full-prefill behavior exactly under greedy decoding while cutting prefill compute by roughly 9x to 50x on the tested model, which implies large cost savings and CDN-like economics for agent workloads.

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

Summary

Main Finding

Loading a precomputed key–value (KV) cache for a token-identical document and continuing generation is token-exact (under greedy decoding) and far cheaper than re-prefilling the same document. Measured on Qwen3-4B (fp16, MPS), reuse is 8.6–49.7× cheaper in compute than a full prefill across contexts of 255–3774 tokens. A provider-hosted “prefill CDN” (publishers precompute and host KV; agents pay per cache-read) could therefore substantially reduce aggregate compute cost and support a profitable market layer.

Key Points

  • Correctness
    • Loading a saved KV and continuing is token-exact vs. from-scratch prefill + greedy decode (24/24 token match in the experiment).
    • Maximum absolute logit difference observed ≈ 0.02 (attributed to FP nondeterminism), argmax matches.
  • Measured compute savings (Qwen3-4B, fp16, Apple M1 Pro / MPS; median of 5 runs)
    • Context lengths and speedups (prefill time vs one reuse-step time):
      • 255 tokens: 0.67 s vs 0.078 s → 8.6×
      • 485 tokens: 1.29 s vs 0.093 s → 13.9×
      • 945 tokens: 2.65 s vs 0.122 s → 21.7×
      • 1888 tokens: 5.81 s vs 0.179 s → 32.4×
      • 3774 tokens: 14.71 s vs 0.296 s → 49.7×
  • Size and compression
    • Analytic fp16 KV size ≈ 0.148 MB/token (36-layer model example); 3774 tokens → ~557 MB.
    • Per-tensor symmetric int8 quantization halves artifact size (102.1 MB → 51.1 MB in a 692-token probe) but breaks token-exactness (only 16 of 32 tokens match before divergence in that probe).
  • Cost model and break-even
    • From-scratch per-call cost = C_prefill.
    • Reuse per-call cost = C_prefill / N + C_load; break-even reuse count N* = C_prefill / (C_prefill − C_load) ≈ 1 when C_load ≪ C_prefill → reuse pays off by the second read.
    • With storage price σ, transfer/egress price β, artifact size s, and hosting horizon T, the amortized per-load cost floor rises by βs + σsT/N; heavy reuse (large N) and compression (small s) lower these terms.
  • Economics & deployment
    • Shipping artifacts to consumers fails economically under lossless transfer because egress costs (commodity $0.09/GB) can exceed saved prefill cost; hosting provider-side removes egress.
    • Example (serving one hot 3774-token doc to N = 80M agents):
      • Re-prefill today: ~$1,509,600
      • Host + priced at a 0.1 × p_in cache-read tariff (one tenth of the input price): $150,960 (10× user discount compared to re-prefill)
      • Host + physical (measured C_reuse / C_prefill = 1/49.7): $30,370 (49.7×)
      • Ship raw fp16 (egress): ~$3.9M (loses vs re-prefill)
      • Ship int8 (lossy): ~$2.0M (loses; inexact)
    • The commonly seen 0.1 × p_in cache-read tariff gives users a ~10× discount; measured compute ratios (8.6–49.7×) imply large provider margin potential on popular documents.
  • Scope and limitations
    • Study restricted to exact reuse of a single shared prefix (no cross-fragment fusion).
    • Exact reuse requires same model and dtype between publisher and consumer (model-bound).
    • Measurements from one device/model; absolute numbers are device/model-dependent.
    • Lossy compression can reduce size but forfeits exactness; lossless compression of KV is an open problem.
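
The speedups and the analytic KV size listed above can be re-derived with a few lines of arithmetic. The sketch below is illustrative, not the paper's code: the timings are copied from the list, and the KV-head count (8) and head dimension (128) are assumptions about the Qwen3-4B configuration that this summary does not state. Only the 36 layers and fp16 come from the text.

```python
# (tokens, prefill seconds, one reuse-step seconds), copied from the Key Points above.
MEASURED = [
    (255, 0.67, 0.078),
    (485, 1.29, 0.093),
    (945, 2.65, 0.122),
    (1888, 5.81, 0.179),
    (3774, 14.71, 0.296),
]


def speedups(rows):
    """Return {tokens: prefill_time / reuse_time} for each measured context length."""
    return {tokens: prefill / reuse for tokens, prefill, reuse in rows}


def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Analytic KV size per token: one K and one V vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# ASSUMPTION: kv_heads=8 and head_dim=128 are not given in the summary.
# layers=36 and 2 bytes per element (fp16) are taken from the text.
BYTES_PER_TOKEN = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128, bytes_per_elem=2)

if __name__ == "__main__":
    for tokens, ratio in speedups(MEASURED).items():
        print(f"{tokens:>5} tokens: {ratio:.1f}x")
    print(f"{BYTES_PER_TOKEN / 1e6:.3f} MB per token")
    print(f"{3774 * BYTES_PER_TOKEN / 1e6:.0f} MB for 3774 tokens")
```

Because the listed timings are already rounded, a recomputed ratio can differ from the reported one in the last digit (the 1888-token row is the one case here). Under the assumed head configuration the per-token size lands within about 0.001 MB of the reported figure, and the 3774-token total within 1 MB.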

Data & Methods

  • Model and hardware
    • Qwen3-4B, fp16; measurements on Apple M1 Pro (MPS).
  • Algorithms (implemented/open-sourced)
    • SAVEKV: publisher runs one forward pass on context c = x_{1:L}, serializes per-layer (K, V) with metadata (model id, dtype, L).
    • GENERATEFROMKV: consumer loads artifact, asserts model/dtype match, assigns correct position ids and full attention mask, decodes continuations using the loaded KV (no re-prefill).
  • Experiments
    • Correctness: greedy decoding token match (24/24) and logits-level check (max abs diff ≈ 0.02).
    • Compute timing: median of 5 runs for full prefill vs single continuation step over resident KV for contexts 255–3774 tokens (disk I/O excluded, resident KV).
    • Artifact size: analytic fp16 size and an empirical int8 probe for compression vs exactness trade-off.
  • Cost modeling
    • Simple per-call formulas and inclusion of storage (σ), egress (β), artifact size (s), hosting horizon T.
    • Worked example computing costs for large-scale serving (80M consumers) combining measured compute ratios and typical egress prices.
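
The SAVEKV / GENERATEFROMKV contract described above (serialize per-layer K and V with metadata, then refuse to load on a model or dtype mismatch) can be sketched with the standard library alone. Everything here is hypothetical: the function names `save_kv` and `load_kv`, the JSON layout, and the toy nested-list tensors are illustration only and are not the paper's open-sourced implementation, which would store real tensors in a binary format and run an actual model forward pass.

```python
import json
import os
import tempfile


def save_kv(path, kv_layers, model_id, dtype):
    """Publisher side: write per-layer (K, V) plus the metadata a consumer must check."""
    length = len(kv_layers[0][0])  # number of cached token positions (L)
    artifact = {
        "meta": {"model_id": model_id, "dtype": dtype, "length": length},
        "layers": [{"k": k, "v": v} for k, v in kv_layers],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(artifact, f)


def load_kv(path, model_id, dtype):
    """Consumer side: reject the artifact unless model id and dtype match exactly."""
    with open(path, encoding="utf-8") as f:
        artifact = json.load(f)
    meta = artifact["meta"]
    if meta["model_id"] != model_id or meta["dtype"] != dtype:
        raise ValueError(
            f"KV artifact built for {meta['model_id']}/{meta['dtype']}, "
            f"not {model_id}/{dtype}"
        )
    kv_layers = [(layer["k"], layer["v"]) for layer in artifact["layers"]]
    # Continuation tokens take position ids starting at `length`,
    # immediately after the cached prefix.
    return kv_layers, meta["length"]


def demo():
    """Round-trip a toy one-layer cache and confirm a dtype mismatch is refused."""
    toy = [([[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]])]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "doc.kv.json")
        save_kv(path, toy, "Qwen3-4B", "fp16")
        loaded, next_position = load_kv(path, "Qwen3-4B", "fp16")
        try:
            load_kv(path, "Qwen3-4B", "int8")
            rejected = False
        except ValueError:
            rejected = True
    return loaded == toy, next_position, rejected
```

The metadata check is the point of the sketch: exact reuse is model-bound, so a consumer that loads a cache built under a different model or dtype would silently lose the token-exactness guarantee.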

Implications for AI Economics

  • New market layer: a “prefill CDN” where publishers publish KV artifacts and agents pay per cache-read (hosted by providers) is economically attractive in ecosystems that standardize on model + dtype.
  • Provider economics and margins
    • Because C_reuse ≪ C_prefill, providers can price cache-reads below the user’s re-prefill cost but above their marginal C_reuse and still capture large margins on popular content.
    • Existing cache-read tariffs (e.g., 0.1 × p_in) are within the physically measured envelope: the measured compute savings exceed 10× at 485 tokens and above (13.9–49.7×), which supports a 10× user discount there while leaving room for provider profit; at 255 tokens the measured saving is 8.6×.
  • Incentives & market structure
    • Publishers can monetize one-time prefill compute; providers can monetize hosting/fast cache-reads.
    • Wide deployment requires standardized model identifiers, dtype agreements, provenance and invalidation mechanisms (when models update), and an agent-payment rail for cross-party settlement.
  • Open problems that matter economically
    • Lossless or high-quality selective compression that preserves token-exactness (or guarantees bounded divergence) — enables shipping artifacts or reduces hosting storage/egress exposure.
    • Cross-model / cross-dtype interoperability or migration strategies (how to handle model upgrades and invalidate or reissue artifacts).
    • Pricing, billing, and provenance standards for cross-party KV sharing and payments.
    • Fusion methods that recover much of the benefit for non-prefix reuse while bounding accuracy loss — expands the addressable reusable content beyond perfect prefixes.
  • Deployment guidance
    • Host KV in provider datacenters (avoid per-load egress) and bill reads; heavy reuse and longer documents yield larger absolute savings.
    • Expect immediate ROI for popular documents (N ≈ 2 suffices) and increasing returns as reuse scales.
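
The break-even rule and the pricing window described above can be checked numerically. This is a minimal sketch under stated assumptions: the function and constant names are hypothetical, the re-prefill total, read count, 49.7× ratio, 0.1 tariff, and $0.09/GB egress price are taken from the summary, and the artifact size of 0.557 GB is the reported ~557 MB figure.

```python
import math


def break_even_reads(c_prefill, c_load):
    """Smallest integer N with c_prefill / N + c_load < c_prefill."""
    if c_load >= c_prefill:
        raise ValueError("reuse never pays off when a load costs as much as a prefill")
    threshold = c_prefill / (c_prefill - c_load)  # N* from the cost model
    return math.floor(threshold) + 1


def reuse_cost_per_read(c_prefill, c_load, n, beta=0.0, sigma=0.0, size=0.0, horizon=0.0):
    """Amortized per-read cost: c_prefill/N + c_load + beta*s + sigma*s*T/N."""
    return c_prefill / n + c_load + beta * size + sigma * size * horizon / n


READS = 80_000_000                      # agents reading the hot document
RE_PREFILL_TOTAL = 1_509_600.0          # USD, from the worked example
C_PREFILL = RE_PREFILL_TOTAL / READS    # per-read cost of a from-scratch prefill
C_REUSE = C_PREFILL / 49.7              # measured compute ratio at 3774 tokens
PRICE = 0.1 * C_PREFILL                 # cache-read tariff at one tenth of the input price


def totals():
    """Aggregate USD cost for each hosted scenario in the worked example."""
    return {
        "re_prefill": C_PREFILL * READS,
        "hosted_tariff": PRICE * READS,
        "hosted_physical": C_REUSE * READS,
    }


if __name__ == "__main__":
    # Seconds stand in for cost here; only the ratio matters for break-even.
    print("break-even reads:", break_even_reads(14.71, 0.296))
    for name, usd in totals().items():
        print(f"{name}: ${usd:,.0f}")
    shipped = reuse_cost_per_read(C_PREFILL, C_REUSE, READS, beta=0.09, size=0.557)
    print("shipping beats re-prefill:", shipped < C_PREFILL)
```

With the measured 3774-token timings the break-even count is 2 reads, the tariff sits between the reuse cost and the re-prefill cost, and adding per-load egress for a ~0.557 GB artifact pushes the per-read cost above a fresh prefill, which is the reason the summary gives for hosting the KV provider-side.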

Short summary: publishing and hosting precomputed KV caches (prefill CDN) is technically feasible, token-exact in the prefix case, and economically compelling for popular, long documents when the KV is hosted provider-side. Key engineering and market work remains around compression, standardization, model upgrade handling, and cross-party payments.

Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides direct measurements (token-exactness checks and compute cost comparisons) on a real model (Qwen3-4B) and a concrete long-document example, giving credible engineering evidence for large compute savings; however it does not empirically establish market-level economic outcomes, lacks broad cross-model/decoder robustness checks, real-world deployment results, or sensitivity to pricing/egress regimes.

Methods Rigor: medium. Experiments include deterministic token/logit-level comparisons and measured compute costs, which are appropriate for the technical claim, but the evaluation appears limited to a single model and decoding regime (greedy/logit equality), without statistical reporting, robustness across model sizes/architectures, or deployment-scale traces, so methodology is sound for a proof-of-concept but not exhaustive.

Sample: Technical experiments on Qwen3-4B showing token-exact equivalence (24/24 greedy tokens and logits) between prefill and loading a precomputed KV cache, measured reuse compute savings of 9–50x depending on sequence length, and an illustrative cost calculation for serving one 3774-token "hot" document to 80M agents (estimated $1.5M re-prefill vs $0.03M reuse).

Themes: adoption, innovation

Generalizability:
  • Evaluated on a single model (Qwen3-4B); results may differ for other architectures, sizes, or attention implementations.
  • Token-exactness shown for greedy decoding; sampling/temperature/stochastic decoding may break exact equivalence.
  • KV size, compressibility, and read costs depend on quantization, model implementation, and hardware; shipping vs host economics vary by provider and region.
  • Assumes publishers are willing/able to host precomputed KVs and that providers will support cross-agent loading and payment flows; institutional and legal constraints (copyright, privacy) could limit adoption.
  • Multi-turn, stateful interactions and dynamic prompts (user-specific context) reduce opportunities for reusable prefill.
  • Latency, CDN design, and security (poisoned or private KV data) were not empirically explored.

Claims (8)

Columns: Claim · Direction · Outcome · Confidence & Evidence · Details

1. Right now, across the world, AI agents recompute a document's KV cache from scratch each time a document is read, so many agents redundantly re-run the compute-intensive prefill step on identical text. [Organizational Efficiency]
  • Direction: negative
  • Outcome: redundant_compute_work (re-running prefill across agents)
  • Confidence & Evidence: reading fidelity high; study strength low
  • Details: 0.09

2. Loading a publisher-precomputed KV cache and continuing is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. [Output Quality]
  • Direction: null_result
  • Outcome: output_equivalence (token-level and logits-level match between reuse and full prefill)
  • Confidence & Evidence: reading fidelity high; study strength medium
  • Details: n=24; "24/24 greedy tokens, and at the logits level"; 0.18

3. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2). [Organizational Efficiency]
  • Direction: positive
  • Outcome: compute_cost (reuse vs prefill)
  • Confidence & Evidence: reading fidelity high; study strength medium
  • Details: "9-50x cheaper in compute than prefill"; 0.18

4. Shipping the KV (client-side delivery) fails because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. [Adoption Rate]
  • Direction: negative
  • Outcome: egress_cost_vs_prefill_savings
  • Confidence & Evidence: reading fidelity medium; study strength medium
  • Details: "per-load egress costs more than the prefill it saves"; 0.11

5. Hosting the precomputed KV provider-side (removing egress) enables reuse without the egress cost, analogous to production prompt-caching. [Adoption Rate]
  • Direction: positive
  • Outcome: egress_cost_elimination (by provider-side hosting)
  • Confidence & Evidence: reading fidelity medium; study strength medium
  • Details: "removes egress entirely"; 0.11

6. Serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). [Firm Revenue]
  • Direction: positive
  • Outcome: aggregate_cost_savings (prefill vs reuse) for a high-demand document
  • Confidence & Evidence: reading fidelity high; study strength medium
  • Details: n=80000000; "$1.5M to re-prefill but only $0.03M of reuse compute (49.7x less)"; 0.18

7. APIs that charge a 0.1x cache-read tariff pass a 10x discount to users within this measured envelope; the measured ~50x compute saving exceeds that 10x, leaving provider margin measured in millions of dollars per popular document. [Firm Revenue]
  • Direction: mixed
  • Outcome: user_price_discount and provider_margin (difference between compute saving and tariff pass-through)
  • Confidence & Evidence: reading fidelity medium; study strength medium
  • Details: "0.1x cache-read tariff -> 10x discount; measured ~50x compute saving"; 0.11

8. Lossless KV compression and a cross-party payment layer remain open problems for implementing an agent-native prefill CDN. [Adoption Rate]
  • Direction: null_result
  • Outcome: technical_feasibility (lossless compression) and payments_infrastructure
  • Confidence & Evidence: reading fidelity high; study strength speculative
  • Details: "lossless KV compression and a cross-party payment layer are open problems"; 0.03

Notes