Precomputing and hosting LLM KV caches can eliminate repeated prefill work and cut per-document prefill compute by roughly 9-50x: on Qwen3-4B, serving one popular 3,774-token document to 80M agents costs about $1.5M to re-prefill but only about $0.03M in reuse compute, which points to a provider-hosted "prefill CDN" opportunity.
Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
Summary
Main Finding
Loading a precomputed key–value (KV) cache for a token-identical document and continuing generation is token-exact (under greedy decoding) and far cheaper than re-prefilling the same document. Measured on Qwen3-4B (fp16, MPS), reuse is 8.6–49.7× cheaper in compute than a full prefill across contexts of 255–3774 tokens. A provider-hosted “prefill CDN”, in which publishers precompute and host KV and agents pay per cache-read, could therefore substantially reduce aggregate compute cost and create a profitable market layer.
Key Points
- Correctness
  - Loading a saved KV cache and continuing is token-exact vs. from-scratch prefill plus greedy decode (24/24 tokens match in the experiment).
  - Maximum observed absolute logit difference ≈ 0.02 (attributed to floating-point nondeterminism); the argmax matches.
- Measured compute savings (Qwen3-4B, fp16, Apple M1 Pro / MPS; median of 5 runs)
  - Context lengths and speedups (prefill time vs. one reuse-step time):
    - 255 tokens: 0.67 s vs 0.078 s → 8.6×
    - 485 tokens: 1.29 s vs 0.093 s → 13.9×
    - 945 tokens: 2.65 s vs 0.122 s → 21.7×
    - 1888 tokens: 5.81 s vs 0.179 s → 32.4×
    - 3774 tokens: 14.71 s vs 0.296 s → 49.7×
- Size and compression
  - Analytic fp16 KV size ≈ 0.148 MB/token (36-layer model example); 3774 tokens → ~557 MB.
  - Per-tensor symmetric int8 halves the artifact size (for a 692-token example, 102.1 MB → 51.1 MB) but breaks token-exactness (only 16/32 tokens match before divergence in that probe).
- Cost model and break-even
  - From-scratch per-call cost = C_prefill.
  - Reuse per-call cost over N reads = C_prefill / N + C_load; the break-even reuse count is N* = C_prefill / (C_prefill − C_load) ≈ 1 when C_load ≪ C_prefill, so reuse pays off by the second read.
  - With storage rate σ, egress rate β, artifact size s, and hosting horizon T, the amortized per-load cost rises by βs + σsT/N; heavy reuse shrinks the storage term, and compression shrinks both terms through s.
- Economics & deployment
  - Shipping artifacts to consumers fails economically under lossless transfer because egress (commodity price $0.09/GB) can cost more than the prefill it saves; hosting provider-side removes egress.
  - Example (serving one hot 3774-token document to N = 80M agents):
    - Re-prefill today: ~$1,509,600
    - Hosted, priced at a 0.1 × p_in cache-read tariff (one tenth of the input-token price p_in): $150,960 (10× user discount compared with re-prefill)
    - Hosted, physical compute cost (measured C_reuse / C_prefill = 1/49.7): ~$30,370 (49.7× less)
    - Ship raw fp16 (egress): ~$3.9M (loses vs. re-prefill)
    - Ship int8 (lossy): ~$2.0M (loses; inexact)
  - The commonly seen 0.1 × p_in cache-read tariff gives users a ~10× discount; the measured compute ratios (8.6–49.7×) imply substantial provider margin on popular documents, growing with context length.
- Scope and limitations
  - The study is restricted to exact reuse of a single shared prefix (no cross-fragment fusion).
  - Exact reuse requires the same model and dtype for publisher and consumer (artifacts are model-bound).
  - Measurements come from one device and one model; absolute numbers are device- and model-dependent.
  - Lossy compression can reduce size but forfeits exactness; lossless compression of KV is an open problem.
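The cost model and the 80M-agent example above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the measured times stand in as a proxy for compute cost, and `PRICE_PER_TOKEN` is an assumption back-derived from the quoted re-prefill total (1,509,600 / (80M × 3774) = $5 per million input tokens). No storage rate σ is given in the summary, so it defaults to zero here.

```python
def break_even_reads(c_prefill: float, c_load: float) -> float:
    """N* solving c_prefill / N + c_load = c_prefill."""
    if c_load >= c_prefill:
        raise ValueError("reuse cannot pay off when a load costs as much as a prefill")
    return c_prefill / (c_prefill - c_load)


def per_load_cost(c_prefill, c_load, n, beta=0.0, sigma=0.0, size_gb=0.0, horizon=0.0):
    """Amortized cost of one read: compute, plus egress, plus storage spread over n reads."""
    return c_prefill / n + c_load + beta * size_gb + sigma * size_gb * horizon / n


# Measured medians at 3774 tokens (seconds), used as a proxy for compute cost.
n_star = break_even_reads(c_prefill=14.71, c_load=0.296)

# Worked example. PRICE_PER_TOKEN is an ASSUMPTION back-derived from the quoted total.
TOKENS, READERS = 3774, 80_000_000
PRICE_PER_TOKEN = 5e-6   # dollars per input token (assumed)
RATIO = 49.7             # measured C_prefill / C_reuse at 3774 tokens
KV_SIZE_GB = 0.557       # ~557 MB fp16 artifact, in decimal gigabytes
EGRESS_PER_GB = 0.09     # commodity egress price quoted above

prefill_per_read = TOKENS * PRICE_PER_TOKEN
totals = {
    "re_prefill": READERS * prefill_per_read,
    "hosted_tariff": READERS * 0.1 * prefill_per_read,
    "hosted_physical": READERS * per_load_cost(
        prefill_per_read, prefill_per_read / RATIO, READERS
    ),
    "shipped_fp16_egress": READERS * EGRESS_PER_GB * KV_SIZE_GB,
}

print(f"break-even reads N* = {n_star:.3f}")
for name, dollars in totals.items():
    print(f"{name:>20}: ${dollars:,.0f}")
```

Under these assumptions N* comes out just above 1, so two reads are enough, and the egress bill alone exceeds the re-prefill bill. The shipped figure depends on the gigabyte convention used, so it lands near rather than exactly on the quoted ~$3.9M.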
Data & Methods
- Model and hardware
  - Qwen3-4B, fp16; measurements on Apple M1 Pro (MPS).
- Algorithms (implemented/open-sourced)
  - SAVEKV: the publisher runs one forward pass on the context c = x_{1:L} and serializes per-layer (K, V) with metadata (model id, dtype, L).
  - GENERATEFROMKV: the consumer loads the artifact, asserts that model and dtype match, assigns the correct position ids and a full attention mask, and decodes continuations from the loaded KV (no re-prefill).
- Experiments
  - Correctness: greedy decoding token match (24/24) and logits-level check (max abs diff ≈ 0.02).
  - Compute timing: median of 5 runs for a full prefill vs. a single continuation step over a resident KV cache, for contexts of 255–3774 tokens (disk I/O excluded).
  - Artifact size: analytic fp16 size and an empirical int8 probe of the compression vs. exactness trade-off.
- Cost modeling
  - Simple per-call formulas extended with storage (σ), egress (β), artifact size (s), and hosting horizon (T).
  - Worked example computing costs for large-scale serving (80M consumers), combining measured compute ratios with typical egress prices.
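To make the SAVEKV / GENERATEFROMKV split concrete, here is a toy sketch in pure Python. It is not the paper's implementation: it uses a single attention head with random weights and no positional encoding, and the names `save_kv`, `generate_from_kv`, `from_scratch`, and the model id `toy-attn-v0` are all hypothetical. It only illustrates the two ideas described above: the artifact carries metadata that the consumer checks, and attending over loaded K/V gives the same result as recomputing K/V for the whole context.

```python
import json
import math
import random

D = 4  # toy hidden size


def make_weights(seed=0):
    rng = random.Random(seed)
    return {name: [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
            for name in ("wq", "wk", "wv")}


def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Softmax attention of one query over all keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[i] for e, v in zip(exps, values)) for i in range(D)]


def save_kv(weights, context, model_id, dtype):
    """Publisher: one pass over the context, then serialize K/V plus metadata."""
    return json.dumps({
        "model_id": model_id, "dtype": dtype, "length": len(context),
        "keys": [matvec(weights["wk"], x) for x in context],
        "values": [matvec(weights["wv"], x) for x in context],
    })


def generate_from_kv(weights, blob, new_token, model_id, dtype):
    """Consumer: check metadata, then attend over the loaded K/V (no re-prefill)."""
    art = json.loads(blob)
    if art["model_id"] != model_id or art["dtype"] != dtype:
        raise ValueError("KV artifact is bound to a different model or dtype")
    position = art["length"]  # the new token sits at index L
    keys = art["keys"] + [matvec(weights["wk"], new_token)]
    values = art["values"] + [matvec(weights["wv"], new_token)]
    return attend(matvec(weights["wq"], new_token), keys, values), position


def from_scratch(weights, context, new_token):
    """Baseline: recompute K/V for the whole sequence."""
    seq = context + [new_token]
    keys = [matvec(weights["wk"], x) for x in seq]
    values = [matvec(weights["wv"], x) for x in seq]
    return attend(matvec(weights["wq"], new_token), keys, values)


weights = make_weights()
rng = random.Random(1)
context = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(6)]
new_token = [rng.uniform(-1, 1) for _ in range(D)]

blob = save_kv(weights, context, model_id="toy-attn-v0", dtype="float64")
reused, position = generate_from_kv(weights, blob, new_token, "toy-attn-v0", "float64")
baseline = from_scratch(weights, context, new_token)
max_diff = max(abs(a - b) for a, b in zip(reused, baseline))
print(f"position={position}, max abs diff={max_diff:.1e}")
```

A real transformer adds per-layer caches, position ids, and an attention mask covering the loaded prefix, which is where the position and mask handling described for GENERATEFROMKV matters.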
Implications for AI Economics
- New market layer: a “prefill CDN” where publishers publish KV artifacts and agents pay per cache-read (hosted by providers) is economically attractive in ecosystems that standardize on model + dtype.
- Provider economics and margins
  - Because C_reuse ≪ C_prefill, providers can price cache-reads below the user’s re-prefill cost but above their marginal C_reuse and still capture large margins on popular content.
  - Existing cache-read tariffs (e.g., 0.1 × p_in) sit within the measured envelope: from 485 tokens up, the measured 13.9–49.7× compute savings exceed the 10× user discount and leave room for provider profit, while the 8.6× measured at 255 tokens falls slightly short of it.
- Incentives & market structure
  - Publishers can monetize one-time prefill compute; providers can monetize hosting and fast cache-reads.
  - Wide deployment requires standardized model identifiers, dtype agreements, provenance and invalidation mechanisms (for when models update), and an agent-payment rail for cross-party settlement.
- Open problems that matter economically
  - Lossless or high-quality selective compression that preserves token-exactness (or guarantees bounded divergence); this would enable shipping artifacts or reduce hosting storage and egress exposure.
  - Cross-model / cross-dtype interoperability or migration strategies (how to handle model upgrades and invalidate or reissue artifacts).
  - Pricing, billing, and provenance standards for cross-party KV sharing and payments.
  - Fusion methods that recover much of the benefit for non-prefix reuse while bounding accuracy loss; this would expand the addressable reusable content beyond perfect prefixes.
- Deployment guidance
  - Host KV in provider datacenters (avoiding per-load egress) and bill reads; heavy reuse and longer documents yield larger absolute savings.
  - Expect immediate ROI for popular documents (N ≈ 2 suffices) and increasing returns as reuse scales.
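The relationship between the 0.1 × p_in tariff and the measured compute ratios can be sketched per context length. The calculation below rests on a strong simplifying assumption that is not in the source: the price of a full prefill is taken to equal the provider's compute cost for it, so the share of cache-read revenue left after reuse compute is 1 − (C_reuse / C_prefill) / 0.1. The function name `margin_share` is hypothetical.

```python
# (tokens, prefill seconds, reuse-step seconds), as reported in the measurements above
MEASURED = [
    (255, 0.67, 0.078),
    (485, 1.29, 0.093),
    (945, 2.65, 0.122),
    (1888, 5.81, 0.179),
    (3774, 14.71, 0.296),
]
TARIFF = 0.1  # cache-read price as a fraction of the input-token price


def margin_share(prefill_s: float, reuse_s: float, tariff: float = TARIFF) -> float:
    """Share of cache-read revenue left after reuse compute.

    ASSUMPTION: a full prefill is priced at its compute cost, so the
    reuse cost as a fraction of that price is reuse_s / prefill_s.
    """
    return 1.0 - (reuse_s / prefill_s) / tariff


shares = {tokens: margin_share(p, r) for tokens, p, r in MEASURED}
for tokens, share in shares.items():
    print(f"{tokens:>5} tokens: {share:+.1%} of tariff revenue")
```

Under this assumption the share is negative at 255 tokens and rises to roughly 80% at 3774 tokens, which matches the observation that the benefit concentrates in long, popular documents. Real prices include markup over compute cost, so actual margins would differ.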
Short summary: publishing and hosting precomputed KV caches (prefill CDN) is technically feasible, token-exact in the prefix case, and economically compelling for popular, long documents when the KV is hosted provider-side. Key engineering and market work remains around compression, standardization, model upgrade handling, and cross-party payments.
Assessment
Claims (8)
| Claim | Direction | Outcome | Confidence & Evidence | Details |
|---|---|---|---|---|
| Right now, across the world, AI agents recompute a document's KV cache from scratch each time a document is read, so many agents redundantly re-run the compute-intensive prefill step on identical text. (Category: Organizational Efficiency) | negative | redundant_compute_work (re-running prefill across agents) | Reading fidelity: high; study strength: low | |
| Loading a publisher-precomputed KV cache and continuing is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. (Category: Output Quality) | null_result | output_equivalence (token-level and logits-level match between reuse and full prefill) | Reading fidelity: high; study strength: medium | n=24; 24/24 greedy tokens, and at the logits level |
| On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2). (Category: Organizational Efficiency) | positive | compute_cost (reuse vs prefill) | Reading fidelity: high; study strength: medium | 9-50x cheaper in compute than prefill |
| Shipping the KV (client-side delivery) fails because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. (Category: Adoption Rate) | negative | egress_cost_vs_prefill_savings | Reading fidelity: medium; study strength: medium | per-load egress costs more than the prefill it saves |
| Hosting the precomputed KV provider-side (removing egress) enables reuse without the egress cost, analogous to production prompt-caching. (Category: Adoption Rate) | positive | egress_cost_elimination (by provider-side hosting) | Reading fidelity: medium; study strength: medium | removes egress entirely |
| Serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). (Category: Firm Revenue) | positive | aggregate_cost_savings (prefill vs reuse) for a high-demand document | Reading fidelity: high; study strength: medium | n=80000000; $1.5M to re-prefill but only $0.03M of reuse compute (49.7x less) |
| APIs that charge a 0.1x cache-read tariff pass a 10x discount to users within this measured envelope; the measured ~50x compute saving exceeds that 10x, leaving provider margin measured in millions of dollars per popular document. (Category: Firm Revenue) | mixed | user_price_discount and provider_margin (difference between compute saving and tariff pass-through) | Reading fidelity: medium; study strength: medium | 0.1x cache-read tariff -> 10x discount; measured ~50x compute saving |
| Lossless KV compression and a cross-party payment layer remain open problems for implementing an agent-native prefill CDN. (Category: Adoption Rate) | null_result | technical_feasibility (lossless compression) and payments_infrastructure | Reading fidelity: high; study strength: speculative | lossless KV compression and a cross-party payment layer are open problems |