Scaling laws work because 'compute' abstracts away implementation details — but that abstraction also fuels a persistent efficiency race: as loss improvements flatten, progress depends on continual cost and systems innovation to translate real resources into logical compute.
Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.
Summary
Main Finding
Classical compute–loss scaling laws remain predictive because their compute variable is an abstraction — "logical compute" — that suppresses many implementation details. Once you separate logical compute from the efficiency with which it is delivered (Elogical), diminishing returns in the static scaling law become rising operational burden in practice. Continued progress therefore requires repeated, sustained efficiency doublings in hardware, algorithms, and systems; the calendar-time pace of improvement depends jointly on the loss exponent κ and the annual efficiency-doubling rate β.
Key Points
- Interpretive claim: The C in L(C) ≈ E + K C^(−κ) is best read as logical compute — an abstract, model-side measure (dense, uniform-precision FLOP-equivalents) — not as a specific physical FLOP count tied to particular hardware/precision/kernels.
- Hidden efficiency: The practical cost of realizing logical compute depends on an efficiency term Elogical (logical FLOPs per joule, or per unit of physical resource). The physical resource burden (energy, time, power, systems effort) is PT = Clogical / Elogical.
- Time-indexed extension: If efficiency doubles at rate β (doublings per year) and the yearly physical resource budget is roughly constant at P0, cumulative logical compute by time t is C(t) = C0 [1 + (2^(βt) − 1) / (β ln 2)], where C0 = E0 P0 is the initial-year logical-compute throughput.
- Calendar-time loss dynamics: Combining C(t) with L(C) ≈ E + K C^(−κ) gives the relative excess loss over time, X(t) = (L(t) − E) / (L(0) − E) = [1 + (2^(βt) − 1) / (β ln 2)]^(−κ), which equals 1 at t = 0 because C(0) = C0. Progress over years therefore depends on both κ and β: a small κ (strong diminishing returns) can be offset by a large β (fast efficiency improvements).
- Operational meaning of diminishing returns: To reach target loss Ltarget > E, Clogical(Ltarget) ∝ (Ltarget − E)^(−1/κ), and physical burden scales as PT(Ltarget) ∝ (Ltarget − E)^(−1/κ) / Elogical. Therefore approaching the irreducible floor rapidly increases operational costs unless Elogical improves.
- Why scaling laws "travel": Because they abstract over implementation details, the same logical-compute → loss relation holds across architectures/precisions/optimizations; improvements in those implementation details show up as increases in Elogical, allowing continued movement along the same law rather than breaking it.
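The time-indexed formulas in the key points above can be evaluated directly. The following is a minimal Python sketch, not code from the paper: the function names are hypothetical, C0 is normalized to 1, and the values of κ and β in the usage loop are illustrative assumptions rather than fitted numbers.

```python
import math

def cumulative_logical_compute(t, beta, c0=1.0):
    """C(t) = C0 * [1 + (2^(beta*t) - 1) / (beta * ln 2)], as stated in the summary."""
    if beta == 0:
        # Limit beta -> 0: no efficiency growth, compute accrues linearly.
        return c0 * (1.0 + t)
    return c0 * (1.0 + (2.0 ** (beta * t) - 1.0) / (beta * math.log(2)))

def relative_excess_loss(t, beta, kappa):
    """X(t) = [C(t) / C0]^(-kappa): excess loss relative to its value at t = 0."""
    return cumulative_logical_compute(t, beta) ** (-kappa)

def years_to_reach(x_target, beta, kappa):
    """Invert X(t) = x_target for t (assumes 0 < x_target <= 1 and beta > 0)."""
    growth = x_target ** (-1.0 / kappa)  # required ratio C(t) / C0
    return math.log2(1.0 + (growth - 1.0) * beta * math.log(2)) / beta

# Illustrative values only (assumptions, not estimates from the paper).
kappa = 0.05
for beta in (0.5, 1.0, 2.0):
    row = [round(relative_excess_loss(t, beta, kappa), 4) for t in (0, 1, 5, 10)]
    print(f"beta={beta}: X(t) at t=0,1,5,10 -> {row}")
```

Holding κ fixed and varying β in the loop shows the trade-off described above: a faster doubling rate lowers X(t) at every horizon, even though the static loss curve is unchanged.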
Data & Methods
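- Method: theoretical and interpretive analysis plus a minimalist dynamic model. It starts from the empirical separable loss model L(N,D) = E + A N^(−α) + B D^(−β), with N the parameter count and D the number of training tokens, and uses the compute approximation C ∝ N D to derive the classical compute-only law with exponent κ = αβ/(α+β). In this expression β is the data exponent of the loss model, which is distinct from the efficiency-doubling rate β used in the time-indexed extension.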
- Extension: makes Elogical explicit and models Elogical(t) = E0 2^(β t) (β = doublings per year). Assumes a roughly constant physical resource contribution per year (P0) and integrates logical-compute throughput over time to get C(t).
- Empirical grounding: builds on the empirical scaling-law literature (e.g., Kaplan et al. 2020; Hoffmann et al. 2022) and on recent extensions, cited in the paper, that incorporate precision, sparsity, and inference-aware metrics. The paper does not fit new empirical data; it uses established empirical forms to motivate the interpretive extension.
- Key assumptions/limitations:
  - Logical compute is measured against a dense, uniform-precision reference; engineering optimizations are treated as changes in Elogical, not changes to the scaling law.
  - Assumes (for tractability) a constant annual physical resource budget and a steady exponential (doubling) improvement in efficiency.
  - Ignores some economic feedbacks (e.g., changing willingness to pay, supply constraints, capital reallocation) and potential structural breaks in scaling behavior.
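The compute-only exponent κ = αβ/(α+β) quoted in the method can be checked numerically. The sketch below is an illustration, not the paper's code: it minimizes the excess loss A N^(−α) + B D^(−β) under the constraint N D = C using the first-order condition, then measures the log-log slope between two budgets. The data exponent is named `beta_d` to keep it apart from the efficiency-doubling rate, and the constants A, B, α, and the data exponent are hypothetical round values chosen only for the demonstration.

```python
import math

def optimal_excess_loss(c, a, b, alpha, beta_d):
    """Minimum of a*N^-alpha + b*D^-beta_d subject to N*D = c."""
    # Substituting D = c/N and setting the derivative in N to zero gives
    # N^(alpha + beta_d) = alpha*a*c^beta_d / (beta_d*b).
    n_opt = (alpha * a / (beta_d * b)) ** (1.0 / (alpha + beta_d)) \
        * c ** (beta_d / (alpha + beta_d))
    d_opt = c / n_opt
    return a * n_opt ** (-alpha) + b * d_opt ** (-beta_d)

def fitted_exponent(c_lo, c_hi, **params):
    """Negative log-log slope of compute-optimal excess loss between two budgets."""
    lo = optimal_excess_loss(c_lo, **params)
    hi = optimal_excess_loss(c_hi, **params)
    return -(math.log(hi) - math.log(lo)) / (math.log(c_hi) - math.log(c_lo))

def kappa_closed_form(alpha, beta_d):
    """kappa = alpha*beta_d / (alpha + beta_d), the exponent quoted in the summary."""
    return alpha * beta_d / (alpha + beta_d)

# Hypothetical constants for illustration only.
params = dict(a=400.0, b=400.0, alpha=0.34, beta_d=0.28)
print("numerical slope:", fitted_exponent(1e18, 1e24, **params))
print("closed form    :", kappa_closed_form(params["alpha"], params["beta_d"]))
```

Both terms of the optimized loss scale with the same power of C, which is why the measured slope matches the closed form regardless of the prefactors A and B.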
Implications for AI Economics
- The locus of competition: Because the scaling law fixes what logical compute buys, firms compete primarily on the efficiency stack (hardware, power provisioning, kernels, quantization, sparsity, routing, systems). Margins and rents accrue to entities that can deliver more logical compute per unit resource.
- Investment priorities: Returns to investment are high in R&D and capital that raise Elogical (better chips, energy efficiency, compilers/kernels, sparsity/quantization methods, systems software) because each efficiency doubling unlocks further movement along the same compute–loss curve.
- Operational costs and pricing: Diminishing returns translate into rising operational (energy, time, engineering) costs to reach incremental capability; this shapes model sizing decisions, service pricing, and adoption thresholds for compute-intensive capabilities.
- Market structure and specialization: Persistent pressure for efficiency favors both scale (large datacenters, energy contracts) and specialized engineering (application-specific accelerators, algorithmic sparsity). The paper’s companion work suggests this can raise the bar for specialization and tilt optimal allocation toward programmable substrates with favorable efficiency growth prospects.
- Resource & policy considerations: Energy supply, power caps, and infrastructural constraints become binding for long-run progress. Policy levers that affect energy availability, data-center siting, or R&D incentives (subsidies, standards) can materially affect β and thus the pace of AI capability growth.
- Risk & forecasting: Forecasts of AI capability over calendar time must account for rates of efficiency improvement, not only available nominal compute budgets. Scenarios with slowed efficiency growth (lower β) could sharply reduce realized capability progress even if logical-compute scaling laws still hold.
- Economic externalities: Because improving Elogical can shift who captures value (hardware vendors, cloud providers, model developers), changes in the efficiency stack have redistributive effects across the AI value chain.
Short takeaway: scaling laws tell you how much logical compute matters for loss, but predicting capability and economic outcomes over time requires modeling how efficiently that logical compute can be produced. Continued, economically meaningful AI progress hinges on repeated efficiency doublings as much as on the static shape of the loss–compute curve.
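The "efficiency doublings" framing can be made concrete with a short calculation. This is a minimal sketch under the relation Clogical ∝ (Ltarget − E)^(−1/κ) given in the key points; the function names and the `budget_growth` parameter are hypothetical, introduced here only to separate growth in the physical budget from growth in Elogical.

```python
import math

def compute_multiplier(excess_reduction, kappa):
    """Growth in logical compute needed to shrink excess loss by `excess_reduction`."""
    # From C_logical ∝ (L_target - E)^(-1/kappa).
    return excess_reduction ** (1.0 / kappa)

def efficiency_doublings_needed(excess_reduction, kappa, budget_growth=1.0):
    """Doublings of Elogical required when the physical budget grows by `budget_growth`.

    Logical compute is efficiency times physical resources, so whatever the
    budget does not supply must come from efficiency. Clamped at zero.
    """
    doublings = math.log2(excess_reduction) / kappa - math.log2(budget_growth)
    return max(0.0, doublings)

# Illustrative exponents (assumptions): how many doublings to halve excess loss
# with a flat physical budget?
for kappa in (0.05, 0.1, 0.2):
    print(f"kappa={kappa}: {efficiency_doublings_needed(2.0, kappa):.1f} doublings")
```

With a flat physical budget, halving the excess loss takes 1/κ efficiency doublings, so an exponent of 0.1 implies about ten doublings. That is the sense in which a flat loss curve turns into sustained pressure on the efficiency stack.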
Assessment
Claims (9)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. [Output Quality] | positive | high | training loss | 0.12 |
| Scaling laws make progress predictable, albeit at a declining rate. [Research Productivity] | positive | high | predictability of progress (model performance as compute increases) | 0.12 |
| Scaling laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. [Research Productivity] | positive | high | generalizability (occurrence) of scaling-law patterns across model families and regimes | 0.12 |
| Despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. [Firm Productivity] | positive | high | cost per token and continued progress (performance improvements over time) | 0.12 |
| The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work. [Other] | mixed | high | definition/interpretation of the 'compute' variable | 0.02 |
| The practical burden of scaling depends on how efficiently real resources are converted into that (logical) compute. [Training Effectiveness] | mixed | high | efficiency of converting real resources into logical compute | 0.02 |
| This abstraction (logical compute) helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. [Innovation Output] | mixed | medium | extent of efficiency-driven innovation and cross-setting generality of scaling laws | 0.01 |
| Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. [Organizational Efficiency] | mixed | high | required number of efficiency doublings to sustain productive scaling | 0.02 |
| Diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings. [Innovation Output] | negative | high | pressure for cost reduction and need for system-level innovation/breakthroughs | 0.02 |