A Shannon-based scaling law shows why bigger or longer-trained LLMs sometimes get worse: if model size or data growth doesn't preserve signal-to-noise ratio, noise can dominate and induce U-shaped performance loss. The proposed law fits Pythia and OLMo2 loss basins better than classical power laws and correctly extrapolates to an unseen 12B model under multiple perturbations.
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
Summary
Main Finding
The paper proposes the Shannon Scaling Law: a unified, noise-aware scaling formulation that treats an LLM as a noisy communication channel (Shannon–Hartley analogy). It maps model size (N) to channel bandwidth and training tokens (D) to signal power, explicitly models multiple noise sources, and defines an LLM capacity C_LLM = a N^α log2(1 + b D^β / [c (D N)^γ + d D^δ + e]). Test loss is taken as L = 1 / C_LLM. This formulation (i) recovers standard monotonic power-law scaling in the high-SNR limit and (ii) predicts U-shaped loss basins (catastrophic overtraining, quantization-induced degradation) when noise dominates. Empirically it fits perturbed and unperturbed loss landscapes much better than prior laws, generalizes (extrapolates to unseen larger checkpoints / token budgets), and explains when scaling can reduce rather than improve downstream performance.
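The capacity and loss expressions above can be written out directly. The following is a minimal sketch, not the authors' code: the function names are made up for illustration, and the constants in `PLACEHOLDER` are hypothetical values chosen only so the example runs, not fitted values from the paper.

```python
import math

def shannon_capacity(N, D, *, a, b, c, d, e, alpha, beta, gamma, delta):
    """C_LLM = a * N^alpha * log2(1 + signal / noise), as stated in the summary."""
    signal = b * D ** beta                                    # learning signal from tokens
    noise = c * (D * N) ** gamma + d * D ** delta + e         # interaction + data + irreducible
    return a * N ** alpha * math.log2(1.0 + signal / noise)

def shannon_loss(N, D, **constants):
    """Test loss modeled as the reciprocal of capacity, L = 1 / C_LLM."""
    return 1.0 / shannon_capacity(N, D, **constants)

# Hypothetical constants for illustration only (NOT taken from the paper).
PLACEHOLDER = dict(a=0.05, b=1.0, c=1.5e-10, d=1e-6, e=1.0,
                   alpha=0.1, beta=0.3, gamma=0.5, delta=0.4)

if __name__ == "__main__":
    N = 1e9  # one billion parameters
    for D in (1e9, 1e11, 1e13):
        print(f"D={D:.0e}  predicted loss={shannon_loss(N, D, **PLACEHOLDER):.4f}")
```

With these placeholder values the predicted loss at a fixed model size is lower at 1e11 tokens than at either 1e9 or 1e13 tokens, which is the basin shape the summary describes.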
Key Points
- Conceptual shift: model training = information transmission over a noisy channel. Bandwidth ↔ model size; signal ↔ training tokens; noise ↔ data/model/irreducible sources.
- Shannon-form capacity for LLMs:
  - numerator (signal): b D^β
  - denominator (noise): c (D N)^γ (model-interaction noise) + d D^δ (data-induced noise) + e (irreducible)
  - capacity scales multiplicatively with N^α (bandwidth) and logarithmically with SNR as in Shannon–Hartley.
- Loss mapping: test loss ≈ 1 / C_LLM, which is strongly nonlinear and yields U-shaped loss vs. N or D when noise grows faster than signal.
- Empirical phenomena captured:
  - Catastrophic overtraining: increasing tokens can eventually worsen downstream SFT loss.
  - Quantization-induced degradation (QiD): lower-precision inference can create loss basins; larger models may be more susceptible under fixed precision.
  - Progressive sensitivity to noise: for fixed perturbation magnitude, larger token budgets amplify degradation.
- Performance vs. baselines:
  - Outperforms monotonic power-law and recent perturbation-aware laws across Gaussian noise, supervised fine-tuning (SFT), and quantization tests.
  - High fit quality: e.g., average R^2 ≈ 0.961±0.03 (Pythia) and 0.9585±0.06 (OLMo2) across Gaussian noise levels; SFT average R^2: GSM8K 0.936, SiQA 0.916, StarCoder 0.937.
  - Extrapolation: fit on ≤6.9B Pythia models with ≤180B tokens predicted the unseen 12B model up to 307B tokens with pooled R^2 = 0.847; monotonic baselines failed.
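The phrase "when noise grows faster than signal" can be made concrete from the stated capacity form alone. The following is a worked sketch from that functional form, not a derivation taken from the paper. Write

$$
\mathrm{SNR}(N, D) = \frac{b D^{\beta}}{c (DN)^{\gamma} + d D^{\delta} + e},
\qquad
L(N, D) = \frac{1}{a N^{\alpha} \log_2\bigl(1 + \mathrm{SNR}(N, D)\bigr)}.
$$

At fixed $N$, the irreducible term dominates the noise for small $D$, so $\mathrm{SNR} \approx b D^{\beta} / e \to 0$ as $D \to 0$. For large $D$,

$$
\mathrm{SNR} \approx \frac{b D^{\beta}}{c N^{\gamma} D^{\gamma} + d D^{\delta}} \to 0
\quad \text{if } \max(\gamma, \delta) > \beta .
$$

In that case the loss diverges at both ends of the token axis and must have an interior minimum, which is the U-shaped basin. If instead $\max(\gamma, \delta) < \beta$, the SNR keeps growing and the loss keeps falling at large $D$.

At fixed $D$, large $N$ gives $\mathrm{SNR} \approx (b/c)\, D^{\beta - \gamma} N^{-\gamma} \to 0$. Using $\log_2(1 + x) \approx x / \ln 2$ for small $x$,

$$
C_{\mathrm{LLM}} \approx \frac{a b}{c \ln 2}\, D^{\beta - \gamma} N^{\alpha - \gamma},
\qquad
L \propto N^{\gamma - \alpha},
$$

so the loss eventually rises with model size exactly when $\gamma > \alpha$.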
Data & Methods
- Models:
  - Pythia-dedup suite: 160M, 410M, 1B, 2.8B, 6.9B, 12B (stage-1 checkpoints; pretraining on deduplicated Pile).
  - OLMo2 series: 1B, 7B, 13B, 32B (stage-1).
- Evaluation target: test loss on wikitext2; SFT downstream evaluation on GSM8K (math), SiQA (QA), StarCoder-Python (code).
- Perturbations studied:
  - Gaussian weight noise injected at controlled SNRs (40, 30, 20, 15, 12, 10 dB). Noise is added as N(0, σ_n^2), with σ_n^2 derived from the weight power and the target SNR.
  - Supervised fine-tuning (SFT) used as a perturbation via varying learning rates (1e-5 to 6e-4) while performing full fine-tuning across datasets.
  - Quantization via GPTQ to 4-bit, 3-bit, 2-bit checkpoints and subsequent evaluation.
- Baselines compared:
  - OpenAI power-law, Chinchilla additive law, QiD perturbation-aware law, Law of Precision (exponential degradation), and symmetric/asymmetric Chinchilla-derived variants.
- Fitting approach:
  - All constants (a, b, c, d, e, α, β, γ, δ) are positive and fitted to observed losses. Goodness-of-fit measured by R^2 across perturbation regimes.
- Key empirical outcomes:
  - Shannon law remains robust across perturbations and scales; other laws degrade or produce negative R^2 under strong perturbations.
  - Visualized loss landscapes show transition from open monotonic contours (high SNR) to closed U-shaped basins (low SNR / high perturbation).
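The Gaussian weight-noise perturbation listed above can be sketched as follows. The summary only says that σ_n^2 is derived from weight power and SNR; the specific formula used here, σ_n^2 = mean(w^2) / 10^(SNR_dB / 10), is the standard decibel definition and is an assumption about the paper's procedure. Function names are illustrative, and a plain list of floats stands in for a weight tensor.

```python
import math
import random

def mean_power(values):
    """Average squared magnitude of a sequence of floats."""
    return sum(v * v for v in values) / len(values)

def noise_variance(weights, snr_db):
    """Assumed rule: sigma_n^2 = weight power / 10^(SNR_dB / 10)."""
    return mean_power(weights) / (10.0 ** (snr_db / 10.0))

def perturb(weights, snr_db, rng):
    """Return a copy of the weights with zero-mean Gaussian noise at the target SNR."""
    sigma = math.sqrt(noise_variance(weights, snr_db))
    return [w + rng.gauss(0.0, sigma) for w in weights]

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between clean weights and their perturbed copy."""
    noise = [n - c for c, n in zip(clean, noisy)]
    return 10.0 * math.log10(mean_power(clean) / mean_power(noise))

if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(20000)]  # stand-in for model weights
    for snr_db in (40, 30, 20, 15, 12, 10):                   # SNR levels from the summary
        noisy = perturb(weights, snr_db, rng)
        print(snr_db, round(measured_snr_db(weights, noisy), 2))
```

Lower dB values mean proportionally larger noise, so the 10 dB setting is the harshest perturbation in the list.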
Implications for AI Economics
- Re-evaluating "scale at all costs": The Shannon view implies scaling model size or training tokens without maintaining SNR (data quality, precision, optimization stability) can produce negative returns — wasted compute, longer runtimes, and worse downstream performance. Economic decisions must consider both capacity and noise.
- Cost–benefit for scaling investments:
  - Marginal benefit of increasing N or D depends on SNR regime. In high-SNR regimes, returns resemble standard power-law diminishing returns; in low-SNR regimes, returns can reverse.
  - The law provides a quantitative predictor of where marginal returns turn negative (loss basin boundaries), enabling planners to avoid futile compute or data expenditure.
- Precision / hardware trade-offs:
  - Lower-precision (cheaper) inference/training hardware can induce QiD; quantization savings must be weighed against increased risk of performance collapse for larger models. The model helps value the premium for higher bit-width or better error-correction to keep SNR above critical thresholds.
- Data curation and SNR management:
  - Increasing token count without improving data quality increases data-induced noise (d D^δ). Investing in data cleaning, deduplication, and higher-quality corpora can be more cost-effective than raw scaling.
- Fine-tuning & product deployment:
  - Aggressive fine-tuning (high LR or extended SFT) can act as a perturbation that reduces effective capacity; firms should budget for validation checkpoints, early stopping, or SNR-preserving fine-tuning protocols to avoid catastrophic overtraining costs.
- Forecasting & decision tools:
  - The Shannon Scaling Law can be used as a forecasting tool for ROI on scale-up proposals (predicting when added compute/data will improve or harm performance), helping allocate R&D budgets and prioritize improvements (model size vs. data quality vs. precision).
- Market & competitive dynamics:
  - Smaller players can exploit SNR-aware strategies (better data, higher-precision inference, robust fine-tuning) to outperform naïvely larger-but-noisier models, affecting strategic decisions about whether to compete on raw parameter counts.
- Recommended operational rules (practical economics):
  - When scaling N, ensure proportional investments in data quality and noise-reduction (curation, regularization, precision).
  - Before large compute spends, fit a noise-aware scaling surrogate (using the paper’s form) to past checkpoints to estimate the risk of entering a low-SNR basin.
  - Use quantization-aware cost models that include expected performance loss from SNR reduction; compute savings must exceed expected revenue loss from degraded performance.
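As a rough illustration of using the law to locate where marginal returns turn negative, the sketch below scans a grid of token budgets at fixed model size and reports the budget with the lowest predicted loss and the first step at which adding tokens raises it. The constants in `K` are hypothetical placeholders, not fitted values from the paper; in practice they would come from fitting the law to one's own checkpoints. The helper names are likewise made up for this example.

```python
import math

def loss(N, D, k):
    """Predicted loss L = 1 / C_LLM for constants k (same form as in the summary)."""
    signal = k["b"] * D ** k["beta"]
    noise = k["c"] * (D * N) ** k["gamma"] + k["d"] * D ** k["delta"] + k["e"]
    return 1.0 / (k["a"] * N ** k["alpha"] * math.log2(1.0 + signal / noise))

def token_budget_with_lowest_loss(N, token_grid, k):
    """Grid point with the smallest predicted loss at model size N."""
    return min(token_grid, key=lambda D: loss(N, D, k))

def first_harmful_step(N, token_grid, k):
    """First grid point where adding tokens raises predicted loss, or None if never."""
    for prev, nxt in zip(token_grid, token_grid[1:]):
        if loss(N, nxt, k) > loss(N, prev, k):
            return nxt
    return None

# Hypothetical constants for illustration only (NOT from the paper).
K = dict(a=0.05, b=1.0, c=1.5e-10, d=1e-6, e=1.0,
         alpha=0.1, beta=0.3, gamma=0.5, delta=0.4)
TOKEN_GRID = [10 ** (8 + 0.1 * i) for i in range(61)]  # 1e8 .. 1e14 tokens, log-spaced

if __name__ == "__main__":
    N = 1e9
    print("lowest predicted loss at D =", token_budget_with_lowest_loss(N, TOKEN_GRID, K))
    print("first harmful step at D =", first_harmful_step(N, TOKEN_GRID, K))
```

A grid scan is used rather than a closed-form optimum because the fitted exponents are not known in advance; when the noise exponents stay below the signal exponent, `first_harmful_step` returns `None`, corresponding to the monotonic regime.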
Limitations to consider when applying economically:
- The law is empirical with fitted constants; transferring fits across architectures, datasets, or optimization setups requires caution.
- The noise model assumes AWGN-like behavior and particular functional forms for noise terms; domain-specific noise (e.g., label bias, distribution shifts) may require refinements.
- Primary evaluations used wikitext2 and selected SFT tasks; behavior on other downstream tasks or fully productionized pipelines may differ.
Summary: The Shannon Scaling Law gives AI decision-makers a principled, empirically validated way to predict when scale will pay off and when it will backfire. Incorporating noise and SNR in cost-benefit analyses can materially improve allocation of compute, data, and hardware precision investments.
Assessment
Claims (7)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. | negative | high | ability of prior scaling laws to explain non-monotonic performance phenomena (e.g., catastrophic overtraining, quantization-induced degradation) | 0.12 |
| We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem, mapping model parameters to channel bandwidth and training tokens to signal power. | mixed | high | conceptual modeling of LLM training dynamics as information transmission (theoretical fit/expressiveness) | 0.02 |
| This Shannon perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. | negative | high | performance vs. scale behavior (transition from monotonic improvement to U-shaped degradation due to SNR effects) | 0.12 |
| We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. | positive | high | empirical behavior of models under perturbations (robustness and fit to the proposed scaling law) across tasks | 0.12 |
| The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R^2 scores and accurately capturing loss basins missed by prior approaches. | positive | high | goodness-of-fit (R^2) to observed loss/performance curves and ability to capture loss basins | 0.12 |
| Fitted on ≤6.9B Pythia models with ≤180B tokens, the Shannon Scaling Law predicts an unseen 12B model up to 307B tokens at pooled R^2=0.847, while monotonic baselines collapse. | positive | high | extrapolative predictive performance measured by pooled R^2 when predicting loss/performance for an unseen 12B model up to 307B tokens | pooled R^2=0.847; 0.12 |
| Monotonic baselines collapse when extrapolating beyond the training regime (e.g., predicting a 12B model up to 307B tokens) whereas the Shannon Scaling Law remains predictive. | negative | high | extrapolative predictive failure/success of baseline vs proposed scaling laws | 0.12 |