A Shannon-based scaling law shows why bigger or longer-trained LLMs sometimes get worse: if model size or data growth doesn't preserve signal-to-noise ratio, noise can dominate and induce U-shaped performance loss. The proposed law fits Pythia and OLMo2 loss basins better than classical power laws and correctly extrapolates to an unseen 12B model under multiple perturbations.

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma · May 22, 2026

arXiv · theoretical · medium evidence · 7/10 relevance

The paper introduces a Shannon-based scaling law that treats LLM training as information transmission and explains and predicts non-monotonic, noise-driven performance degradations (e.g., overtraining and quantization-induced drops), outperforming classical scaling laws in fits and extrapolation across Pythia and OLMo2 experiments.

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.

Summary

Main Finding

The paper proposes the Shannon Scaling Law: a unified, noise-aware scaling formulation that treats an LLM as a noisy communication channel (Shannon–Hartley analogy). It maps model size (N) to channel bandwidth and training tokens (D) to signal power, explicitly models multiple noise sources, and defines an LLM capacity C_LLM = a N^α log2(1 + b D^β / [c (D N)^γ + d D^δ + e]). Test loss is taken as L = 1 / C_LLM. This formulation (i) recovers standard monotonic power-law scaling in the high-SNR limit and (ii) predicts U-shaped loss basins (catastrophic overtraining, quantization-induced degradation) when noise dominates. Empirically it fits perturbed and unperturbed loss landscapes much better than prior laws, generalizes (extrapolates to unseen larger checkpoints / token budgets), and explains when scaling can reduce rather than improve downstream performance.
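As a minimal sketch, the capacity and loss formulas above can be evaluated directly. The function names and the constants in `ILLUSTRATIVE` are hypothetical and chosen only so the U-shape is visible; they are not the paper's fitted values.

```python
import math


def shannon_capacity(N, D, a, b, c, d, e, alpha, beta, gamma, delta):
    """C_LLM = a N^alpha log2(1 + b D^beta / (c (D N)^gamma + d D^delta + e))."""
    signal = b * D ** beta
    noise = c * (D * N) ** gamma + d * D ** delta + e
    return a * N ** alpha * math.log2(1.0 + signal / noise)


def shannon_loss(N, D, **constants):
    """Test loss modeled as the reciprocal of capacity, L = 1 / C_LLM."""
    return 1.0 / shannon_capacity(N, D, **constants)


# Hypothetical constants for illustration only (NOT fitted values from the paper).
ILLUSTRATIVE = dict(a=1.0, b=1.0, c=1e-20, d=1e-9, e=1.0,
                    alpha=0.1, beta=0.5, gamma=1.0, delta=1.0)

N = 1e9  # parameters
loss_small = shannon_loss(N, 1e6, **ILLUSTRATIVE)   # few tokens
loss_mid = shannon_loss(N, 1e9, **ILLUSTRATIVE)     # near the basin floor
loss_large = shannon_loss(N, 1e12, **ILLUSTRATIVE)  # noise terms dominate
```

With these made-up constants the noise exponents (γ = δ = 1) exceed the signal exponent (β = 0.5), so the predicted loss is lower at the middle token budget than at either extreme.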

Key Points

Conceptual shift: model training = information transmission over a noisy channel. Bandwidth ↔ model size; signal ↔ training tokens; noise ↔ data/model/irreducible sources.
Shannon-form capacity for LLMs:
- numerator (signal): b D^β
- denominator (noise): c (D N)^γ (model-interaction noise) + d D^δ (data-induced noise) + e (irreducible)
- capacity scales multiplicatively with N^α (bandwidth) and logarithmically with SNR as in Shannon–Hartley.
Loss mapping: test loss ≈ 1 / C_LLM, which is strongly nonlinear and naturally yields a U-shaped loss vs. N or D when noise grows faster than signal.
Empirical phenomena captured:
- Catastrophic overtraining: increasing tokens can eventually worsen downstream SFT loss.
- Quantization-induced degradation (QiD): lower-precision inference can create loss basins; larger models may be more susceptible under fixed precision.
- Progressive sensitivity to noise: for fixed perturbation magnitude, larger token budgets amplify degradation.
Performance vs. baselines:
- Outperforms monotonic power-law and recent perturbation-aware laws across Gaussian noise, supervised fine-tuning (SFT), and quantization tests.
- High fit quality: e.g., average R^2 ≈ 0.961±0.03 (Pythia) and 0.9585±0.06 (OLMo2) across Gaussian noise levels; SFT average R^2: GSM8K 0.936, SiQA 0.916, StarCoder 0.937.
- Extrapolation: fit on ≤6.9B Pythia models with ≤180B tokens predicted the unseen 12B model up to 307B tokens with pooled R^2 = 0.847; monotonic baselines failed.
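
The condition "noise grows faster than signal" in the key points can be made concrete from the stated capacity form. The following is a short derivation sketch, not taken from the paper: it assumes only the formula and the positivity of the constants as summarized here, holds N fixed, and asks when the SNR has an interior maximum in D.

```latex
\[
\mathrm{SNR}(D) = \frac{b D^{\beta}}{c (DN)^{\gamma} + d D^{\delta} + e},
\qquad
L = \frac{1}{a N^{\alpha} \log_2\bigl(1 + \mathrm{SNR}(D)\bigr)} .
\]
% L is decreasing in SNR, so for fixed N the loss is minimized where SNR is maximized.
\[
\frac{\partial \ln \mathrm{SNR}}{\partial \ln D}
= \beta - \frac{\gamma\, c (DN)^{\gamma} + \delta\, d D^{\delta}}
               {c (DN)^{\gamma} + d D^{\delta} + e} = 0
\;\Longleftrightarrow\;
(\gamma - \beta)\, c (DN)^{\gamma} + (\delta - \beta)\, d D^{\delta} = \beta e .
\]
```

If γ ≤ β and δ ≤ β, the left-hand side is never positive, so no stationary point exists and the predicted loss falls monotonically in D. If γ > β or δ > β, a finite D* satisfies the condition, and since the SNR tends to zero both as D → 0 and as D → ∞, tokens added beyond D* raise the predicted loss.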

Data & Methods

Models:
- Pythia-dedup suite: 160M, 410M, 1B, 2.8B, 6.9B, 12B (stage-1 checkpoints; pretraining on deduplicated Pile).
- OLMo2 series: 1B, 7B, 13B, 32B (stage-1).
Evaluation target: test loss on wikitext2; SFT downstream evaluation on GSM8K (math), SiQA (QA), StarCoder-Python (code).
Perturbations studied:
- Gaussian weight noise injected at controlled SNRs (40, 30, 20, 15, 12, 10 dB). Noise added as N(0, σ_n^2) where σ_n^2 derived from weight power and SNR.
- Supervised fine-tuning (SFT) used as a perturbation via varying learning rates (1e-5 to 6e-4) while performing full fine-tuning across datasets.
- Quantization via GPTQ to 4-bit, 3-bit, 2-bit checkpoints and subsequent evaluation.
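
The Gaussian weight-noise perturbation listed above can be sketched as follows. This assumes the standard decibel definition, σ_n^2 = P_w / 10^(SNR_dB / 10), with P_w taken as the mean squared weight; whether the paper computes weight power per tensor or globally is not stated here, and the function names are hypothetical.

```python
import math
import random


def noise_variance(weights, snr_db):
    """sigma_n^2 = P_w / 10^(SNR_dB / 10), where P_w is the mean squared weight.

    Assumption: the standard dB definition of SNR; the paper's exact
    normalization is not given in this summary.
    """
    power = sum(w * w for w in weights) / len(weights)
    return power / (10.0 ** (snr_db / 10.0))


def perturb(weights, snr_db, rng):
    """Return a copy of `weights` with i.i.d. N(0, sigma_n^2) noise added."""
    sigma = math.sqrt(noise_variance(weights, snr_db))
    return [w + rng.gauss(0.0, sigma) for w in weights]


# Toy "weight tensor" with unit power, perturbed at 20 dB.
rng = random.Random(0)
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(20000)]
noisy = perturb(weights, 20.0, rng)

# Empirical noise power, to compare with the target variance.
measured = sum((n - w) ** 2 for n, w in zip(noisy, weights)) / len(weights)
```

Lower SNR values (for example 10 dB) give a larger noise variance for the same weights, which corresponds to the stronger perturbation regimes in the list above.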
Baselines compared:
- OpenAI power-law, Chinchilla additive law, QiD perturbation-aware law, Law of Precision (exponential degradation), and symmetric/asymmetric Chinchilla-derived variants.
Fitting approach:
- All constants (a, b, c, d, e, α, β, γ, δ) are positive and fitted to observed losses. Goodness-of-fit measured by R^2 across perturbation regimes.
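
For reference, the goodness-of-fit metric can be sketched in a few lines. This uses the usual coefficient of determination, R^2 = 1 - SS_res / SS_tot, which is assumed to be the definition the paper uses; the toy numbers are invented to show that a monotonic prediction of a U-shaped curve can score below zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Equals 1 for a perfect fit, 0 for predicting the mean everywhere,
    and goes negative when the fit is worse than the mean.
    """
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


# Invented toy data: a U-shaped loss curve and a monotonically decreasing prediction.
observed = [3.0, 2.0, 1.0, 2.0, 3.0]
monotonic_pred = [3.0, 2.5, 2.0, 1.5, 1.0]

r2_perfect = r_squared(observed, observed)
r2_monotonic = r_squared(observed, monotonic_pred)
```

A negative value means the fitted curve explains the observations worse than a constant equal to their mean would.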
Key empirical outcomes:
- Shannon law remains robust across perturbations and scales; other laws degrade or produce negative R^2 under strong perturbations.
- Visualized loss landscapes show transition from open monotonic contours (high SNR) to closed U-shaped basins (low SNR / high perturbation).

Implications for AI Economics

Re-evaluating "scale at all costs": The Shannon view implies scaling model size or training tokens without maintaining SNR (data quality, precision, optimization stability) can produce negative returns — wasted compute, longer runtimes, and worse downstream performance. Economic decisions must consider both capacity and noise.
Cost–benefit for scaling investments:
- Marginal benefit of increasing N or D depends on SNR regime. In high-SNR regimes, returns resemble standard power-law diminishing returns; in low-SNR regimes, returns can reverse.
- The law provides a quantitative predictor of where marginal returns turn negative (loss basin boundaries), enabling planners to avoid futile compute or data expenditure.
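
One way to use the law as such a predictor is to scan token budgets at a fixed model size and check whether the predicted loss bottoms out inside the scanned range. The sketch below uses a simple log-spaced grid search; the helper names and both constant sets are hypothetical, and a real application would substitute constants fitted to one's own checkpoints.

```python
import math


def predicted_loss(N, D, k):
    """L = 1 / C_LLM with C_LLM in the Shannon form summarized in this article."""
    snr = k["b"] * D ** k["beta"] / (
        k["c"] * (D * N) ** k["gamma"] + k["d"] * D ** k["delta"] + k["e"]
    )
    return 1.0 / (k["a"] * N ** k["alpha"] * math.log2(1.0 + snr))


def best_token_budget(N, k, log10_lo=6.0, log10_hi=12.0, steps=600):
    """Grid-search D on a log scale; report the argmin and whether it is interior.

    An interior minimum means the scanned range contains a loss basin, so
    tokens beyond it are predicted to hurt. A minimum at the upper edge
    means loss is still falling across the whole range.
    """
    span = log10_hi - log10_lo
    grid = [10.0 ** (log10_lo + span * i / steps) for i in range(steps + 1)]
    losses = [predicted_loss(N, D, k) for D in grid]
    i_min = min(range(len(grid)), key=losses.__getitem__)
    return grid[i_min], 0 < i_min < steps


# Hypothetical constants (NOT fitted values): noise exponents above the signal exponent.
K_BASIN = dict(a=1.0, b=1.0, c=1e-20, d=1e-9, e=1.0,
               alpha=0.1, beta=0.5, gamma=1.0, delta=1.0)
# Same constants but noise exponents below the signal exponent: monotonic regime.
K_MONO = dict(K_BASIN, gamma=0.4, delta=0.4)

d_star, has_basin = best_token_budget(1e9, K_BASIN)
d_edge, mono_has_basin = best_token_budget(1e9, K_MONO)
```

A grid search is used here for transparency; it makes no assumption about the smoothness of the fitted curve and is cheap because the law is closed-form.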
Precision / hardware trade-offs:
- Lower-precision (cheaper) inference/training hardware can induce QiD; quantization savings must be weighed against increased risk of performance collapse for larger models. The model helps value the premium for higher bit-width or better error-correction to keep SNR above critical thresholds.
Data curation and SNR management:
- Increasing token count without improving data quality increases data-induced noise (d D^δ). Investing in data cleaning, deduplication, and higher-quality corpora can be more cost-effective than raw scaling.
Fine-tuning & product deployment:
- Aggressive fine-tuning (high LR or extended SFT) can act as a perturbation that reduces effective capacity; firms should budget for validation checkpoints, early stopping, or SNR-preserving fine-tuning protocols to avoid catastrophic overtraining costs.
Forecasting & decision tools:
- The Shannon Scaling Law can be used as a forecasting tool for ROI on scale-up proposals (predicting when added compute/data will improve or harm performance), helping allocate R&D budgets and prioritize improvements (model size vs. data quality vs. precision).
Market & competitive dynamics:
- Smaller players can exploit SNR-aware strategies (better data, higher-precision inference, robust fine-tuning) to outperform naïvely larger-but-noisier models, affecting strategic decisions about whether to compete on raw parameter counts.
Recommended operational rules (practical economics):
- When scaling N, ensure proportional investments in data quality and noise-reduction (curation, regularization, precision).
- Before large compute spends, fit a noise-aware scaling surrogate (using the paper’s form) to past checkpoints to estimate the risk of entering a low-SNR basin.
- Use quantization-aware cost models that include expected performance loss from SNR reduction; compute savings must exceed expected revenue loss from degraded performance.

Limitations to consider when applying economically:
- The law is empirical with fitted constants; transferring fits across architectures, datasets, or optimization setups requires caution.
- The noise model assumes AWGN-like behavior and particular functional forms for noise terms; domain-specific noise (e.g., label bias, distribution shifts) may require refinements.
- Primary evaluations used wikitext2 and selected SFT tasks; behavior on other downstream tasks or fully productionized pipelines may differ.

Summary: The Shannon Scaling Law gives AI decision-makers a principled, empirically validated way to predict when scale will pay off and when it will backfire. Incorporating noise and SNR in cost-benefit analyses can materially improve allocation of compute, data, and hardware precision investments.

Assessment

Paper Type: theoretical

Evidence Strength: medium. The theory is backed by quantitative fits (high R^2) and successful extrapolation to an unseen 12B model up to 307B tokens and by experiments under multiple perturbations (Gaussian noise, quantization, fine-tuning). However, empirical validation is limited to a small set of model families (Pythia, OLMo2), a restricted parameter/token range for fitting, and specific noise models and tasks, leaving open risks of overfitting to these settings and uncertainty about broader applicability.

Methods Rigor: medium. The paper provides a principled theoretical derivation grounded in information theory and systematically evaluates predictive performance (including extrapolation) and perturbation scenarios, but it relies on modeling assumptions (e.g., the chosen noise model, mapping of parameters to bandwidth), fits on a limited set of architectures/datasets, and does not fully explore alternative specifications, optimizer/regularization effects, or a wide diversity of training regimes and hardware-induced quantization behaviors.

Sample: Empirical fits use Pythia models up to 6.9B parameters trained on up to 180B tokens (used for fitting); validation includes an unseen 12B Pythia model evaluated up to 307B tokens; additional experiments on OLMo2. Perturbations tested include injected Gaussian noise, various quantization schemes, and supervised fine-tuning on math, QA and code tasks; performance evaluated via loss curves and predictive R^2 against classical and perturbation-aware scaling baselines.

Themes: innovation adoption

Identification: No causal identification in the experimental sense; the paper develops a theoretical mapping (model parameters -> channel bandwidth, training tokens -> signal power) based on the Shannon-Hartley theorem and tests this mapping by fitting the proposed scaling law to observed loss curves and evaluating out-of-sample predictive performance across models and perturbations.

Generalizability:
- Validated on limited model families (Pythia and OLMo2); may not generalize to substantially different architectures or pretraining corpora.
- Fitted on models up to 6.9B and extrapolated to 12B; uncertain behavior for much larger models (100B+ parameters) or different optimization/regimes.
- Relies on specific noise model assumptions (e.g., Gaussian noise, mapping of params to bandwidth) that may not capture all sources of training or deployment noise.
- Perturbation types and downstream tasks (math, QA, code) are limited; other tasks or real-world deployment conditions may show different dynamics.
- Hardware- and implementation-specific quantization effects could differ from the quantization models used in experiments.

Claims (7)

Claim	Direction	Confidence	Outcome	Details
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. Other	negative	high	ability of prior scaling laws to explain non-monotonic performance phenomena (e.g., catastrophic overtraining, quantization-induced degradation)	0.12
We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem, mapping model parameters to channel bandwidth and training tokens to signal power. Other	mixed	high	conceptual modeling of LLM training dynamics as information transmission (theoretical fit/expressiveness)	0.02
This Shannon perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. Other	negative	high	performance vs. scale behavior (transition from monotonic improvement to U-shaped degradation due to SNR effects)	0.12
We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. Other	positive	high	empirical behavior of models under perturbations (robustness and fit to the proposed scaling law) across tasks	0.12
The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong R^2 scores and accurately capturing loss basins missed by prior approaches. Other	positive	high	goodness-of-fit (R^2) to observed loss/ performance curves and ability to capture loss basins	0.12
Fitted on ≤6.9B Pythia models with ≤180B tokens, the Shannon Scaling Law predicts an unseen 12B model up to 307B tokens at pooled R^2=0.847, while monotonic baselines collapse. Other	positive	high	extrapolative predictive performance measured by pooled R^2 when predicting loss/performance for an unseen 12B model up to 307B tokens	pooled R^2=0.847 0.12
Monotonic baselines collapse when extrapolating beyond the training regime (e.g., predicting a 12B model up to 307B tokens) whereas the Shannon Scaling Law remains predictive. Other	negative	high	extrapolative predictive failure/success of baseline vs proposed scaling laws	0.12