Sign-based optimizers can be provably d-times more efficient than SGD when gradient noise is coordinate-sparse and the problem satisfies l_infty-smoothness, and the matrix extension Muon preserves this scaling; a 124M-parameter GPT-2 pretraining run exhibits the predicted faster convergence.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

Hongyi Tao, Dingzhi Yu, Lijun Zhang · May 07, 2026

arxiv theoretical medium evidence 7/10 relevance Source PDF

Under l1-stationarity, l_infty-smoothness, and separable (sparse) noise, SignSGD (and its matrix variant Muon) attains provably better complexity than SGD—reducing complexity by a factor of d—with matched upper/lower bounds, and a GPT-2 pretraining run shows faster convergence consistent with the theory.

Sign-based optimization algorithms, such as SignSGD and Muon, have garnered significant attention for their remarkable performance in training large foundation models. Despite this empirical success, we still lack a theoretical understanding of when and why these sign-based methods outperform vanilla SGD. The core obstacle is that under standard smoothness and finite variance conditions, SGD is known to be minimax optimal for finding stationary points measured by $\ell_2$-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. To overcome this barrier, we analyze sign-based optimizers leveraging $\ell_1$-norm stationarity, $\ell_\infty$-smoothness, and a separable noise model, which can better capture the coordinate-wise nature of signed updates. Under this distinct problem geometry, we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Specifically, we compare the \emph{upper bound of SignSGD} with the \emph{lower bound of SGD}, illustrating that SignSGD effectively reduces the complexity by a factor of $d$ under \emph{sparse noise}, where $d$ is the problem dimension. Furthermore, we elevate this framework to the matrix domain, providing an equivalent optimal lower bound for the Muon optimizer, proving that extending the sign operator to matrices preserves this optimal scaling with dimensionality. Finally, we bridge our theoretical bounds to practice, demonstrating that the theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model.

Summary

Main Finding

SignSGD (and its matrix analogue Muon) can provably outperform vanilla SGD when we change the optimization geometry from the usual ℓ2-based setup to an ℓ∞-geometry with coordinate-wise (separable) noise and measure stationarity in the ℓ1 (or nuclear) norm. Under ℓ∞-smoothness and separable variance, the authors derive matched upper and lower bounds showing SignSGD attains a strictly better dimension dependence than SGD when noise is sparse or highly heterogeneous — in extreme cases yielding an O(d) improvement (d = problem dimension). The framework extends to matrices (Muon) with analogous optimal scaling.

Key Points

Geometry shift: Replace standard ℓ2-smoothness + ℓ2-stationarity with:
- ℓ∞-smoothness (bounds based on max coordinate displacement),
- separable (coordinate-wise) noise model (variance vector σ),
- ℓ1-stationarity (natural for sign-based updates).
Tight rates for SignSGD:
- With total sample complexity N = BT, L∞ the ℓ∞ smoothness constant, and Δ = f(x0) − inf f, E[min_t ∥∇f(xt)∥1] = Θ( sqrt( L∞Δ / N ) + ( (∥σ∥1^2 L∞Δ / N) )^{1/4} ).
- The paper proves both matching upper and lower bounds for SignSGD under these assumptions.
SGD lower bound under same setting:
- For vanilla SGD (B=1, N=T), they show E[min_t ∥∇f(xt)∥1] = Ω( sqrt( d L∞Δ / N ) + ( d ∥σ∥2^2 L∞Δ / N )^{1/4} ), i.e., an extra factor ~d enters the complexity in worst-case constructions.
Dimensional gain condition:
- Define noise-density ϕ(σ) = ∥σ∥1^2 / (d ∥σ∥2^2) ∈ [1/d, 1].
- When ϕ(σ) is small (noise concentrated in a few coordinates, i.e., sparse/heterogeneous noise), SignSGD reduces dimension dependence and can give up to a factor d improvement over SGD.
Matrix extension (Muon):
- By lifting the same ℓ∞/separable intuition to matrix spectral/nuclear norms, the paper derives matching upper/lower bounds for Muon (nuclear-norm stationarity) and shows the same favorable dimensional scaling carries over to matrix-structured parameters.
Empirical validation:
- Theoretical predictions align with experiments: pretraining a 124M-parameter GPT-2 model demonstrates faster convergence of SignSGD in the regime analyzed. Code released: https://github.com/Dingzhen230/SignSGD_Outperforms_SGD

Data & Methods

Assumptions
- Assumption 1: f bounded below; Δ = f(x0) − inf f finite.
- Assumption 2 (Separable noise): stochastic gradients are unbiased per mini-batch; coordinate-wise conditional variances are bounded by σ_i^2 independently.
- Assumption 3 (ℓ∞-smoothness): |f(y) − (f(x) + ⟨∇f(x), y−x⟩)| ≤ (L∞/2) ∥y−x∥_∞^2.
- (Also discuss separable smoothness as a related assumption: vector L with ∥L∥1 = L∞.)
Algorithms studied
- SignSGD: xt+1 = xt − η sign(gt).
- Muon: matrix analogue using msign(G) = U V^⊤ for G = UΣV^⊤ (approximated in practice by Newton–Schulz).
Theoretical contributions
- Upper bound for SignSGD: derive convergence in expected ℓ1-stationarity with explicit dependence on L∞, Δ, N and ∥σ∥1.
- Lower bound for SignSGD: construct hard (separable) instances and adversarial bimodal noise oracle to show the upper bound is tight (matching rate).
- Lower bound for SGD in same ℓ∞/separable setting: adapt coordinate-wise hard instances and optimize curvature allocation to show SGD must incur a d factor in worst-case.
- Matrix setting: lift constructions to matrices (spectral smoothness / nuclear-norm stationarity) and produce matching upper and lower bounds for Muon.
Proof techniques (sketch)
- Dimensional decomposition: construct separable objective f(x) = Σ p_i(x_i) to reduce to d independent 1D problems.
- Resisting oracle: in each coordinate, design functions with persistent slope that force stall unless many queries are used.
- Adversarial bimodal noise: design unbiased noise distribution that preserves coordinate variance but delays progress.
- Adversarial allocation of curvature and Δ across coordinates to maximize the SGD lower bound and exhibit the d-dependence.
Experiments
- Pretraining behavior measured on 124M GPT-2 model; observed faster convergence for SignSGD consistent with theory. Implementation and details available in repo.

Implications for AI Economics

Training cost and time: When gradient noise is sparse/heterogeneous (empirically common in deep networks), SignSGD-like methods can reduce the effective sample/iteration complexity by up to a factor of d, implying substantial compute and wall-clock savings in pretraining large models. This lowers monetary and energy costs for organizations that operate in regimes matching the paper’s assumptions.
Hardware and precision choices: Sign-based methods are naturally compatible with low-precision (1-bit sign) communication and computation; the theoretical advantage under separable noise strengthens the case for investing in hardware and communication systems optimized for sign/quantized operations.
Distributed training and communication: The results provide a theoretical justification for adopting sign-based updates in large distributed setups (reduced bandwidth, robustness to coordinate-wise noise), which can change cost models and network design decisions for large-scale training clusters.
Algorithmic choice and specialization: The paper identifies precise problem characteristics (ℓ∞-geometry, separable noise, ℓ1-optimality) where sign-based optimizers dominate. Firms and researchers should profile gradient-noise structure in their workloads to decide when to adopt SignSGD/Muon rather than defaulting to SGD/Adam-family methods — this can be a high-leverage, low-cost optimization of training pipelines.
Market and competitive dynamics: Teams that exploit sign-based optimizers in appropriate regimes can gain cost and speed advantages, potentially accelerating model iteration cycles and lowering barriers to large model training for actors with limited compute budgets.
Investment and R&D prioritization: The work suggests value in investing in:
- Tools for measuring coordinate-wise noise heterogeneity in real training jobs,
- Engineering deterministic mechanisms (like Muon approximations) and implementations optimized for matrix-sign operations,
- Low-precision interconnects and sign-friendly accelerators.
Limitations and risk for economic decisions
- Applicability depends on assumptions (ℓ∞-smoothness, separable noise, ℓ1-stationarity). In settings close to canonical ℓ2-based assumptions or with homogeneous noise, SGD remains minimax-optimal; switching blindly could hurt performance.
- Worst-case constructions show SGD can still be better in different geometries — firms must empirically verify noise sparsity before adopting sign-based optimizers at scale.
- Potential shifts in who benefits: organizations with strong expertise in diagnosing and tuning for noise structure may capture larger gains, affecting competitive balance.

Overall, the paper gives a rigorous, instance-aware theoretical foundation for when sign-based optimization is genuinely advantageous and connects those conditions to practical pretraining evidence — a useful guide for cost-sensitive decisions in large-scale AI development.

Assessment

Paper Typetheoretical Evidence Strengthmedium — The paper gives strong theoretical guarantees (matched upper and lower bounds) within a clearly specified mathematical framework, which is convincing for the stated problem class; however the key assumptions (l_infty smoothness, l1-stationarity metric, separable/sparse noise) are nonstandard and their fit to real-world deep-learning loss landscapes is only partially supported by a single empirical pretraining experiment, limiting external validity. Methods Rigorhigh — Rigor is high: the authors provide matched upper and lower complexity bounds, extend results from vectors to matrices, and formally characterize the regime (sparse coordinate-wise noise) where sign-based methods gain a factor d; proofs plus a targeted empirical check indicate careful methodology, though empirical scope is limited. SampleTheoretical analyses consider d-dimensional nonconvex optimization under l_infty-smoothness and a separable (coordinate-wise) noise model, and a matrix analog for Muon; empirical component consists of a pretraining run of a 124M-parameter GPT-2 model comparing SignSGD/Muon to SGD (dataset and full training details not specified in the summary). Themesproductivity innovation IdentificationDerive formal upper bounds for SignSGD (and Muon) under a problem class defined by l1-stationarity, l_infty-smoothness, and a separable (coordinate-wise) noise model; derive matching lower bounds for SGD (and corresponding matrix lower bounds) and compare SignSGD's upper bound to SGD's lower bound to establish regimes where SignSGD provably outperforms SGD; validate the theoretical prediction with an empirical pretraining run on a 124M-parameter GPT-2. GeneralizabilityRelies on nonstandard assumptions (l_infty-smoothness and l1-stationarity) that may not hold across typical deep-learning loss surfaces, Separable / sparse noise model may not reflect gradient noise in all architectures, tasks, or batch regimes, Empirical validation limited to one model size (124M GPT-2) and unspecified pretraining data/hyperparameter sweeps, Worst-case lower bounds and constructed problem classes may not capture average-case practical behavior, Matrix extension proofs may assume structure (e.g., separability across entries) that fails for some multi-parameter layers

Claims (6)

Claim	Direction	Confidence	Outcome	Details
Under standard smoothness and finite variance conditions, SGD is minimax optimal for finding stationary points measured by l2-norms, thereby fundamentally precluding any complexity gains for sign-based methods in standard settings. Other	null_result	high	minimax optimality for finding l2-norm stationary points (optimization complexity / worst-case lower bound)	0.2
By analyzing sign-based optimizers under l1-norm stationarity, l_infty-smoothness, and a separable noise model, we can better capture the coordinate-wise nature of signed updates and overcome the barrier that prevents sign-based methods from outperforming SGD in standard settings. Other	positive	high	applicability of sign-based optimizer analysis and potential for improved convergence (theoretical conditions under which improvements are possible)	0.12
Under this distinct problem geometry (l1-stationarity, l_infty-smoothness, separable noise), we derive matched upper and lower bounds for SignSGD and explicitly characterize the problem class in which SignSGD provably dominates SGD. Other	positive	high	convergence bounds (upper and lower) for SignSGD under specified assumptions	0.2
SignSGD effectively reduces the complexity by a factor of d under sparse noise, where d is the problem dimension (comparison of SignSGD upper bound with SGD lower bound shows a factor-d improvement). Other	positive	high	optimization complexity (iterations/queries) to reach l1-stationarity under sparse noise	reduces complexity by a factor of d 0.2
Extending the sign operator to matrices preserves the optimal scaling with dimensionality: we provide an equivalent optimal lower bound for the Muon optimizer in the matrix domain. Other	positive	high	optimal lower bound / scaling with dimensionality for matrix sign-based optimization (Muon)	preserves optimal scaling with dimensionality 0.2
The theoretical superiority of SignSGD accurately predicts its faster convergence during the pretraining of a 124M parameter GPT-2 model. Other	positive	high	empirical optimization/convergence speed during GPT-2 pretraining	0.12