Disclaimer: I did not read the OG paper. This is from cs336_lecture_11.pdf, as well as some chat with Gemini 3 Pro.


This is a by-product of Greg Yang’s Tensor Programs series. It was first discussed in the paper “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer” and later given a more accessible treatment in “A Spectral Condition for Feature Learning”.

See the series.

You can’t train GPT-3 on a single GPU, much less tune its hyperparameters (HPs).

But what if I tell you…

…you can tune its HPs on a single GPU thanks to the theory developed in TP4?

Essentially, narrow and wide neural networks share the same set of optimal hyperparameters if they are in the maximal update parametrization (muP) derived in TP4 (but not if they use PyTorch’s default parametrization).

muP is based on the following assertions. As a function of the width n of the network:

A1: The activations at initialization should remain Θ(√n) in ℓ₂ norm (i.e., Θ(1) entry-wise).

A2: After one gradient step, the change in activations should also be Θ(√n).
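A quick numerical check of A1 (my own sketch, not from the lecture): with a width-n input whose entries are Θ(1) and a weight matrix with i.i.d. N(0, 1/n) entries (the usual fan-in scaling for hidden layers), the entry-wise RMS of the activations stays roughly 1 no matter the width.

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_rms(n: int) -> float:
    # Input with Theta(1) entries, i.e. ||x|| = Theta(sqrt(n)).
    x = rng.standard_normal(n)
    # Fan-in (variance 1/n) initialization for a square hidden layer.
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    h = W @ x
    return float(np.sqrt(np.mean(h**2)))  # entry-wise RMS of activations

for n in [128, 512, 2048, 8192]:
    print(n, round(activation_rms(n), 3))
# The RMS hovers around 1 for every width: the activations are
# Theta(sqrt(n)) in l2 norm, matching A1.
```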

What’s Θ(·)?

A tight bound, “exactly this scale”, unlike O(·), which is only an upper bound. So something that’s Θ(1) cannot be 1/√n, for example, as that vanishes as n grows.

Spectral Norm

From conversation with Claude Sonnet 4.6.

Back to SVD

Any matrix W can be decomposed as W = UΣVᵀ, where:

  • U — orthogonal matrix (output directions)
  • V — orthogonal matrix (input directions)
  • Σ — diagonal matrix with non-negative entries

Those diagonal entries are the singular values.

Every matrix is just “rotate → stretch → rotate”: Wx = U(Σ(Vᵀx)).

The singular values are the stretch factors along each axis. So:

  • σ₁ (the largest) = the most any direction gets stretched → that’s the spectral norm ‖W‖₂
  • σₘᵢₙ (the smallest) = the most any direction gets squished

And the spectral norm ‖W‖₂ is just the largest singular value: it directly measures the maximum amplification a layer applies to any input direction.
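A small numpy check of the above (my own, not from the chat): the spectral norm, the top singular value, and the largest amplification over input directions all coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 30))

# The spectral norm is the largest singular value.
sigma = np.linalg.svd(W, compute_uv=False)  # singular values, descending
spec = np.linalg.norm(W, ord=2)
assert np.isclose(spec, sigma[0])

# It upper-bounds the amplification ||Wx|| / ||x|| of every direction...
xs = rng.standard_normal((1000, 30))
amps = np.linalg.norm(xs @ W.T, axis=1) / np.linalg.norm(xs, axis=1)
assert amps.max() <= spec + 1e-9

# ...and the bound is attained by the top right singular vector.
_, _, Vt = np.linalg.svd(W)
assert np.isclose(np.linalg.norm(W @ Vt[0]), spec)
```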

The paper shows that training can be stable if each layer’s spectral norm satisfies ‖Wₗ‖₂ = Θ(√(n_out / n_in)), with the weight updates obeying the same scaling, ‖ΔWₗ‖₂ = Θ(√(n_out / n_in)), and then some more derivations.
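One direct way to hit this target at initialization (a sketch of my own, not the paper’s exact recipe) is to draw a random matrix and rescale it so its spectral norm is exactly √(n_out / n_in):

```python
import numpy as np

def spectral_init(n_out: int, n_in: int, seed: int = 0) -> np.ndarray:
    """Random matrix rescaled so that ||W||_2 = sqrt(n_out / n_in)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_out, n_in))
    target = np.sqrt(n_out / n_in)
    return W * (target / np.linalg.norm(W, ord=2))

W = spectral_init(1024, 256)
print(np.linalg.norm(W, ord=2))  # 2.0 == sqrt(1024 / 256)
```

In practice muP achieves the same scaling cheaply by choosing per-layer init variances and learning-rate multipliers as functions of fan-in/fan-out, rather than by explicit spectral normalization.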

In practice

Cerebras GPT uses this and offers nice tables.

I find this implementation quite interesting: ezmup. It works differently from the official implementation:

First, in your project, change the config so that some weird, large-enough prime number represents the varying width. By “weird”, I mean a number that does not appear in any of your other model-shape hyperparameters. 47 is such a number.
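To illustrate the idea (a hypothetical sketch of my own, not ezmup’s actual API): any dimension divisible by the marker prime is treated as “width-like” and rescaled to the target width, while everything else is left untouched.

```python
MARKER = 47  # the "weird" prime standing in for the varying width

def scale_shape(shape, width, marker=MARKER):
    """Replace each marker-divisible dim d with (d // marker) * width."""
    return tuple((d // marker) * width if d % marker == 0 else d
                 for d in shape)

# Config written in units of the marker prime:
cfg = {"d_model": 2 * MARKER, "d_head": 64, "d_ff": 8 * MARKER}

# Instantiate a small proxy model and the big target model:
small = {k: scale_shape((v,), 128)[0] for k, v in cfg.items()}
big = {k: scale_shape((v,), 4096)[0] for k, v in cfg.items()}
print(small)  # {'d_model': 256, 'd_head': 64, 'd_ff': 1024}
print(big)    # {'d_model': 8192, 'd_head': 64, 'd_ff': 32768}
```

Because 47 divides nothing else in the config, this detection can never mistake a fixed hyperparameter (like a head dimension of 64) for the width.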

What’s it not robust to

  • Exotic optimizers
  • (strong) weight decay