The paper covers how to do a scaling-law analysis on a ViT to find compute-optimal model shapes.

"Our method involves both a functional form (Eq. 2 in the paper) and a novel procedure" (sovit, page 5).

The functional form

  • $x$: a shape dimension of the neural architecture, such as width, depth, or MLP size.
  • $t$: compute, such as GFLOPs.
  • $f(x, t)$: a performance metric of interest, such as downstream ImageNet 10-shot error rate. Specifically, $f(x, t)$ results from (pre-)training an architecture with shape dimension $x$ for a fixed compute budget $t$. We always assume that $f$ corresponds to a loss, meaning lower values are better.

For each dimension $x$, the paper argues that the functional form is

$$f(x, t) = \alpha x^{-a} + \beta x^{c}\, t^{-b} + \xi, \qquad \alpha, \beta, a, b, c > 0,\ \xi \ge 0,$$

where $\xi$ is the irreducible error. Here, $f$ focuses on the dimension $x$ alone and assumes that all other shape dimensions are sufficiently large that they do not constitute a bottleneck.
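As a concrete sketch, here is this functional form in code. The exact form is my reconstruction, and all parameter values are made up for illustration:

```python
# Sketch of the assumed single-dimension functional form
#   f(x, t) = alpha * x**-a + beta * x**c * t**-b + xi
# x: one shape dimension (e.g. MLP dim), t: compute (e.g. GFLOPs).
# The first term rewards a larger dimension; the second term penalizes it
# at a fixed compute budget; xi is the irreducible error.
# All parameter values here are illustrative, not fitted.
def f(x, t, alpha=1.0, a=0.5, beta=1.0, b=0.5, c=0.3, xi=0.1):
    return alpha * x**-a + beta * x**c * t**-b + xi
```

At a fixed budget $t$, $f$ is U-shaped in $x$: too small a dimension bottlenecks the model, too large a dimension starves it of training compute.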

Why?

"Our argument for this particular functional form is six-fold" (sovit, page 4).

I’ll omit those six arguments in this note.

Now let’s take the derivative of $f(x, t) = \alpha x^{-a} + \beta x^{c} t^{-b} + \xi$ w.r.t. $x$ and set it to zero to get the optimal shape:

$$\frac{\partial f}{\partial x} = -a\alpha x^{-a-1} + c\beta x^{c-1} t^{-b} = 0 \quad\Longrightarrow\quad x^\star(t) = \left(\frac{a\alpha}{c\beta}\right)^{\frac{1}{a+c}} t^{\frac{b}{a+c}}.$$

Thus $x^\star(t) \propto t^{s}$, and we call that exponent $s = \frac{b}{a+c}$.

We can then derive that the compute-optimal frontier itself follows a power law: plugging $x^\star(t)$ back in, both $x$-dependent terms scale as $t^{-\frac{ab}{a+c}}$, so $f(x^\star(t), t) - \xi \propto t^{-\frac{ab}{a+c}}$.
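A quick numerical sanity check of this closed form (with made-up parameter values, and with the functional form restated in the comment as an assumption): brute-force minimization over a log-spaced grid of $x$ should agree with $x^\star(t)$.

```python
# Check the closed form for the optimal shape under the assumed form
#   f(x, t) = alpha * x**-a + beta * x**c * t**-b + xi:
#   x*(t) = (a*alpha / (c*beta)) ** (1/(a+c)) * t**s,  with s = b / (a + c).
# All parameter values are illustrative.
alpha, a, beta, b, c, xi = 1.0, 0.5, 1.0, 0.5, 0.3, 0.1
s = b / (a + c)

def f(x, t):
    return alpha * x**-a + beta * x**c * t**-b + xi

def x_star(t):
    return (a * alpha / (c * beta)) ** (1 / (a + c)) * t**s

# brute-force minimization over a fine log-spaced grid of x in [1, 1e9)
t = 1e4
grid = [10 ** (i / 1000) for i in range(9000)]
x_num = min(grid, key=lambda x: f(x, t))
```

The grid minimizer lands on (a grid point next to) the analytic optimum, and doubling-checking the exponent: scaling $t$ by $100$ scales $x^\star$ by exactly $100^{s}$.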

The procedure

Star Sweep

Start from a single large model, use that as the star center, and then vary a single dimension at a time on an exponentially spaced grid, going down. In practice, they sweep width, depth, and MLP dim. They only go down to make sure the other dimensions do not form a bottleneck when estimating the parameters.

For example, for the MLP dim they used an exponentially spaced grid going down from the star value, with roughly a 20% step between successive grid points.
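A sketch of what such a downward, exponentially spaced grid could look like; the center value, step factor, and step count below are illustrative, not the paper's numbers:

```python
def star_sweep_grid(center, factor=1.2, steps=6):
    """Exponentially spaced grid stepping down from the star center's value
    for one dimension, shrinking by ~20% per step (factor=1.2)."""
    return [round(center / factor**k) for k in range(steps + 1)]

# e.g. star_sweep_grid(4096) steps one dimension down from an
# illustrative center value of 4096
```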

This process gives the scaling exponent for each dimension, i.e. the $t^{s}$ part of $x^\star(t) \propto t^{s}$.

Grid Sweep

Now we use small models. Grid search! Train a bunch of small configurations and find the one sitting on the Pareto front of compute vs. performance. This starting configuration fixes the leading coefficient of $x^\star(t)$.
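The Pareto-front selection can be sketched as follows; `pareto_front` and the point format are my own illustration, not the paper's code:

```python
def pareto_front(points):
    """points: list of (compute, error, config) tuples.
    Returns the non-dominated subset: a point is kept iff no other point has
    both compute <= and error <=, with at least one strictly lower."""
    front = []
    for p in points:
        dominated = any(
            q is not p
            and q[0] <= p[0] and q[1] <= p[1]
            and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

For example, a point with both higher compute and higher error than some other point is dropped, while the cheapest and the most accurate points always survive.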

Together

Now we have the scaling exponent for each dimension, and where to start. We can simply scale: in the paper, they scale all shape dimensions up jointly, distributing the compute budget equally across them.
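One way to realize this final step, assuming a fitted exponent $s_k$ per dimension from the star sweep and the small starting config from the grid sweep; all config values and exponents below are illustrative:

```python
# Scale each shape dimension as x_k(t) = x_k(t0) * (t / t0)**s_k when the
# compute budget grows from t0 to t. Values below are made up, not fitted.
def scale_shape(start, t0, t, exponents):
    return {k: round(v * (t / t0) ** exponents[k]) for k, v in start.items()}

start = {"width": 256, "depth": 8, "mlp_dim": 1024}        # illustrative small model
exponents = {"width": 0.25, "depth": 0.2, "mlp_dim": 0.3}  # illustrative s_k
bigger = scale_shape(start, t0=1.0, t=16.0, exponents=exponents)
```

Each dimension grows at its own rate: with a 16x budget, the width here doubles ($16^{0.25} = 2$) while the other dimensions grow more slowly or faster according to their exponents.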