The paper covers how to do a scaling-law analysis on a ViT to find compute-optimal model shapes.

"Our method involves both a functional form (Eq. 2 in the paper) and a novel procedure" (sovit, page 5).

The functional form

  • $x$: a shape dimension of the neural architecture, such as width, depth, or MLP size.
  • $t$: compute, such as GFLOPs.
  • $f(x, t)$: a performance metric of interest, such as downstream ImageNet 10-shot error rate. Specifically, $f(x, t)$ results from (pre-)training an architecture with shape dimension $x$ for a fixed compute budget $t$. We always assume that $f$ corresponds to a loss, meaning lower values are better.

For each dimension $x$, the paper argues that the functional form is

$$f(x, t) = \alpha x^{-a} + \beta x^{c}\, t^{-b} + \xi, \qquad \alpha, \beta, a, b, c > 0,\ \xi \ge 0,$$

where $\xi$ is the irreducible error. Here, $f$ focuses on the dimension $x$ alone and assumes that all other shape dimensions are sufficiently large that they do not constitute a bottleneck.
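As a concrete sketch, here is this functional form in code. The exact form is my reconstruction, and all parameter values are made up for illustration:

```python
# Sketch of the assumed single-dimension functional form
#   f(x, t) = alpha * x**-a + beta * x**c * t**-b + xi
# x: one shape dimension (e.g. MLP dim), t: compute (e.g. GFLOPs).
# The first term rewards a larger dimension; the second term penalizes it
# at a fixed compute budget; xi is the irreducible error.
# All parameter values here are illustrative, not fitted.
def f(x, t, alpha=1.0, a=0.5, beta=1.0, b=0.5, c=0.3, xi=0.1):
    return alpha * x**-a + beta * x**c * t**-b + xi
```

At a fixed budget $t$, $f$ is U-shaped in $x$: too small a dimension bottlenecks the model, too large a dimension starves it of training compute.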

Why?

"Our argument for this particular functional form is six-fold" (sovit, page 4).

I’ll omit those six arguments in this note.

Now let’s take the derivative of $f(x, t) = \alpha x^{-a} + \beta x^{c} t^{-b} + \xi$ w.r.t. $x$ and set it to zero to get the optimal shape:

$$\frac{\partial f}{\partial x} = -a\alpha x^{-a-1} + c\beta x^{c-1} t^{-b} = 0 \quad\Longrightarrow\quad x^\star(t) = \left(\frac{a\alpha}{c\beta}\right)^{\frac{1}{a+c}} t^{\frac{b}{a+c}}.$$

Thus $x^\star(t) \propto t^{s}$, and we call that exponent $s = \frac{b}{a+c}$.

We can then derive that the compute-optimal frontier itself follows a power law: plugging $x^\star(t)$ back in, both $x$-dependent terms scale as $t^{-\frac{ab}{a+c}}$, so $f(x^\star(t), t) - \xi \propto t^{-\frac{ab}{a+c}}$.
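A quick numerical sanity check of this closed form (with made-up parameter values, and with the functional form restated in the comment as an assumption): brute-force minimization over a log-spaced grid of $x$ should agree with $x^\star(t)$.

```python
# Check the closed form for the optimal shape under the assumed form
#   f(x, t) = alpha * x**-a + beta * x**c * t**-b + xi:
#   x*(t) = (a*alpha / (c*beta)) ** (1/(a+c)) * t**s,  with s = b / (a + c).
# All parameter values are illustrative.
alpha, a, beta, b, c, xi = 1.0, 0.5, 1.0, 0.5, 0.3, 0.1
s = b / (a + c)

def f(x, t):
    return alpha * x**-a + beta * x**c * t**-b + xi

def x_star(t):
    return (a * alpha / (c * beta)) ** (1 / (a + c)) * t**s

# brute-force minimization over a fine log-spaced grid of x in [1, 1e9)
t = 1e4
grid = [10 ** (i / 1000) for i in range(9000)]
x_num = min(grid, key=lambda x: f(x, t))
```

The grid minimizer lands on (a grid point next to) the analytic optimum, and doubling-checking the exponent: scaling $t$ by $100$ scales $x^\star$ by exactly $100^{s}$.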

The procedure

Star Sweep

Start from a single large model, use that as the star center, and then vary a single dimension at a time on an exponentially spaced grid, going down. In practice, they sweep width, depth, and MLP dim. They only go down to make sure the other dimensions do not form a bottleneck when estimating the parameters.

For example, for the MLP dim they used an exponentially spaced grid going down from the star value, with roughly a 20% step between successive grid points.
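A sketch of what such a downward, exponentially spaced grid could look like; the center value, step factor, and step count below are illustrative, not the paper's numbers:

```python
def star_sweep_grid(center, factor=1.2, steps=6):
    """Exponentially spaced grid stepping down from the star center's value
    for one dimension, shrinking by ~20% per step (factor=1.2)."""
    return [round(center / factor**k) for k in range(steps + 1)]

# e.g. star_sweep_grid(4096) steps one dimension down from an
# illustrative center value of 4096
```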

This process gives the scaling exponent for each dimension, i.e. the $t^{s}$ part of $x^\star(t) \propto t^{s}$.

Grid Sweep

Now we use small models. Grid search! Train a bunch of small configurations and find the one sitting on the Pareto front of compute vs. performance. This starting configuration fixes the leading coefficient of $x^\star(t)$.
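The Pareto-front selection can be sketched as follows; `pareto_front` and the point format are my own illustration, not the paper's code:

```python
def pareto_front(points):
    """points: list of (compute, error, config) tuples.
    Returns the non-dominated subset: a point is kept iff no other point has
    both compute <= and error <=, with at least one strictly lower."""
    front = []
    for p in points:
        dominated = any(
            q is not p
            and q[0] <= p[0] and q[1] <= p[1]
            and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

For example, a point with both higher compute and higher error than some other point is dropped, while the cheapest and the most accurate points always survive.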

Together

Now we have the scaling exponent for each dimension, and where to start. We can simply scale: in the paper, they scale all shape dimensions up jointly, distributing the compute budget equally across them.
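One way to realize this final step, assuming a fitted exponent $s_k$ per dimension from the star sweep and the small starting config from the grid sweep; all config values and exponents below are illustrative:

```python
# Scale each shape dimension as x_k(t) = x_k(t0) * (t / t0)**s_k when the
# compute budget grows from t0 to t. Values below are made up, not fitted.
def scale_shape(start, t0, t, exponents):
    return {k: round(v * (t / t0) ** exponents[k]) for k, v in start.items()}

start = {"width": 256, "depth": 8, "mlp_dim": 1024}        # illustrative small model
exponents = {"width": 0.25, "depth": 0.2, "mlp_dim": 0.3}  # illustrative s_k
bigger = scale_shape(start, t0=1.0, t=16.0, exponents=exponents)
```

Each dimension grows at its own rate: with a 16x budget, the width here doubles ($16^{0.25} = 2$) while the other dimensions grow more slowly or faster according to their exponents.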