Model fine-tuning: fast, memory-efficient, and good enough.

Why it’s good:

  • We can swap out the LoRA weights for different downstream tasks while keeping the same frozen base model.
  • Trains fast and needs much less memory, since only the small low-rank matrices get gradients and optimizer state.
  • No added inference latency: the low-rank update can be merged into the base weights before serving (see the sketch after this list).
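
The last two bullets are easy to check numerically: the LoRA update is just a low-rank matrix added on top of the frozen weight, so it can be folded in once before serving, and per-task adapters simply swap in different $B, A$ pairs against the same $W_0$. A tiny self-contained sketch; the names and the $\alpha/r$ scaling follow the paper excerpt quoted below, the sizes are made up, and $B$ is random here (rather than zero-initialized) only so the check is non-trivial:

```python
import torch

# Toy check of the "no inference latency" claim.
d, r, alpha = 64, 4, 8
W0 = torch.randn(d, d, dtype=torch.float64)   # frozen pre-trained weight
A = torch.randn(r, d, dtype=torch.float64)    # trainable low-rank factors (r << d)
B = torch.randn(d, r, dtype=torch.float64)
x = torch.randn(d, dtype=torch.float64)

h_adapted = W0 @ x + (alpha / r) * (B @ (A @ x))   # forward pass during fine-tuning
W_merged = W0 + (alpha / r) * (B @ A)              # fold the update in once for deployment
assert torch.allclose(h_adapted, W_merged @ x)     # identical output, single matmul at inference
```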

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Note both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, our modified forward pass yields:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

We illustrate our reparametrization in Figure 1. We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training. We then scale $\Delta W x$ by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate if we scale the initialization appropriately.

lora, page 4
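
A minimal PyTorch sketch of that reparametrization, wrapping a single frozen nn.Linear. The class and argument names are mine, and the exact Gaussian init scale for $A$ is a simplification rather than the reference implementation's choice:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-adapted linear layer: h = W0 x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base                            # pre-trained W0 (+ bias), frozen
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A: random Gaussian, B: zeros, so Delta W = B A is zero at step 0.
        self.A = nn.Parameter(torch.randn(r, d_in) / math.sqrt(r))
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r                    # scale Delta W x by alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank path, summed coordinate-wise.
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

Because $B$ starts at zero, the wrapped layer reproduces the frozen model exactly at the start of training, and after fine-tuning the product `scaling * B @ A` can be added into `base.weight` once, which is why there is no extra inference latency.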

Now, applying this to the Transformer:

We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks) both for simplicity and parameter-efficiency.

lora, page 5
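
A sketch of that restriction, reusing the LoRALinear class above: wrap only the attention query/value projections and leave everything else, including the MLP blocks, frozen. The "q_proj" / "v_proj" attribute names are an assumption about the model implementation (they match many Hugging Face attention modules), not something the paper prescribes:

```python
import torch.nn as nn

def add_lora_to_attention(model: nn.Module, r: int = 4, alpha: int = 8) -> nn.Module:
    # Freeze every pre-trained parameter: MLPs, embeddings, norms, attention.
    for p in model.parameters():
        p.requires_grad = False

    # Collect the query/value projections first, then swap them for LoRA wrappers.
    targets = [
        (parent, name, child)
        for parent in model.modules()
        for name, child in parent.named_children()
        if isinstance(child, nn.Linear) and name in ("q_proj", "v_proj")
    ]
    for parent, name, child in targets:
        setattr(parent, name, LoRALinear(child, r=r, alpha=alpha))
    return model
```

After this, the only parameters with requires_grad=True are the small $A$ and $B$ matrices, which is where the training-speed and memory savings in the bullets above come from.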

Note that putting all the parameters in $\Delta W_q$ or $\Delta W_k$ results in significantly lower performance, while adapting both $W_q$ and $W_v$ yields the best result. This suggests that even a rank of four captures enough information in $\Delta W$ such that it is preferable to adapt more weight matrices than adapting a single type of weights with a larger rank.

lora, page 10
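
To make the fixed-budget comparison concrete, assume square projections $W_q, W_v \in \mathbb{R}^{d \times d}$ (an illustration, not the paper's exact accounting). Each adapted matrix costs $rd + rd = 2rd$ trainable parameters, so rank 8 on $W_q$ alone and rank 4 on both $W_q$ and $W_v$ spend the same budget:

$$\underbrace{2 \cdot 8 \cdot d}_{r=8,\ W_q\ \text{only}} \;=\; 16d \;=\; \underbrace{2 \cdot 4 \cdot d + 2 \cdot 4 \cdot d}_{r=4,\ W_q\ \text{and}\ W_v}$$

Halving the rank pays for adapting a second weight type, and the quote above says that trade is worth making.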

Well, $\Delta W_q$ or $\Delta W_k$ shouldn’t really matter here.

A low LoRA rank, even 1 or 2, is already pretty good. This can be further understood by looking at subspace similarity: the top singular directions are the most useful, and the remaining directions likely capture mostly random noise.

First, $\Delta W$ has a stronger correlation with $W$ compared to a random matrix, indicating that $\Delta W$ amplifies some features that are already in $W$. Second, instead of repeating the top singular directions of $W$, $\Delta W$ only amplifies directions that are not emphasized in $W$. Third, the amplification factor is rather huge.

lora, page 12
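
A sketch of how one could reproduce both analyses with plain SVD. The function names are mine, and the formulas are my reading of the paper's Section 7 (the normalized subspace-similarity measure $\lVert U_A^{i\top} U_B^{j}\rVert_F^2 / \min(i, j)$ and the projection of $W$ onto the top singular directions of $\Delta W$), so treat it as illustrative rather than the authors' code:

```python
import torch

def subspace_similarity(A: torch.Tensor, B: torch.Tensor, i: int, j: int) -> float:
    """Normalized overlap between the top-i and top-j right-singular directions
    of A and B (both matrices must have the same number of columns).
    1 means the subspaces coincide, 0 means they are orthogonal."""
    _, _, Vh_A = torch.linalg.svd(A.double(), full_matrices=False)
    _, _, Vh_B = torch.linalg.svd(B.double(), full_matrices=False)
    overlap = Vh_A[:i, :] @ Vh_B[:j, :].T       # (i, j) matrix of cosines between directions
    return (overlap.norm(p="fro") ** 2 / min(i, j)).item()

def amplification_factor(delta_W: torch.Tensor, W: torch.Tensor, r: int) -> float:
    """||delta_W||_F divided by the Frobenius norm of W projected onto the
    top-r singular directions of delta_W: a large value means the update
    strongly amplifies directions that W itself barely uses."""
    U, _, Vh = torch.linalg.svd(delta_W.double(), full_matrices=False)
    projected = U[:, :r].T @ W.double() @ Vh[:r, :].T
    return (delta_W.double().norm(p="fro") / projected.norm(p="fro")).item()
```

Run the first function on the $A$ matrices learned with two different seeds (or two different ranks) to see that only the top one or two directions overlap meaningfully; run the second on $\Delta W = BA$ against the pre-trained $W$ to see the large amplification factor the quote describes.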