
This paper basically eliminates the two-stage training in Pi 0.5. Remember how we needed to add the action expert and switch the loss target partway through training? With this approach you don't need to do that anymore.
Naively co-training both would lead to worse performance:
prior approaches for finetuning VLMs with continuous outputs can, perhaps unsurprisingly, lead to significantly worse training dynamics, as they rely on gradients from continuous adapters (e.g. diffusion heads) for the training signal. This can degrade both their ability to interpret language commands and the overall performance of the resulting VLA policy.
And simply freezing the VLM is not a good idea either:
However, current VLMs are not pre-trained with robotics data. As a result, their representations, when frozen, are insufficient for training highly performant policies, as we show in our experiments
Solution
Therefore, we propose to stop the gradient flow from the action expert to the pre-trained weights in the model. This is a sensible restriction if and only if the backbone is additionally trained to predict actions directly as part of its language outputs.
The new loss:

$$\mathcal{L} = \mathbb{E}\Big[\, m_{\text{lang}}\,\mathcal{L}_{\text{LM}} \;+\; \alpha\, m_{\text{act}}\,\mathcal{L}_{\text{FM}} \,\Big],$$

where $\alpha$ is a loss multiplier, trading off action prediction via flow-matching against the standard language modeling loss, $m_{\text{lang}}$ is a language loss mask (indicating locations in the token stream at which the language loss should be applied), and $m_{\text{act}}$ is an action mask indicator specifying whether or not actions should be predicted for the given example.
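A minimal sketch of how this masked, weighted objective could be computed per batch (PyTorch-style; the function name, tensor shapes, and argument layout are my own assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def combined_loss(lang_logits, lang_targets, lang_mask,
                  v_pred, v_target, act_mask, alpha=1.0):
    """Masked language-modeling loss + weighted flow-matching loss (sketch).

    lang_mask: (B, T) 0/1 mask, 1 where the cross-entropy applies.
    act_mask:  (B,)   0/1 mask, 1 for examples with continuous action targets.
    alpha:     the loss multiplier trading off the two terms.
    """
    # Token-level cross-entropy over the language (and discretized action) targets.
    ce = F.cross_entropy(lang_logits.flatten(0, 1), lang_targets.flatten(),
                         reduction="none").view_as(lang_mask)
    lang_loss = (lang_mask * ce).sum() / lang_mask.sum().clamp(min=1)

    # Flow-matching regression on the action expert's velocity prediction.
    fm = ((v_pred - v_target) ** 2).mean(dim=(-1, -2))   # one value per example
    act_loss = (act_mask * fm).sum() / act_mask.sum().clamp(min=1)

    return lang_loss + alpha * act_loss
```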
Compared with Pi 0.5, the main difference is the masking.
This loss construction allows us to flexibly mix-and-match co-training with data from different modalities. In particular, we combine VLM data (which has only images and text annotations) with action-only data (where the task is action prediction conditioned on images and text) as well as combined language and action prediction tasks (where we take action only data and additionally annotate it with a language description of what the robot should do next)
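To make the mix-and-match concrete, here is one plausible way the three data sources could map onto the loss masks above (my own sketch of the idea; `discretize_actions` and the exact target layout are hypothetical, not taken from the paper):

```python
def discretize_actions(actions):
    # Stand-in for an action tokenizer (e.g. simple binning); illustrative only.
    return " ".join(f"<act_{round(float(a), 2)}>" for a in actions)

def build_example(source, images, text_in, text_out=None,
                  actions=None, subtask_text=None):
    """Map one raw example onto language targets and loss masks (hypothetical)."""
    if source == "vlm":
        # Web VLM data: images + text annotations only -> language loss only.
        return dict(inputs=(images, text_in), lang_targets=text_out,
                    lang_mask=1.0, act_mask=0.0)
    if source == "action_only":
        # Robot data: the backbone predicts discretized action tokens as part of
        # its language output, the action expert regresses continuous actions.
        return dict(inputs=(images, text_in),
                    lang_targets=discretize_actions(actions),
                    lang_mask=1.0, act_mask=1.0)
    if source == "action_with_language":
        # Robot data additionally annotated with what the robot should do next.
        return dict(inputs=(images, text_in),
                    lang_targets=subtask_text + " " + discretize_actions(actions),
                    lang_mask=1.0, act_mask=1.0)
    raise ValueError(f"unknown source: {source}")
```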
Pi 0.5 has a two-stage training where, in the second stage, it sets the loss multiplier to a specific value and trains both the action expert and the VLA backbone; that value is chosen to focus training on the action expert while trying not to corrupt the VLM backbone too much. Here we use a better method, so we can still let the backbone adapt to robot data without corrupting it. (If we froze it after pre-training, it would learn less.)
For the single-head attention case, we can write the attention operation as

$$A = \mathrm{softmax}\!\left(\frac{(x W_Q)(x W_K)^\top}{\sqrt{d}} + M\right),$$

where $x$ are the inputs to the attention layer, $W_Q, W_K$ are the attention query and key projections, respectively, $M$ is the attention mask as described above, and softmax is the row-wise softmax. The result $A$ are attention probabilities over token features, which decompose into probabilities $A_{VV}$ where features from the VLM backbone attend to features from the backbone, probabilities $A_{AV}$ for action expert features attending to backbone features, and probabilities $A_{AA}$ for action expert features attending to other action expert features. Given this, we can restrict information flow as desired by implementing the softmax computation as

$$A_{VV} = \mathrm{softmax}\!\left(\frac{(x_V W_Q)(x_V W_K)^\top}{\sqrt{d}} + M_{VV}\right),$$

$$[A_{AV},\, A_{AA}] = \mathrm{softmax}\!\left(\frac{\big[\,(x_A \tilde{W}_Q)\,\mathrm{sg}(x_V W_K)^\top,\;\; (x_A \tilde{W}_Q)(x_A \tilde{W}_K)^\top\,\big]}{\sqrt{d}} + [M_{AV},\, M_{AA}]\right),$$

where sg denotes the stop-gradient operator that restricts gradient flow through this part of the computation, $x_V$ corresponds to all tokens processed with the backbone weights $W$, and $x_A$ to the tokens processed with the action expert weights $\tilde{W}$. The value embeddings are then computed by

$$v_V = x_V W_V, \qquad v_A = x_A \tilde{W}_V,$$

and the final attention is

$$y_V = A_{VV}\, v_V, \qquad y_A = A_{AV}\,\mathrm{sg}(v_V) + A_{AA}\, v_A.$$

One additional advantage of this design is that we can simply set $\alpha = 1.0$ in the loss above (Eq. 4 in the paper), since now the diffusion loss term applies to an independent set of weights.
In other words, because we cut the gradient on the backbone features used as keys and values when the action expert attends to them, the action expert's flow-matching loss never reaches the backbone weights: the stop-gradient blocks the only pathway (the action-to-backbone attention over K/V) through which it could flow. The backbone still adapts to robot data, but only through the discrete action tokens in its language-modeling loss.
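A minimal single-head sketch of this insulated attention (PyTorch-style; the weight names, shapes, the additive mask applied only to the action-expert rows, and the 1/sqrt(d) scaling are my assumptions rather than the paper's exact formulation):

```python
import math
import torch

def insulated_attention(x_v, x_a, wq_v, wk_v, wv_v, wq_a, wk_a, wv_a, mask_a):
    """Single-head attention where the action expert reads the backbone's
    keys/values through a stop-gradient (sketch, not the authors' code).

    x_v: backbone token features        (B, Tv, D)
    x_a: action-expert token features   (B, Ta, D)
    w*_v / w*_a: projection weights of the backbone / action expert, (D, D)
    mask_a: additive attention mask for the action-expert rows (B, Ta, Tv+Ta)
    """
    d = x_v.shape[-1]

    # Backbone self-attention: unchanged, and it only ever receives gradients
    # from the language-modeling loss (the action expert never feeds into it).
    q_v, k_v, v_v = x_v @ wq_v, x_v @ wk_v, x_v @ wv_v
    a_vv = torch.softmax(q_v @ k_v.transpose(-1, -2) / math.sqrt(d), dim=-1)
    y_v = a_vv @ v_v

    # Action-expert rows: queries come from the action expert, but the backbone
    # keys/values are detached, so the flow-matching loss cannot reach the
    # pre-trained weights through this pathway.
    q_a, k_a, v_a = x_a @ wq_a, x_a @ wk_a, x_a @ wv_a
    k = torch.cat([k_v.detach(), k_a], dim=1)   # sg(.) on backbone keys
    v = torch.cat([v_v.detach(), v_a], dim=1)   # sg(.) on backbone values
    logits = q_a @ k.transpose(-1, -2) / math.sqrt(d) + mask_a
    a_row = torch.softmax(logits, dim=-1)       # [A_AV | A_AA]
    y_a = a_row @ v

    return y_v, y_a
```

Because `k_v` and `v_v` are detached only where the action expert consumes them, the backbone's own forward pass (and its language loss) is untouched, which is exactly what lets us keep a single joint forward pass instead of two training stages.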