Related: Flow Matching, Score Matching.

Vanilla Guidance

We will let the model take another input, the prompt $y$, during training. Instead of sampling only the output $z \sim p_{\text{data}}$, we sample both $z$ and $y$ jointly, $(z, y) \sim p_{\text{data}}(z, y)$. For example, when training the model we always provide the text “dog” with a dog image, “cat” with a cat image, and so on, so the model learns a conditional vector field $u_t(x \mid y)$.
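To make this concrete, here is a minimal sketch of conditional training in PyTorch, assuming a simple linear (rectified-flow style) probability path, integer class labels as the prompt, and a toy MLP. The network, shapes, and helper names are illustrative assumptions, not from the original text.

```python
import torch
import torch.nn as nn

class CondVectorField(nn.Module):
    """Toy conditional vector field u_t(x | y); the prompt y is an integer label."""
    def __init__(self, x_dim=2, y_embed_dim=8, num_classes=10, hidden=128):
        super().__init__()
        self.y_embed = nn.Embedding(num_classes, y_embed_dim)
        self.net = nn.Sequential(
            nn.Linear(x_dim + 1 + y_embed_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x, t, y):
        # Condition simply by concatenating the prompt embedding to the input.
        h = torch.cat([x, t, self.y_embed(y)], dim=-1)
        return self.net(h)

def training_step(model, opt, z, y):
    """One conditional flow-matching step on a paired batch (z, y)."""
    x0 = torch.randn_like(z)              # noise sample
    t = torch.rand(z.shape[0], 1)         # random time in [0, 1]
    xt = (1 - t) * x0 + t * z             # point on the linear path from noise to data
    target = z - x0                       # velocity target for this path
    loss = ((model(xt, t, y) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

At sampling time we integrate the learned $u_t(x \mid y)$ from noise to data exactly as in the unconditional case, just with the prompt held fixed.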

Classifier Guidance

With some Bayes' rule we can separate the guided score $\nabla \log p_t(x \mid y)$ into an unguided part and a guidance part. For example, with Gaussian probability paths we can convert the vector field to its score representation,

$$u_t(x \mid y) = a_t\, x + b_t\, \nabla \log p_t(x \mid y),$$

where the coefficients $a_t$ and $b_t$ depend only on the noise schedule.

Next, realize that $p_t(x \mid y)$ is a conditional density. Hence, we can use Bayes' rule to rewrite the guided score as

$$\nabla \log p_t(x \mid y) = \nabla \log \frac{p_t(y \mid x)\, p_t(x)}{p_t(y)} = \nabla \log p_t(x) + \nabla \log p_t(y \mid x),$$

where we used that the gradient is taken with respect to the variable $x$, so that $\nabla \log p_t(y) = 0$. We may thus rewrite

$$u_t(x \mid y) = \big(a_t\, x + b_t\, \nabla \log p_t(x)\big) + b_t\, \nabla \log p_t(y \mid x) = u_t(x) + b_t\, \nabla \log p_t(y \mid x).$$

Notice the shape of the above equation: the guided vector field is the unguided vector field plus a gradient of the likelihood of the guidance variable $y$. As people observed that their images did not fit their prompts well enough, it was a natural idea to scale up the contribution of the $\nabla \log p_t(y \mid x)$ term, yielding

$$\tilde{u}_t(x \mid y) = u_t(x) + w\, b_t\, \nabla \log p_t(y \mid x)$$

with a guidance scale $w > 1$.

Well, where do we get the $\nabla \log p_t(y \mid x)$ part? From another model, a classifier that predicts $y$ from the noisy input $x$, thus the name.
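In code, the guidance term is just an autograd gradient through that separately trained noisy classifier. Below is a sketch under the notation above; the interfaces `u_model(x, t)` and `classifier(x, t)`, the coefficient `b_t`, and the scale `w` are assumptions for illustration.

```python
import torch

def classifier_guided_velocity(u_model, classifier, x, t, y, w=3.0, b_t=1.0):
    """Guided velocity: unguided u_t(x) plus w * b_t * grad_x log p_t(y | x).

    Assumed interfaces: u_model(x, t) returns the unconditional vector field,
    classifier(x, t) returns class logits for noisy inputs.
    """
    x = x.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x, t), dim=-1)
    log_p_y = log_probs.gather(1, y.view(-1, 1)).sum()
    grad_log_p_y = torch.autograd.grad(log_p_y, x)[0]   # grad_x log p_t(y | x)
    return u_model(x, t).detach() + w * b_t * grad_log_p_y
```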

Classifier-Free Guidance

Well, now you know why we emphasize classifier-free here: we want to drop that extra classifier. Surprise! We use Bayes' rule again, and we get back $\nabla \log p_t(x \mid y)$ and $\nabla \log p_t(x)$, scores our generator already knows.

How can we double-dip on Bayes' rule to get the classifier term out of a generative model, i.e., reuse our current generator to do two jobs? Here it goes:

$$\nabla \log p_t(y \mid x) = \nabla \log p_t(x \mid y) - \nabla \log p_t(x),$$

so the guided vector field becomes

$$\tilde{u}_t(x \mid y) = u_t(x) + w\,\big(u_t(x \mid y) - u_t(x)\big) = (1 - w)\, u_t(x) + w\, u_t(x \mid y).$$

Our model can produce both $u_t(x \mid y)$ and $u_t(x)$, since it can treat the unconditional case as conditioning on an empty label: $u_t(x) = u_t(x \mid \varnothing)$. We'll hack our labels so that, with some probability during training, we replace $y$ with this empty token $\varnothing$.
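Here is a minimal sketch of both halves (label dropout during training, and the guided combination at sampling time), assuming integer labels with one extra index reserved for the empty token $\varnothing$; the constants and function names are illustrative, and the conditional model is assumed to take that extra index in its embedding table.

```python
import torch

NULL_LABEL = 10          # reserved index standing in for the empty label ∅
P_DROP = 0.1             # probability of replacing y with ∅ during training

def drop_labels(y):
    """Randomly replace labels with the null token so the same model
    learns both u_t(x | y) and u_t(x) = u_t(x | ∅)."""
    mask = torch.rand(y.shape) < P_DROP
    return torch.where(mask, torch.full_like(y, NULL_LABEL), y)

def cfg_velocity(model, x, t, y, w=3.0):
    """Classifier-free guided velocity (1 - w) * u_t(x) + w * u_t(x | y)."""
    null_y = torch.full_like(y, NULL_LABEL)
    u_uncond = model(x, t, null_y)    # u_t(x) = u_t(x | ∅)
    u_cond = model(x, t, y)           # u_t(x | y)
    return (1 - w) * u_uncond + w * u_cond
```

Setting $w = 1$ recovers vanilla guidance, while $w > 1$ extrapolates away from the unconditional prediction, which is exactly the scaled-up guidance term from before, now without any classifier.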