This is needed for classifier-free guidance (CFG), a super useful trick for improving conditioned diffusion models.
I know some folks have already given an explanation but thought I’d provide a different perspective.
First let’s think about classifier guidance. Here we use the gradient of a classifier to nudge each denoising step in a direction that increases the probability of the desired class. That looks something like this:
{\widetilde{\mathbf{\epsilon}}}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)=\mathbf{\epsilon}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)-w\sigma_t\nabla_{\mathbf{x}_t}\log{p}\left(\mathbf{c}\mid\mathbf{x}_t\right)
You can see the regular conditional noise predictor model plus an additional term: the gradient of the classifier’s log-probability with respect to the noisy image \mathbf{x}_t (w is the guidance scale, and \sigma_t is the noise level at step t from your schedule).
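Here’s a minimal PyTorch-style sketch of that update, just to make the pieces concrete; eps_model, classifier, and the argument names are hypothetical placeholders, not any particular library’s API:

```python
import torch

def classifier_guided_eps(eps_model, classifier, x_t, t, c, w, sigma_t):
    # Hypothetical sketch: eps_model(x_t, t, c) is a conditional noise
    # predictor, classifier(x_t, t) returns class logits from a classifier
    # trained on noisy images at timestep t.

    # Regular conditional noise prediction: eps_theta(x_t, c)
    eps = eps_model(x_t, t, c)

    # Gradient of log p(c | x_t) with respect to the noisy image x_t
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    log_p_c = log_probs[torch.arange(len(c)), c].sum()
    grad = torch.autograd.grad(log_p_c, x_in)[0]

    # eps_tilde = eps_theta(x_t, c) - w * sigma_t * grad_x log p(c | x_t)
    return eps - w * sigma_t * grad
```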
While this greatly improves results over standard conditional models, the problem is that it requires an additional classifier, and one that has to be trained specifically on the noisy images from the diffusion process.
So how could we overcome this? What if we could somehow construct a classifier from the generative model and use that for classifier guidance?
It turns out Bayes’ Rule gives us an expression for the classifier in terms of other quantities (written here in log form, where the multiplications/divisions become additions/subtractions):
\log{p}\left(\mathbf{c}\mid\mathbf{x}_t\right) = \log{p}\left(\mathbf{x}_t\mid\mathbf{c}\right)-\log{p}\left(\mathbf{x}_t\right)+\log{p}\left(\mathbf{c}\right)
The left-hand side is your classifier (probability of class \mathbf{c} given \mathbf{x}_t); on the right-hand side, the first term is the conditional model (\mathbf{x}_t given class \mathbf{c}), the second is the unconditional model (distribution of \mathbf{x}_t), and the third is the distribution of the classes. We can plug this expression into classifier guidance. Since guidance takes gradients with respect to \mathbf{x}_t, the \log{p}\left(\mathbf{c}\right) term drops out (it doesn’t depend on \mathbf{x}_t), and each remaining gradient is just the corresponding noise prediction divided by -\sigma_t. Simplifying, we get:
{\widetilde{\mathbf{\epsilon}}}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)=\left(1+w\right)\mathbf{\epsilon}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)-w\mathbf{\epsilon}_\theta\left(\mathbf{x}_t\right)
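For completeness, here is the substitution step behind that simplification, using the standard relation between the noise prediction and the score, \mathbf{\epsilon}_\theta\left(\mathbf{x}_t\right)\approx-\sigma_t\nabla_{\mathbf{x}_t}\log{p}\left(\mathbf{x}_t\right) (and likewise for the conditional model):

\nabla_{\mathbf{x}_t}\log{p}\left(\mathbf{c}\mid\mathbf{x}_t\right)=\nabla_{\mathbf{x}_t}\log{p}\left(\mathbf{x}_t\mid\mathbf{c}\right)-\nabla_{\mathbf{x}_t}\log{p}\left(\mathbf{x}_t\right)\approx-\frac{1}{\sigma_t}\left(\mathbf{\epsilon}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)-\mathbf{\epsilon}_\theta\left(\mathbf{x}_t\right)\right)

so plugging into the classifier guidance equation:

{\widetilde{\mathbf{\epsilon}}}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)=\mathbf{\epsilon}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)+w\left(\mathbf{\epsilon}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)-\mathbf{\epsilon}_\theta\left(\mathbf{x}_t\right)\right)=\left(1+w\right)\mathbf{\epsilon}_\theta\left(\mathbf{x}_t,\mathbf{c}\right)-w\mathbf{\epsilon}_\theta\left(\mathbf{x}_t\right)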
So that’s the basic idea: construct an implicit classifier from our combined conditional/unconditional generative model (which we represent with a single neural network) and use that for classifier-based guidance.
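In code, that guidance step is just a weighted combination of two forward passes through the same network; here’s a minimal sketch, where eps_model and null_token (however your model represents “no conditioning”) are hypothetical placeholders:

```python
def cfg_eps(eps_model, x_t, t, c, null_token, w):
    # Classifier-free guidance: one network gives both predictions.
    eps_cond = eps_model(x_t, t, c)             # eps_theta(x_t, c)
    eps_uncond = eps_model(x_t, t, null_token)  # eps_theta(x_t)

    # eps_tilde = (1 + w) * eps_theta(x_t, c) - w * eps_theta(x_t)
    return (1 + w) * eps_cond - w * eps_uncond
```

In practice the two predictions are usually computed in a single batched forward pass, and you’ll often see the equivalent form eps_uncond + guidance_scale * (eps_cond - eps_uncond), which is the same thing with guidance_scale = 1 + w.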
Jeremy already linked to the blog post I was going to link, which goes into this in more detail.