The reason why we multiply the predicted noise by a constant c during image inference in stable diffusion (part 2 lesson 9)

In lesson 9, Jeremy explained why we multiply the predicted noise by a constant c before subtracting it:

The reason we don’t jump all the way to the final image is that things that look like the image we got by using t=1 (a crappy image) never appeared in our training set (does this mean we never train with highly noised images??), and since it never appeared in our training set, our model has no idea what to do with it. Our model only knows how to deal with things that look like somewhat noisy latents, so that’s why we subtract just a small factor of the noise, so that we still have a somewhat noisy latent for this process to repeat a bunch of times. (I copied Rekil’s note.)

I think this explanation is somewhat inaccurate.

I think the reason we only subtract a small factor of the predicted noise is not that we want to preserve a noisy latent. If the model predicts the noise from a purely random image, it will not predict it correctly, so even after we subtract that noise, the result is still a noisy latent anyway.

In my opinion, the reason we multiply by a constant is that we want to avoid making too large a change when the prediction is uncertain.
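To make the step I’m talking about concrete, here is a minimal sketch of the simplified update from the lesson, assuming a hypothetical `unet(latents, t)` that returns the predicted noise; the constant `c` here is just the small factor under discussion, not the actual scheduler coefficient used by Stable Diffusion:

```python
import torch

def simple_denoise_step(latents, t, unet, c=0.1):
    # Sketch only: predict the noise in the current latents, then subtract
    # just a small fraction c of it instead of jumping straight to t=0.
    with torch.no_grad():
        pred_noise = unet(latents, t)   # model's estimate of the noise
    return latents - c * pred_noise     # small, cautious step toward the image
```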

Let me know if I’m wrong and correct me. Thank you!

The sampling process in diffusion is not super clear to me either.

The way I understood it is like this: during training, every training image ends up as a pure-noise version at the higher timesteps. So, at the sampling stage, when we start with pure noise as the input, the model’s noise prediction will look somewhat like the overall average pixel values of all the training images combined (so, in this sense, it is highly uncertain). However, when the timesteps are closer to zero (the low-noise range), the noise-to-pixel ratio will be low (less noise, more image-like pixel values) and also somewhat unique to different training images (so here the model becomes more certain, because only a few training images lead to this particular noise and pixel value combination).

Based on the above reasoning, it makes sense to be cautious and remove only a small portion of the predicted noise (like a learning rate). However, when the timesteps are closer to zero, the noise-pixel combination will very likely point in the direction of the small set of training images that had this noise and pixel combination, so we can be more confident and remove a larger portion of the predicted noise at this stage (see the sketch below).
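As a toy illustration of that intuition (not the actual scheduler used in the lesson), here is a sketch where the fraction of predicted noise we remove grows as t approaches zero; `unet` and the schedule for `c` are made up for illustration:

```python
import torch

def sample(unet, steps=50, shape=(1, 4, 64, 64)):
    # Toy "learning rate" schedule: start from pure noise and remove a
    # fraction of the predicted noise that grows as t approaches zero.
    latents = torch.randn(shape)
    for i in range(steps):
        t = 1.0 - i / steps                 # t goes from ~1 (pure noise) down toward 0
        with torch.no_grad():
            pred_noise = unet(latents, t)
        c = 0.1 + 0.9 * (1.0 - t)           # small step when uncertain, larger near t=0
        latents = latents - c * pred_noise
    return latents
```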

The above is not a direct answer to your question, but this logic helped me understand the sampling process a little better.

Also, a video by Yang Song helped me understand why we add a small amount of Gaussian noise at each iteration of the sampling process, and it also gives a good overview of generative models.
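For reference, the step Yang Song describes is Langevin dynamics in the score-based view, where each iteration adds fresh Gaussian noise on top of the update along the estimated score. A minimal sketch, with `score_fn` standing in for the learned score model:

```python
import torch

def langevin_step(x, score_fn, step_size=1e-4):
    # One Langevin-dynamics update: move along the estimated score (the
    # gradient of the log-density) and add a small amount of fresh Gaussian
    # noise so samples keep exploring instead of collapsing onto one mode.
    noise = torch.randn_like(x)
    return x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * noise
```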


Also, in the lecture video Jeremy seems to correctly say “somewhat noisy latent” after the noise subtraction; he didn’t say anything about preserving the noisy representation of the latent. I think it may just be a little unclear in the notes.
