In the Stable Diffusion deep dive notebook, I see that we create two sets of embeddings: one for our prompt and one unconditional, meaning an embedding for the empty prompt. Then we concatenate these two into a single text embedding tensor, which we pass to the UNet as it predicts the noise residual.
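For reference, here is roughly what that looks like in code (a minimal sketch; I'm assuming the `tokenizer` and `text_encoder` CLIP components and the `torch_device` already set up in the notebook):

```python
import torch

prompt = ["A watercolor painting of an otter"]

# Conditional embeddings for our prompt
text_input = tokenizer(prompt, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
with torch.no_grad():
    cond_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

# Unconditional embeddings: the same procedure applied to the empty prompt
uncond_input = tokenizer([""] * len(prompt), padding="max_length",
                         max_length=tokenizer.model_max_length,
                         return_tensors="pt")
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

# Concatenate into a single tensor so one UNet call covers both
text_embeddings = torch.cat([uncond_embeddings, cond_embeddings])
```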
The idea is to use the result to provide guidance, and push our model toward generating images that are closer to the prompt we passed:
I think the conceptual idea goes something like this:
1. Generate an image without text conditioning. This will make the model move in a random direction, so any type of image can be generated.
2. Generate an image with text conditioning. The model will try to find some image that resembles the text we used.
3. Consider the vector between the unconditioned and the conditioned images. Move further away in that direction (as far away as dictated by the guidance_scale parameter; see the code sketch after this list).
4. Use that point as the reference for the next step.
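In code, steps 3 and 4 collapse into a single line inside the sampling loop (`noise_pred_uncond` and `noise_pred_text` are the two noise predictions coming out of the UNet):

```python
# Start from the unconditional prediction and move guidance_scale times
# along the direction that the text conditioning points in.
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```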
So there are two generation processes running in parallel. In practice, the latents are duplicated and the unconditional and conditional text embeddings are batched together, so we can perform a single forward pass instead of two.
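Concretely, the loop body looks something like this (a sketch following the diffusers conventions; `unet`, `scheduler`, `latents`, and the `text_embeddings` batch from above are assumed to be the objects from the notebook):

```python
for i, t in enumerate(scheduler.timesteps):
    # Duplicate the latents so the unconditional and conditional passes
    # run as a single batch of two.
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)

    # One forward pass through the UNet for both halves
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t,
                          encoder_hidden_states=text_embeddings).sample

    # Split the predictions back apart and apply the guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Step the scheduler to get the latents for the next iteration
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```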
Adding to what @bipin & @pcuenq have mentioned:
If you use the prompt + random noise and no unconditioning, you end up with something that kinda looks like an otter:
Now if you use the unconditioned scenario, where there is no text (actually the model represents this with a start-of-text token followed by a bunch of end-of-text padding tokens), and feed it the same random noise we used to generate the image with the prompt above, it gives us a random image:
Now, we used the same random noise to generate these images; only the text prompts were different. So if we subtract the two predictions, we cancel out the effect of the random noise and keep the direction that comes from the difference in text. Then, when we multiply this by the guidance_scale, we are moving more in the direction of the prompt, “A watercolor painting of an otter”.
So I was wondering: what if we take the same random noise but a different prompt, one that should be very different from the stuff we are trying to generate, say “solar system with 9 planets”? If we subtract them and use guidance_scale as before, we get:
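If you want to try this swap yourself, the only change is in the “unconditional” half of the batch (a sketch; `cond_embeddings` is the otter-prompt embedding from before, and everything else in the loop stays the same):

```python
# Embed the deliberately unrelated prompt and use it where the empty
# prompt's embedding would normally go.
negative_input = tokenizer(["solar system with 9 planets"],
                           padding="max_length",
                           max_length=tokenizer.model_max_length,
                           return_tensors="pt")
with torch.no_grad():
    negative_embeddings = text_encoder(negative_input.input_ids.to(torch_device))[0]

# The guidance step now pushes away from "solar system with 9 planets"
# and toward "A watercolor painting of an otter".
text_embeddings = torch.cat([negative_embeddings, cond_embeddings])
```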
Here is the colab notebook if you want to play around with it (my explanations may be wrong, but it’s fun to play around; add your insights too). There is also some stuff in there about scheduling guidance_scale: as we get closer to the final image, we probably want to reduce the guidance_scale.
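For that scheduling idea, one simple thing to try is decaying guidance_scale linearly over the denoising steps (just a sketch; the start/end values and the linear shape are arbitrary choices on my part):

```python
def guidance_at_step(i, num_steps, start=8.0, end=2.0):
    # Linearly interpolate from a strong scale early on (push hard toward
    # the prompt) to a weaker one near the end (mostly refine details).
    return start + (end - start) * (i / max(num_steps - 1, 1))
```

Inside the loop you would then set `guidance_scale = guidance_at_step(i, len(scheduler.timesteps))` before applying the guidance line.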