Lesson 10 official topic

How is the “padding prompt” useful?

Imagic: Text-Based Real Image Editing with Diffusion Models

16 Likes

The first step of creating an optimized embedding is not clear. Aren’t we creating an embedding with CLIP, which is input to the pre-trained diffusion model to generate the image? Also, what do they mean when they say “optimized”?

2 Likes

You can take the embeddings from the CLIP text encoder and set up an optimizer to modify them. You use them to ‘denoise’ the image, see how well that did, then update the embeddings accordingly. The idea is that this gives a new set of embeddings that, when fed to the UNet, end up generating images that look more like the input example.
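Here’s a minimal sketch of that loop, assuming a frozen diffusers unet and scheduler plus the VAE-encoded image_latents and initial CLIP text_embeddings are already in scope (those names are placeholders, not the paper’s code):

```python
import torch
import torch.nn.functional as F

# Imagic step 1 (sketch): optimize the text embeddings against a frozen UNet.
emb = text_embeddings.clone().detach().requires_grad_(True)
opt = torch.optim.Adam([emb], lr=1e-3)

for step in range(500):
    # Standard diffusion training objective, but minimized w.r.t. the embeddings
    noise = torch.randn_like(image_latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,),
                      device=image_latents.device)
    noisy = scheduler.add_noise(image_latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=emb).sample
    loss = F.mse_loss(pred, noise)  # how well did these embeddings denoise?
    opt.zero_grad()
    loss.backward()  # gradients flow into emb; the UNet stays frozen
    opt.step()
```

The only trainable tensor is emb, so the “optimized embedding” is just the text embedding after this gradient descent.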

6 Likes

The Hugging Face diffusers notebook for Dreambooth fine-tunes Stable Diffusion, so it might be a good starting point for implementing an SD version of Imagic: GitHub, Colab.

13 Likes

So this is yet another instance where we freeze the parameters and fine-tune the input. Except that this time we freeze the image as well and optimize the text embeddings, whereas with normal inference we freeze the text embeddings and optimize the image latents :thinking:

1 Like

How does the color/type of noise affect the results? Is Gaussian explicitly required due to how the model was trained?

1 Like

Yes, Stable Diffusion was trained with Gaussian noise, but recent research suggests you could train with other types of noise as well, with varying levels of success.

6 Likes

Noise is added to the VAE latent codes (a 64x64 “image” with 4 channels) - there is no concept of color there: the pixel values in those 4 channels are just “semantics” learned by the VAE.

Picture is from the accompanying notebook by @johnowhitaker.
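Roughly, with the diffusers API (a sketch, assuming a vae and scheduler in scope and an image tensor of shape (1, 3, 512, 512) scaled to [-1, 1]; 0.18215 is the latent scale factor Stable Diffusion uses):

```python
import torch

# Encode to the 4-channel latent space, then add Gaussian noise there.
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215  # (1, 4, 64, 64)

noise = torch.randn_like(latents)  # one Gaussian sample per latent "pixel"
t = torch.tensor([500])            # an arbitrary timestep for illustration
noisy_latents = scheduler.add_noise(latents, noise, t)
```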

5 Likes

I noticed that before the sampling loop the latents are multiplied by scheduler.init_noise_sigma, but there is also scheduler.scale_model_input in each iteration. What is the multiplication by init_noise_sigma for?

2 Likes

(Sander Dieleman’s blog on Guidance)

6 Likes

From the notebook, it is used for scaling the latents. The second one, scale_model_input, implements a different formula, as you can see: latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)

1 Like

If you take an image and add lots of noise (equivalent to the highest ‘timestep’ during training) you’ll get a result with a standard deviation of ~14 (the max sigma value used during training). Whereas torch.randn gives something with std 1. So, we scale by sigma_max (aka init_noise_sigma) to get something that looks more like the noisiest images the model saw during training.

Now the model inputs are not the raw noisy latents - they are a scaled version. Just a choice from the designers. So we get a second scaling bit to get the actual model inputs, which is handled by scheduler.scale_model_input.
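Putting the two scalings side by side, here’s a sketch of a sampling loop with a diffusers scheduler such as EulerDiscreteScheduler (unet and text_embeddings assumed in scope):

```python
import torch

scheduler.set_timesteps(50)

# 1) One-off scaling: torch.randn has std ~1, but the noisiest training
#    inputs had std ~sigma_max (~14), so scale up once before the loop.
latents = torch.randn(1, 4, 64, 64) * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    # 2) Per-step scaling: the model was trained on scaled inputs,
    #    latents / (sigma**2 + 1)**0.5, which scale_model_input applies.
    latent_model_input = scheduler.scale_model_input(latents, t)
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t,
                          encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
```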

10 Likes

APL / array programming topic: Array programming - fast.ai Course Forums

12 Likes

Could one implement everything in part 2 in APL or one of the array programming languages? Would we hit GPU support issues soon? Maybe with MNIST it’d be possible?

5 Likes

Yes that should be fine!

3 Likes

Thanks for the lesson. Building from scratch is awesome.
I had a question regarding the random number generator issue with PyTorch and NumPy. What was the issue with having the same random numbers generated, in the context of DL?

If two processes of a DataLoader are generating the same random numbers, then they’re generating the same “randomly” augmented images!
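Here’s a minimal repro and the usual fix, assuming PyTorch’s DataLoader on a platform that forks worker processes (e.g. Linux):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class RandDataset(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return np.random.random()  # stands in for a random augmentation

# With num_workers > 0, each forked worker inherits the same NumPy RNG
# state, so the "random" values (augmentations) can repeat across workers.
loader = DataLoader(RandDataset(), batch_size=2, num_workers=2)
print(list(loader))

# Common fix: reseed NumPy once per worker from the worker's torch seed.
def worker_init_fn(worker_id):
    np.random.seed(torch.initial_seed() % 2**32)

loader = DataLoader(RandDataset(), batch_size=2, num_workers=2,
                    worker_init_fn=worker_init_fn)
print(list(loader))  # now every worker draws different values
```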

17 Likes

Wow :zap: @Justinpinkney just released an Imagic notebook that uses Stable Diffusion:

13 Likes

See Stable diffusion: resources and discussion - #40 for some tips on avoiding CUDA memory issues.
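For quick reference, two of the commonly suggested savers, using the standard diffusers pipeline API (the linked thread has more):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,   # half precision roughly halves VRAM use
).to("cuda")
pipe.enable_attention_slicing()  # trades a little speed for lower peak VRAM
```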

4 Likes