Lesson 10 official topic

So this is yet another instance where we freeze the parameters and fine-tune the input. Except that this time we freeze the image as well and optimize the text embeddings, whereas with normal inference we freeze the text embeddings and optimize the image latents :thinking:
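Roughly, the idea looks something like this (a minimal sketch, not the exact procedure from the paper - unet, a k-diffusion style scheduler with .sigmas, image_latents and text_embeddings are assumed to already exist, as in the lesson notebooks):

import torch

for p in unet.parameters():                          # freeze the model weights
    p.requires_grad_(False)

text_embeddings = text_embeddings.clone().requires_grad_(True)   # the thing we optimize
opt = torch.optim.Adam([text_embeddings], lr=1e-3)

for step in range(100):
    i = torch.randint(0, len(scheduler.timesteps), (1,)).item()  # random noise level
    t, sigma = scheduler.timesteps[i], scheduler.sigmas[i]
    noise = torch.randn_like(image_latents)
    noisy = scheduler.scale_model_input(image_latents + sigma * noise, t)  # image stays fixed
    noise_pred = unet(noisy, t, encoder_hidden_states=text_embeddings).sample
    loss = torch.nn.functional.mse_loss(noise_pred, noise)       # same objective as training
    loss.backward(); opt.step(); opt.zero_grad()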

1 Like

How does the color/type of noise affect the results? Is Gaussian explicitly required due to how the model was trained?

1 Like

Yes, Stable Diffusion was trained with Gaussian noise, but recent research suggests you could train with other types of noise as well, with varying levels of success.

6 Likes

Noise is added on the VAE latent codes (a 64x64 image with 4 channels) - there is no concept of color there: the pixel values in those 4 channels are just “semantics” learned by the VAE.

The picture is from the accompanying notebook by @johnowhitaker
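For concreteness, something like this is what is going on in the notebook (a sketch, assuming a diffusers AutoencoderKL called vae and a scheduler whose set_timesteps has already been called; 0.18215 is the Stable Diffusion latent scaling factor):

import torch

# image: a (1, 3, 512, 512) tensor scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215   # -> (1, 4, 64, 64)

noise = torch.randn_like(latents)                       # Gaussian noise in latent space, not pixel space
noisy_latents = latents + scheduler.sigmas[0] * noise   # same x0 + sigma*noise recipe discussed below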

5 Likes

I noticed that before the sampling loop the latent is multiplied by scheduler.init_noise_sigma, but there is also scheduler.scale_model_input in each iteration. What is the multiplication by init_noise_sigma for?

2 Likes

(Sander Dieleman’s blog on Guidance)

6 Likes

From the notebook, it is used for scaling the latents. The second one, scale_model_input, implements another formula, as you can see: latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)

1 Like

If you take an image and add lots of noise (equivalent to the highest ‘timestep’ during training) you’ll get a result with a standard deviation of ~14 (the max sigma value used during training), whereas torch.randn gives something with std 1. So, we scale by sigma_max (aka init_noise_sigma) to get something that looks more like the noisiest images the model saw during training.

Now the model inputs are not the raw noisy latents - they are a scaled version; just a choice by the designers. So we get a second scaling step to produce the actual model inputs, which is handled by scheduler.scale_model_input.
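So in the sampling loop the two scalings end up looking roughly like this (a sketch with the usual notebook variable names; for the LMS scheduler scale_model_input is the division by (sigma**2 + 1) ** 0.5 mentioned above):

import torch

latents = torch.randn((1, unet.config.in_channels, 64, 64))
latents = latents * scheduler.init_noise_sigma                   # one-off: std 1 -> std ~14 (sigma_max)

for t in scheduler.timesteps:
    latent_model_input = scheduler.scale_model_input(latents, t) # per-step input scaling
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

The (sigma**2 + 1) ** 0.5 divisor also makes sense from the same variance argument: the noisy latents have variance of roughly 1 + sigma**2 (data variance plus noise variance), so dividing by its square root brings the model inputs back to roughly unit variance.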

10 Likes

APL : Array Programming topics. Array programming - fast.ai Course Forums

12 Likes

Could one implement everything in part 2 in APL or one of the array programming languages? Would we hit GPU support issues soon? Maybe with MNIST it’d be possible?

5 Likes

Yes that should be fine!

3 Likes

Thanks for the lesson. Building from scratch is awesome.
I had a question regarding the random number generator issue with PyTorch and NumPy. What was the issue with having the same random numbers generated, in the context of DL?

If two worker processes of a DataLoader are generating the same random numbers, then they’re generating the same “randomly” augmented images!
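One standard fix is to give each worker its own NumPy seed via worker_init_fn (a sketch; dataset here stands for whatever Dataset you’re using):

import numpy as np
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # each worker gets a distinct torch base seed; reuse it to seed NumPy too
    np.random.seed(torch.initial_seed() % 2**32)

dl = DataLoader(dataset, batch_size=64, num_workers=4, worker_init_fn=worker_init_fn)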

17 Likes

Wow :zap: @Justinpinkney just released the Imagic notebook that uses Stable Diffusion:

13 Likes

See Stable diffusion: resources and discussion - #40 for some tips for avoiding CUDA memory issues
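(For example, running the pipeline in fp16 and turning on attention slicing goes a long way on smaller GPUs - a sketch using the diffusers StableDiffusionPipeline:)

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()   # lower peak memory at a small cost in speed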

4 Likes

I’m a bit confused about noise_pred - whether it is the seed noise at t0 or the noise at the current step. I’m trying to reason through this myself, but am still fuzzy.

Suppose we predicted some noise_pred from the unet

noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]

If we wish to calculate latents_x0 ourselves, we could do so using

latents_x0 = latents - sigma * noise_pred 

Suppose sigma is the largest value at ~14

sigma = scheduler.sigmas[i]  # ~14

The calculation thus becomes

latents_x0 = latents - 14 * noise_pred 

Does this mean that noise_pred is the seed noise at x0? Otherwise we wouldn’t need to scale it up by 14?

As far as I understood, the unet model predicts the noise present in a latent. Since we have a latent for the text prompt as well as for an unconditional prompt (empty string), the noise_pred variable will contain the noise for both these latents as two separate tensors (i.e., one tensor containing the noise for the text latent and the other for the noise in the unconditional latent).
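In code that’s the familiar chunk-and-combine step from the sampling loop (assuming the usual variable names and a guidance_scale):

noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)   # split the two predictions
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)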

1 Like

I’m not too sure what you mean by seed noise. But let’s look at pseudo-code for training:

x0 = batch                                # sample from the data
t = random timestep
noise = torch.randn_like(x0)              # noise from a Gaussian distribution
sigma = scheduler.sigmas[t]               # the sigma value for that timestep
noisy_x = x0 + sigma*noise                # combine x0 with scaled noise
model_input = scale_model_input(noisy_x)  # scale for model input
model_pred = unet(model_input, t, ...)    # model output
loss = mse_loss(model_pred, noise)        # model output should match the (unscaled) noise
pred_x0 = noisy_x - sigma*model_pred      # estimate of the denoised x0

So the model output tries to match the noise (which was drawn from a normal distribution). I think this is what you mean by ‘seed noise’. The ‘noise at the current step’ is just that, scaled by sigma.

This is just a design choice by the people training this model. You could also have the model output try to match x0, or the current noise (i.e. noise*sigma), or some other funkier objective. They chose to do it this way so that the model output can always be at roughly the same scale (approx unit variance), despite the fact that the ‘current noise’ is going to have wildly different variance depending on the noise level sigma. Whether this is the best way is up for debate - I’m not convinced myself :slight_smile:
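A quick way to see the scale argument numerically (just checking standard deviations):

import torch

noise = torch.randn(100_000)
for sigma in (0.1, 1.0, 14.0):
    print(sigma, (sigma * noise).std().item())   # the 'current noise' target: std ~0.1, ~1, ~14
print(noise.std().item())                        # the unscaled noise target: always std ~1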

9 Likes

Thank you for the great explanation, I understand now :+1:

2 Likes

Reading the Python docs on generators and iterators, and I’m a bit confused about the difference between the two. Will sleep on it - maybe it will be clearer tomorrow :slight_smile:
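If it helps: every generator is an iterator, but not every iterator is a generator - a generator is just a convenient way to build an iterator from a function with yield. A tiny example:

def count_up(n):            # generator function: calling it returns a generator (an iterator)
    i = 0
    while i < n:
        i += 1
        yield i

class CountUp:              # hand-written iterator doing the same thing
    def __init__(self, n): self.i, self.n = 0, n
    def __iter__(self): return self
    def __next__(self):
        if self.i >= self.n: raise StopIteration
        self.i += 1
        return self.i

print(list(count_up(3)), list(CountUp(3)))   # [1, 2, 3] [1, 2, 3]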


1 Like