Lesson 10 official topic

Could one implement everything in part 2 in APL or one of the other array programming languages? Would we hit GPU support issues soon? Maybe with MNIST it'd be possible?

5 Likes

Yes that should be fine!

3 Likes

Thanks for the lesson. Building from scratch is awesome.
I had a question regarding the random number generator issue with PyTorch and NumPy. What was the issue with having the same random numbers generated, in the context of DL?

If two processes of a DataLoader are generating the same random numbers, then they’re generating the same “randomly” augmented images!
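
For anyone who wants to see this concretely, here is a minimal sketch (not the lesson code) of the classic NumPy-seed duplication across DataLoader workers, together with the usual worker_init_fn fix; the dataset and function names here are made up for illustration:

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class RandomAugDataset(Dataset):
    """Toy dataset whose 'augmentation' is just a NumPy random draw."""
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        # If forked workers inherit the same NumPy state, different workers
        # can return identical "random" values here.
        return np.random.rand()

def seed_worker(worker_id):
    # Give each worker its own NumPy seed, derived from its per-worker torch seed
    np.random.seed(torch.initial_seed() % 2**32)

dl = DataLoader(RandomAugDataset(), batch_size=4, num_workers=2,
                worker_init_fn=seed_worker)
for batch in dl:
    print(batch)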

17 Likes

Wow :zap:@Justinpinkney just released the Imagic notebook that uses Stable Diffusion:

13 Likes

See Stable diffusion: resources and discussion - #40 for some tips for avoiding CUDA memory issues

4 Likes

I’m a bit confused about noise_pred: is it the seed noise at t0, or the noise at the current step? I’m trying to reason through this myself, but I’m still fuzzy.

Suppose we predict some noise_pred from the unet:

noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]

If we wish to calculate latents_x0 ourselves, we could do so using

latents_x0 = latents - sigma * noise_pred 

Suppose sigma is the largest value at ~14

sigma = scheduler.sigmas[i]  # ≈ 14

The calculation thus becomes

latents_x0 = latents - 14 * noise_pred 

Does this mean that noise_pred is the seed noise at x0? Otherwise we wouldn’t need to scale it up by 14?

As far as I understood, the unet model predicts the noise present in a latent. Since we have a latent for the text prompt as well as for an unconditional prompt (empty string), the noise_pred variable will contain the noise for both of these latents as two separate tensors (i.e., one tensor containing the noise for the text latent and the other the noise for the unconditional latent).
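
Concretely, in the sampling loop both predictions come out of a single batched unet call and then get split apart, roughly like this sketch (variable names assumed to follow the notebook, with the embeddings concatenated unconditional-first):

latent_model_input = torch.cat([latents] * 2)   # one copy per prompt
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)   # split the two predictions
# classifier-free guidance then combines them
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)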

1 Like

I’m not too sure what you mean by seed noise. But let’s look at pseudo-code for training:

x0 = batch                                 # Sample from data
noise = torch.randn_like(x0)               # Noise from a Gaussian distribution
sigma = scheduler.sigmas[random_timestep]  # Some sigma value
noisy_x = x0 + sigma * noise               # Combine x0 with scaled noise
model_input = scale_model_input(noisy_x)   # Scale for model input
model_pred = unet(model_input, t, ...)     # Model output
loss = mse_loss(model_pred, noise)         # Model output should match the (unscaled) noise
pred_x0 = noisy_x - sigma * model_pred     # Estimate of x0 implied by the prediction

So the model output tries to match the noise (which was drawn from a normal distribution). I think this is what you mean by ‘seed noise’. The ‘noise at current step’ is just that scaled by sigma.

This is just a design choice of the people training this model. You could also have the model output try to match x0, or the current noise (i.e. noise*sigma), or some other funkier objective. They choose to do it this way so that the model output can always be roughly the same scale (approx unit variance) despite the fact that the ‘current noise’ is going to have wildly different variance depending on the noise level sigma. Whether this is the best way is up for debate - I’m not convinced myself :slight_smile:
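
As a rough sketch of those alternatives, using the names from the pseudo-code above (just for intuition, not actual training code):

# Given noisy_x = x0 + sigma * noise, the training target could be any of:
target_eps    = noise            # the (unscaled) noise, the choice used here
target_x0     = x0               # the clean sample itself
target_scaled = sigma * noise    # the noise at the current scale

# ...and the matching way to recover an x0 estimate from the model prediction:
x0_from_eps    = noisy_x - sigma * model_pred   # if the model predicts the noise
x0_from_x0     = model_pred                     # if the model predicts x0
x0_from_scaled = noisy_x - model_pred           # if the model predicts sigma*noise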

9 Likes

Thank you for the great explanation, I understand now :+1:

2 Likes

Reading the Python docs on generators and iterators, and I’m a bit confused about the difference between the two. Will sleep on it; maybe it will be clearer tomorrow :slight_smile:


1 Like

A generator is one way to create an iterator.
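
A tiny example (mine, not from the docs) that might make the distinction click: a generator function builds the iterator for you, whereas an iterator class spells out the same protocol by hand.

def count_up_to(n):
    # Generator function: calling it returns an iterator
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up_to(3)   # gen is an iterator produced by a generator
print(next(gen))       # 0
print(next(gen))       # 1

class CountUpTo:
    # Equivalent hand-written iterator: same protocol, more boilerplate
    def __init__(self, n):
        self.i, self.n = 0, n
    def __iter__(self):
        return self
    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return self.i - 1

print(list(CountUpTo(3)))   # [0, 1, 2]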

3 Likes

I think I got a working version of negative prompts implemented. I feel like the way I implemented the subtraction of the negative prompt is wonky, though. Is there a better way I could approach it?

pred = pred_uncond + (guidance_scale * (pred_text - pred_uncond)) + (guidance_scale * (pred_uncond - pred_neg))

3 Likes

Here’s your equation, replacing the vars with single letters:

u + (g*(t-u)) + (g*(u-n))

We could distribute g over both bits:

u + g*(t-u+u-n)

…then cancel the u’s:

u + g*(t-n)

Not sure if I’ve messed up any of the algebra there, but if not, it sounds like you can simplify the equation a bit. It makes sense to me intuitively that the “direction” you want to head is the difference between the prompt and the negative.
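
In code, that simplification would just be (using your variable names, and assuming the algebra above is right):

pred = pred_uncond + guidance_scale * (pred_text - pred_neg)   # head from the negative towards the prompt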

7 Likes

Question on contrastive loss:

The example in the video showed 4 very dissimilar pairs of matched pictures and captions. But what if some of the pics/captions were very similar? Say there were 3 different pictures all containing swans with captions like “a graceful swan”, “a lovely swan”, and “a swan on a lake in a park”. These should all be much more similar to each other than they are to “fast.ai logo”.

Is there some mechanism to allow for similarity for off-diagonal pairs? So the loss would be small for any of the swan related captions paired with any swan related pics? Or does penalizing everything except the diagonal work well enough?

1 Like

Yes, the TV-grain type of noise works on pixels. Latent noise works in latent space, and when it’s vae.decode()’d (‘enlarged’) back into pixels it looks like what Jeremy showed.

1 Like

No. In a large dataset, they’re not going to be very similar very often, and this noise will just cancel out in the end, so it can be safely ignored.
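
For reference, here is a minimal sketch of the kind of CLIP-style contrastive loss being discussed (an assumed setup, not the exact lesson code); every off-diagonal pair, swan-like or not, is simply treated as a negative:

import torch
import torch.nn.functional as F

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    # image_embs and text_embs are L2-normalised embeddings for matched pairs
    logits = image_embs @ text_embs.T / temperature             # pairwise similarities
    targets = torch.arange(len(logits), device=logits.device)   # the diagonal is "correct"
    loss_i = F.cross_entropy(logits, targets)                   # images -> captions
    loss_t = F.cross_entropy(logits.T, targets)                 # captions -> images
    return (loss_i + loss_t) / 2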

2 Likes

Bash commands in the notebook also work without the !

4 Likes

Here is another one which seems to have been released the same day as Imagic; not sure how they compare, but these are really cool.

10 Likes

Does anybody know a resource to get some intuition on how outpainting works? It’s something I’d like to try, but I cannot really think of how to make it work.

3 Likes