One course content question: Will there be some time devoted to discussing performance optimizations?
It looks like various techniques have been used to reduce the VRAM required for training DreamBooth from 24 GB to under 8 GB! Techniques like these, together with e.g. the gradient accumulation discussed in part one of the course, could make the difference when running on consumer hardware.
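For anyone who hasn't seen it: the core idea of gradient accumulation is to run several small micro-batches, average their gradients, and only take an optimizer step afterwards, so you get the effect of a large batch without holding it all in memory at once. Here's a minimal pure-Python sketch with a toy 1-D linear model (the model, data, and names like `accum_steps` are just illustrative):

```python
# Toy 1-D linear model y = w*x; loss = mean((w*x - y)^2) over a batch.
data = [(x, 2.0 * x) for x in range(1, 9)]  # ground truth w = 2

def grad(w, batch):
    # d/dw of mean((w*x - y)^2) = mean(2*(w*x - y)*x)
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

lr = 0.01
accum_steps = 4
micro = [data[i::accum_steps] for i in range(accum_steps)]  # 4 micro-batches of 2

# Gradient accumulation: average micro-batch grads, then one optimizer step.
w = 0.0
g = sum(grad(w, mb) for mb in micro) / accum_steps
w_accum = w - lr * g

# Reference full-batch step (needs the whole batch in memory at once).
w_full = w - lr * grad(w, data)

print(abs(w_accum - w_full) < 1e-12)  # True: same update, less peak memory
```

With equal-size micro-batches the accumulated update is mathematically identical to the full-batch one; in a real NN you'd keep calling `loss.backward()` to sum gradients and call `optimizer.step()` every `accum_steps` batches.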
If time permits, I'd also be very interested in learning more about how to apply sequence models to video (or CT scan slices)… maybe going from ResNet activations on individual images/frames to an LSTM (or transformer) for overall sequence classification, etc., in a “graceful” way using fastai (if that’s the best approach).
Why do we try to draw the noise rather than going straight to drawing the digit itself? Our end goal is to draw the digits in this case, so I’m not quite understanding why the model’s output is the noise rather than the digit directly.
I keep getting the below error while running the notebook on colab (free version) with GPU runtime. Any suggestion on how to resolve this?
OSError: There was a specific connection error when trying to load CompVis/stable-diffusion-v1-4:
<class 'requests.exceptions.HTTPError'> (Request ID: o52_DNplzfZM55fVguDXA)
Either works. Predicting the noise (or some scaled version) is convenient for some mathy formulations, and some people hand-wave about it being easier to get NNs to output zero-mean gaussians (but I’m sceptical about that justification ;). If you have the noise then it tells you the ‘direction’ you need to edit your noisy image which is what we want for the sampling step, so that tends to be the convention, at least for now.
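To make the ‘direction’ point concrete, here’s a minimal NumPy sketch of the standard DDPM forward process (the `alpha_bar`/`eps` notation is the usual one; the “perfect prediction” at the end is just for illustration, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "image" and the standard DDPM forward process:
#   x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps
x0 = rng.uniform(-1, 1, size=(4, 4))   # clean sample
eps = rng.normal(size=(4, 4))          # the noise the model is trained to predict
alpha_bar = 0.5                        # cumulative noise schedule at some step t

xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# If the network predicts eps, the noise is exactly the 'direction' we need
# to move x_t back toward the clean image:
eps_pred = eps                         # pretend-perfect prediction for illustration
x0_hat = (xt - np.sqrt(1 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

print(np.allclose(x0_hat, x0))  # True: knowing the noise gives the edit direction
```

So predicting `eps` and predicting `x_0` carry the same information at a given step; samplers just happen to be conventionally written in terms of the predicted noise.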
The suggested explanation is that we are trying to learn an NN as an optimizer, that is, a function that can estimate the gradients from the noisy images (just like Jeremy showed in today’s lesson in the handwritten-digit motivation).
The actual explanation is: it seems to work a lot better. NNs that are big enough can learn pretty much anything, so it could be that predicting the noise is just a better regularization method that helps these networks learn better latent representations and not overfit as much as GANs or other generative methods.
Although, to be clear, as Jeremy pointed out at the end of the lesson, developing the view of learning an NN as an optimizer seems to be paying off with incredible results in recent papers, so it’s important to keep it in mind.
That’s how I understand things so far, at least. I’m open to corrections!
@jeremy can you please let us know when we should expect the other lecture support material covering the math, apart from watching the in-depth video from @johnowhitaker? Also, will there be help/support if something isn’t clear from the math bit?
This was an awesome lesson. It really broke down the individual components of stable diffusion and what we are aiming to understand. Really made stable diffusion uncool, fast.ai style!
According to the authors of “Denoising Diffusion Probabilistic Models”, the reason is: “Ultimately, our model design is justified by simplicity and empirical results”, so it’s likely that this approach simply works better (as of yesterday, anyway).