Lesson 9 (part 2) preview

We are making the full videos of lessons 9 and 10 of the new “From Deep Learning Foundations to Stable Diffusion” course available as a special preview of the new course! Here are all the resources for lesson 9 (which includes 3 videos):

Lesson resources

Links from the lesson


This is so great! Thanks, Jeremy :slight_smile: I had resigned myself to waiting a few months to have access to the course. But just having these videos gives me so much to work with! There goes the rest of my day, but I’m so looking forward to watching the two videos and learning from them!


A few bits of feedback on the accompanying notebook(s): at present they seem to be set up for CUDA only. But the code should work just as well on an Apple Silicon Mac (or even on an Intel Mac, but extremely slowly) with just a simple change :slight_smile:

If you add the following line in the second cell after the imports:

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

Then all you need to do is change any other cells which have .to("cuda") to .to(device) and the code will work on any supported GPU/CPU setup.
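The same fallback order can also be written as a small helper, which is a bit easier to read and to test than the one-liner. This is only a sketch: pick_device is a made-up name, not something from the notebooks, and the MPS check assumes a PyTorch build that ships torch.backends.mps.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    device = pick_device(
        torch.cuda.is_available(),
        mps_backend is not None and mps_backend.is_available(),
    )
except ImportError:
    # Without PyTorch installed there is nothing to accelerate (hypothetical fallback).
    device = "cpu"
```

After this, device holds one of "cuda", "mps" or "cpu" and can be passed to .to(device) as described above.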

Also, if you have already downloaded the Hugging Face Stable Diffusion model, you can simply set up a symlink (this works on any platform: macOS, Linux, Windows) named “stable-diffusion-v1-4” in the folder where you keep your notebook. Of course, if you are on Colab, it’s easier to download the model all over again. There’s also a solution there using a connected Google Drive, but I won’t go into that :stuck_out_tongue:

So if you have the Hugging Face Stable Diffusion model at /Users/myuser/stable-diffusion-v1-4/, then you can simply switch to the folder where you have the Jupyter notebooks and run the following (this is for Linux/macOS; the Windows command is slightly different):

ln -s /Users/myuser/stable-diffusion-v1-4/ stable-diffusion-v1-4

That’ll create a folder pointing to the original location and save you several gigabytes of space being used up again :slightly_smiling_face:
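If you'd rather not remember the per-platform command, the same link can probably be created from Python. A minimal sketch, assuming the model folder already exists; link_model_dir is a hypothetical helper name, and on Windows creating symlinks may additionally require Developer Mode or an elevated prompt.

```python
import os
from pathlib import Path


def link_model_dir(source: str, link_name: str) -> Path:
    """Create link_name as a symlink to the source directory unless it already exists."""
    src = Path(source).expanduser().resolve()
    link = Path(link_name)
    if not src.is_dir():
        raise FileNotFoundError(f"model directory not found: {src}")
    if link.is_symlink() or link.exists():
        # Leave whatever is already there untouched.
        return link
    os.symlink(src, link, target_is_directory=True)
    return link
```

Run from the notebook folder, link_model_dir("/Users/myuser/stable-diffusion-v1-4/", "stable-diffusion-v1-4") should give the same result as the ln -s command above.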

Then, you’d have to change the following line (or similar ones) from the notebook, to point to your folder instead of the model from the Hugging Face hub, as follows:

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16).to(device)

The above should be changed to:

pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16).to(device)

Notice that you are pointing to the directory (or the symlink to the directory) where the models are on your local drive.

And finally, if you are on macOS, you should also drop the float16 parts, since float16 isn’t correctly supported on macOS at the moment. So drop the following from the above line:

, revision="fp16", torch_dtype=torch.float16

That gives you this as your final line (but only on macOS):

pipe = StableDiffusionPipeline.from_pretrained("stable-diffusion-v1-4").to(device)
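To keep one notebook working on both CUDA and macOS, the float16 arguments could be chosen based on the device string instead of being edited by hand. This is a sketch, not what the notebooks do: pipeline_kwargs is a made-up helper, and treating every non-CUDA device (including "cpu") as full precision is my assumption.

```python
def pipeline_kwargs(device: str) -> dict:
    """Return from_pretrained keyword arguments suited to the given device string."""
    if device == "cuda":
        # Imported lazily: only the half-precision path needs torch here.
        import torch

        return {"revision": "fp16", "torch_dtype": torch.float16}
    # Assumption: on "mps" and "cpu" we fall back to the default full-precision weights.
    return {}


# Intended use (not executed here, since it needs diffusers and the model files):
# pipe = StableDiffusionPipeline.from_pretrained(
#     "stable-diffusion-v1-4", **pipeline_kwargs(device)
# ).to(device)
```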


Just finished the first video (Lesson 9) and I wanted to leave my initial feedback/impressions …

I’ve been working with Stable Diffusion code for a good two months now (not to modify the functionality but to create GUIs/output), but only vaguely understood terms such as VAE, latents, timestep, or beta. I knew what some of them were supposed to be/do, but not how they worked or how it all tied together. I finally understand all of this so much better after the first lecture. So, a great big thank you from me :slightly_smiling_face:

Also, I’ve been struggling with performance on macOS, since PyTorch/diffusers has lately become much slower at generating images on Apple Silicon, so I wanted to find out how to make things better. The TensorFlow version of Stable Diffusion gives much better performance but is missing some features I’d like to see, and I didn’t have enough knowledge to figure out how to implement those myself. So, you mentioning that you’ll be implementing the code in pure Python made me prick up my ears :slight_smile:

I’m now impatient to see how that part works out, but I do realize that I might not be able to see it all till the full course is out. But again, thank you: this has been a huge revelation in terms of how much knowledge I’ve gained in a single video!


Hi folks, I have started to look into the Lesson 9 video. Where is the main notebook that @jeremy showed in the video? Specifically the part where he is talking about guidance scale, negative prompts, init image, textual inversion, Dreambooth etc.

I got the course repo but do not think it is there.

I believe it’s in the diffusion-nbs repo, which is also linked to up there :slight_smile: It should be in one of the links that Jeremy provided up there, if that isn’t the right one …


Hi, I have used CLIP in some projects, where I got a single embedding of size 768 per text (or image), meaning a single semantic point per text, which looks pretty reasonable. Stable Diffusion instead gets a matrix of shape (77, 768), which is one point per token. I don’t have an intuition for why that is or how it works. Thanks


Is there a specific reason UNet predicts the noise, instead of predicting the image with some noise reduced? Is it because we want to have control over what percentage of the predicted noise we actually want to subtract from the image?

What is the intuition behind using a unet here? Is it “segmenting” noise from non-noise here or is there another deeper reason that the skip connections are needed?

Lovely fastAI community!

Does anyone have experience with text embeddings having more than 77 tokens (CLIP)?
I am working on an exciting dataset where the text associated with each image is longer: 150 tokens on average.

Is there an LLM that directly outputs a prompt for Stable Diffusion? I could not find anything on the Hugging Face Hub, only GPT-2 prompt generation (could I misuse such a model to summarize a text into a prompt?)

I hope it’s understandable what I need ^^
I am looking forward to sharing the result here in the forum as soon as it is working :wink:

Thanks for this fantastic video series!
Very much looking forward to the follow-up videos :stuck_out_tongue:


Is it possible to run these notebooks on a GPU with low memory (8GB)?

In the 1st notebook, pipe.enable_attention_slicing() helped me.

But in the second one (Stable Diffusion Deep Dive.ipynb), I get this error message:

RuntimeError: CUDA out of memory. 
Tried to allocate 512.00 MiB (GPU 0; 7.80 GiB total capacity; 6.15 GiB already allocated; 142.06 MiB free; 6.67 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It looks like it is possible to play with the CUDA allocator options. If anyone already knows how to do it, it would be great to hear!
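For what it's worth, the option named in the error message is read from an environment variable, so one way to try it is to set it at the very top of the notebook. A minimal sketch: the value 128 is just an illustrative starting point, not a recommendation from the thread, and this only helps with fragmentation, not with a model that genuinely needs more than 8GB.

```python
import os

# Set this before the first CUDA allocation; doing it before importing torch is safest.
# setdefault keeps any value that was already exported in the shell.
# 128 is an illustrative value (an assumption), so expect to experiment with it.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

If the kernel has already imported torch and touched the GPU, restart it first so the setting is picked up.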


Another question related to an issue I encountered when running those notebooks.

When running inference, my Jupyter kernel dies. More specifically, it runs smoothly on my Linux machine and dies on my Windows 10 / WSL 2 / Ubuntu 22.04 machine.
The difference I have in mind is the CUDA version.
I use CUDA 11.3 on Linux, and tried 2 versions in WSL (CUDA 12.0, CUDA 11.7.1).

Anyone having the same kind of issues?


I don’t know the exact answer to your question, but here is one possible reason.

By having the network predict the noise, we can compare exactly the noise we added with the outputs of the neural network. The loss function (using MSE or otherwise) becomes obvious.

If on the other hand, we wanted to calculate the loss based on the outputs of the network as an image with less noise, how would we correlate the noise we added to such outputs?
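To make the comparison concrete, here is a toy sketch in plain Python. The thread does not say how the noise is mixed in, so the square-root weighting with a single alpha_bar value below is an assumption borrowed from the usual DDPM-style setup, and all function names are made up for illustration.

```python
import math
import random


def add_noise(x0, noise, alpha_bar):
    """Mix a clean signal with noise: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [keep * x + mix * n for x, n in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def estimate_x0(x_t, pred_noise, alpha_bar):
    """Invert add_noise using a noise prediction to get an estimate of the clean signal."""
    keep = math.sqrt(alpha_bar)
    mix = math.sqrt(1.0 - alpha_bar)
    return [(x - mix * n) / keep for x, n in zip(x_t, pred_noise)]


rng = random.Random(0)
x0 = [rng.uniform(-1, 1) for _ in range(16)]   # stand-in for an image (or latent)
noise = [rng.gauss(0, 1) for _ in range(16)]   # the noise we added, so we know it exactly
x_t = add_noise(x0, noise, alpha_bar=0.5)

perfect_loss = mse(noise, noise)               # loss of a model that predicts the noise exactly
recovered = estimate_x0(x_t, noise, alpha_bar=0.5)
```

Under this assumption the training target is simply the noise we sampled, and a noise prediction can be turned into an estimate of the clean image with a bit of arithmetic, so the two targets are closely related rather than fundamentally different.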

Thank you so much @jeremy!

I absolutely loved Lesson 9. I’m so excited, can’t wait to continue on with the rest.
There are a couple of things I didn’t understand. Would be grateful if people here could shed some light.

  1. Why/how is the resnet block used to further reduce the number of pixels after the convolutional layers?
  2. When using latents instead of full images, what is the meaning of “adding noise”? Do we add noise as if the latent were a plain image, or do we have to transform that noise in some way?
    What about the resulting latent+noise image? Is it an image that I can display and understand, or is it just a tensor of numbers that will only make sense after the decoder stage?

I found the answer to my issue.

Under WSL, at least with my configuration, I have to run export LD_LIBRARY_PATH=/usr/lib/wsl/lib, otherwise I get this error: Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory

Can someone help me find the link to the notebook Jeremy is using in the first lecture?

The notebook used in lecture 9 is stable_diffusion.ipynb in the fastai/diffusion-nbs repo on GitHub (master branch).

For StableDiffusionImg2ImgPipeline, the init_image argument needs to be changed to image.


Hi, I watched lesson 9A by @johnowhitaker and was so impressed by the details and the nice explanations of abstract concepts. I especially liked the paper drawings and all the comments about bringing more control to the generation process.
I have a follow-up question about the generation process. Can I use a very low-resolution image, like 32x32, as guidance to create an artistically stylized image with the same core object? I wonder if anyone is willing to work on it or give me some guidance on how to create that flow. Is that technically possible, and how can the upscaling process be made generative?


It is definitely possible to provide a low-resolution image as a hint for the generation process. An example is the Stable Diffusion upscaler: stabilityai/stable-diffusion-x4-upscaler on Hugging Face

Another approach is to scale up your 32px image to the desired size, add lots of noise and then denoise again - this way hopefully the general layout is kept while new details are invented. It won’t be perfect and you’ll need a good prompt since your input image won’t have much for the network to work with…
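The first half of that second approach (scale up, then add noise) can be sketched in plain Python on a tiny grayscale grid. This is only an illustration: the function names are made up, the linear blend with a strength value is a stand-in for what a real scheduler does, and the denoising step itself would be an img2img pipeline rather than anything shown here.

```python
import random


def upscale_nearest(img, factor):
    """Nearest-neighbour upscale of a 2D list of pixel values by an integer factor."""
    out = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def add_gaussian_noise(img, strength, seed=0):
    """Blend each pixel with Gaussian noise; strength 0 keeps the image, 1 replaces it."""
    rng = random.Random(seed)
    return [
        [(1 - strength) * px + strength * rng.gauss(0, 1) for px in row]
        for row in img
    ]


small = [[0.0, 1.0], [1.0, 0.0]]            # stand-in for the 32px input
big = upscale_nearest(small, 2)             # coarse layout at the target size
noisy = add_gaussian_noise(big, 0.8)        # lots of noise, layout only faintly visible
```

The more noise is added, the more freedom the model has to invent detail, and the less of the original layout survives, which is the trade-off described above.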

Final note: a new paper using a GAN-based approach came out yesterday with some very impressive super-resolution results. Not diffusion so a bit off-topic but interesting as comparison: GigaGAN: Scaling up GANs for Text-to-Image Synthesis

Also thank you for the kind words :slight_smile: