Lesson 9 official topic

After going through @ababino’s excellent set of questionnaires I decided to create a wiki post with the answers to the questions.

It’s currently incomplete; please feel free to edit it to add more answers.

Part 1:

  1. Lesson 9 is the continuation of the FastAI part 1 course which has 8 lessons.
  2. strmr.com offers a service to fine-tune diffusion models for subject generation based on DreamBooth.
  3. @ababino @muellerzr @ilovescience @init_27 and many more…
  4. Computing services:
    • Lambda Labs
    • Paperspace Gradient
    • Jarvis Labs
    • vast.ai
  5. It’s a repository containing notebooks to help you get started with stable diffusion.
  6. It contains notebooks and tools created by the AI art community to check out as a starting point.
  7. The stable_diffusion.ipynb notebook uses the fantastic diffusers library by the good folks at HuggingFace.
  8. HuggingFace pipelines are end-to-end inference pipelines that allow you to get started with just a few lines of code.
  9. We use the from_pretrained method to download the pre-trained weights.
  10. Paperspace and Lambda Labs have persistent storage, so there is no need to reinstall libraries/dependencies every time you start your notebook, as you must on Colab.
  11. We can use the StableDiffusionPipeline to produce images from a prompt (see the sketch after this list).
  12. We can set the random seed using the torch method torch.manual_seed(seed).
  13. We should set a random seed manually for the reproducibility of our results.
  14. Stable Diffusion is based on a progressive denoising algorithm; we start with pure random noise and remove some noise incrementally with each step to produce a convincing image.
  15. Currently the model doesn’t do a very good job with only a few denoising steps. The number of denoising steps is not fixed, but the model works well with more.
  16. The image_grid function takes a set of images and displays them in a grid.
  17. The adherence of the generated images to the prompt increases as the value of the guidance scale increases.
  18. For each prompt, two images are generated: one using the prompt and one using no prompt (unconditional, essentially a random image). The two are then combined in a weighted average dictated by the guidance_scale parameter.
  19. Negative prompting refers to using another prompt to generate an image and subtracting it from the image generated by the original prompt.
  20. The Image2Image pipeline starts with a noisy version of an initial image instead of pure noise and gradually denoises it to match the prompt.
  21. The strength parameter specifies to what degree to follow the original image.
  22. We could take the Image2Image pipeline’s output image and feed it back into the pipeline with a different prompt to produce even better images.
  23. The Stable Diffusion model is fine-tuned on a dataset of Pokémon images with their respective captions.
  24. “textual inversion” is the idea of creating a new token for a particular concept and fine-tuning a single embedding to refer to a particular concept using example images.
  25. “dreambooth” refers to taking an existing token that is rarely used and fine-tuning the model to associate that token with the example images we provide.
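
As a rough illustration of several answers above (pipeline creation, seeding, guidance scale, and negative prompting), here is a minimal sketch using the diffusers library; the model id and prompts are just placeholders:

import torch
from diffusers import StableDiffusionPipeline

# Download the pre-trained weights and move the pipeline to the GPU (questions 8, 9, 11)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

torch.manual_seed(42)  # fixed seed for reproducible results (questions 12, 13)
image = pipe(
    "an astronaut riding a horse in the style of Grant Wood",
    num_inference_steps=50,                 # more denoising steps generally help (question 15)
    guidance_scale=7.5,                     # higher values adhere more closely to the prompt (question 17)
    negative_prompt="blurry, low quality",  # steer the generation away from this prompt (question 19)
).images[0]
image.save("astronaut.png")
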
4 Likes

Based on my understanding so far, and I’m new to this too, I think the text/output embeddings are fed in at point B, as point B is where the Unet resides.

Using the “Stable Diffusion Deep Dive” notebook (The Autoencoder section) as an example:

At point A, we convert the image to a latent using the VAE encoder:

def pil_to_latent(input_im):
    # Single image -> single latent in a batch (so size 1, 4, 64, 64)
    with torch.no_grad():
        latent = vae.encode(tfms.ToTensor()(input_im).unsqueeze(0).to(torch_device)*2-1) # Note scaling
    return 0.18215 * latent.latent_dist.sample()
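
For instance (assuming input_image is a 512x512 PIL image loaded earlier in the notebook):

encoded = pil_to_latent(input_image)
print(encoded.shape)  # torch.Size([1, 4, 64, 64])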

This function outputs the latent. Next, we feed the latent into the UNet residing at B. Here the latent goes into a for-loop to be denoised over a certain number of time-steps.

for i, t in tqdm(enumerate(scheduler.timesteps)):

and at each iteration, the latent is fed into the UNet:

noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]

Inside the UNet, the text embeddings are fed in repeatedly in the down-blocks, mid-block, and up-blocks (refer to the Hugging Face diffusers UNet2DConditionModel code here). I’ve annotated where the text embeddings are fed in with three hashes (###):

        # 3. down
        down_block_res_samples = (sample,)
        for downsample_block in self.down_blocks:
            if hasattr(downsample_block, "attentions") and downsample_block.attentions is not None:
                sample, res_samples = downsample_block(
                    hidden_states=sample,
                    temb=emb,
                    encoder_hidden_states=encoder_hidden_states,  ### text-embedding fed into down-block
                )
            else:
                sample, res_samples = downsample_block(hidden_states=sample, temb=emb)

            down_block_res_samples += res_samples

        # 4. mid
        sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states)  ### text-embedding fed into mid-block

        # 5. up
        for i, upsample_block in enumerate(self.up_blocks):
            is_final_block = i == len(self.up_blocks) - 1

            res_samples = down_block_res_samples[-len(upsample_block.resnets) :]
            down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]

            # if we have not reached the final block and need to forward the
            # upsample size, we do it here
            if not is_final_block and forward_upsample_size:
                upsample_size = down_block_res_samples[-1].shape[2:]

            if hasattr(upsample_block, "attentions") and upsample_block.attentions is not None:
                sample = upsample_block(
                    hidden_states=sample,
                    temb=emb,
                    res_hidden_states_tuple=res_samples,
                    encoder_hidden_states=encoder_hidden_states,  ### text-embedding fed into up-block
                    upsample_size=upsample_size,
                )
            else:
                sample = upsample_block(
                    hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
                )

At each step, the UNet outputs a noise prediction; the scheduler uses it to produce a new, slightly less noisy latent, which we feed back into the UNet until the for-loop ends. Every time a new latent enters the UNet, the text embeddings are fed in alongside it again in the down-blocks, mid-block, and up-blocks.
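
As a rough sketch of what one loop iteration looks like (following the Deep Dive notebook’s pattern; latents, text_embeddings, guidance_scale, and scheduler are assumed to be set up as in that notebook):

latent_model_input = torch.cat([latents] * 2)                        # unconditional + text-conditioned copies
latent_model_input = scheduler.scale_model_input(latent_model_input, t)
with torch.no_grad():
    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)  # classifier-free guidance
latents = scheduler.step(noise_pred, t, latents).prev_sample         # next, slightly less noisy latent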

Once the for-loop ends, we exit B, and feed the final latent to the decoder block (which is after point B).

def latents_to_pil(latents):
    # batch of latents -> list of images
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images

Finally, here we get our unique, fantastic, amazing, one-of-a-kind, new image.

2 Likes

It’s hard to assist when you only show the error and not the code causing the error, or even better, the whole notebook (one of the downsides of running locally).

Sorry to hear it consumed so much time. It would be useful to see a summary of what you’ve tried. (Sidebar: in my own experience, sometimes while writing that summary I spot something I missed that turns out to be the keystone of the problem.)

Speaking very generally, it can feel daunting to troubleshoot a huge unfamiliar notebook. A useful approach is, rather than troubleshooting the notebook as a whole, to break out the thing you have a problem with. Even when the error is in the first few cells, the rest of the code is distracting. For example, considering your problem with importing a particular numpy function, you could do the following…

  1. Search for: numpy sqrt smallest example import

  2. Find a small example like this

import numpy as geek
arr1 = geek.sqrt([1, 4, 9, 16])
print("square-root of an array1  : ", arr1)

Now if that has a problem, it’s easier to focus yourself and the community on the issue. It’s a lot less code to post and for readers to parse.

My other advice is… an old engineering adage… if you can’t solve the problem, change the problem. Although you want to run it locally, try running it in another environment like a cloud server. If it works, you have a point of comparison between environments. If it fails, the failure is more visible and easier for the community to assist with.

3 Likes

The code that causes the error is the import shown at the top:
from diffusers import StableDiffusionPipeline
and after several layers ends with
AttributeError: module 'numpy' has no attribute 'sqrt'

Today I see that every function that is supposed to be in numpy gives this same attribute error. I can’t even display sys.path (it gives a KeyError when I try to print it). So something must be very screwed up.
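
A quick sanity check in situations like this is to confirm which numpy is actually being imported, in case a stray file or broken install is shadowing the real package (a general diagnostic, not something specific to this notebook):

import numpy
print(numpy.__file__)     # path of the numpy module actually being imported
print(numpy.__version__)  # fails or looks odd if the install is broken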

To change the problem, I am going to delete my environments and micromamba, and start over.

1 Like

I don’t think I’ve ever used conda-forge as the channel for numpy. Maybe just omit that part? You’ll want to uninstall numpy first.

Edit: Now that I think about it more, the fact that you’re installing it individually like this to begin with is a sign to me that you may be making life more difficult than it needs to be. Here’s what works for me:

My guess is you’d be best off creating a fresh conda environment for this at this point.

Here are the lesson notes. I’ve also added them to the top post.

6 Likes

I ended up creating a video where I walk through the environment rather than a blog post. Would love to hear your thoughts! Fastai .devcontainer Environment Creation - YouTube

6 Likes

Hi Jason. I was able to get the notebook running by following your simplifying suggestions. I used conda in the end; micromamba and mamba kept reporting version conflicts.

Thank you! Now I’ll stay up all night making astonishing pictures.

4 Likes

Awesome!

I added a PR here Adding small section on enable attention slicing by kevinbird15 · Pull Request #12 · fastai/diffusion-nbs (github.com)

Is this the type of PR you were thinking of? I added a section that explains enable_attention_slicing and a commented-out line for somebody to use.
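
For anyone reading along, the call itself is a one-liner on an existing pipeline (a sketch, assuming pipe was created with StableDiffusionPipeline.from_pretrained as earlier in the thread):

pipe.enable_attention_slicing()     # compute attention in slices to reduce peak VRAM usage, at a small speed cost
# pipe.disable_attention_slicing()  # switch it back off if you have memory to spare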

2 Likes

Thanks for making the video. I’m going to watch it again and try this later on my Win+WSL box. I need to read up on the config JSON files etc. too, as I don’t use VSCode as much as I should. But for Windows/WSL-based hosts I think this would separate different projects nicely!

1 Like

Here is a link to my devcontainer if you want to use it as a reference! Also, feel free to reach out if you run into any issues with your setup.

1 Like

I was creating GIFs based on @johnowhitaker’s video 9a (image to image), looking at the series produced by starting the denoising at different steps, going from the purely prompted picture towards a given real photo. I noticed that for 50 steps (guidance 8) there was always an apparent degradation of the final result between starting from step 42 and starting at step 46. The image below shows step 10 (just to give an idea of where it was coming from) along with steps 42, 44, 46 and 48. Starting at step 46 always shows quite heavy noise compared to starting at step 42 or 48. I was trying to get my head around this. Is it that the latent has more noise added than the remaining steps can clean up, compared to the other start points?
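
For context, a rough sketch of what “starting at step N” means in that image-to-image setup (assuming init_latents, unet, and scheduler from the notebook; the denoising loop then only runs from start_step onward):

start_step = 46
noise = torch.randn_like(init_latents)
# add the amount of noise corresponding to the chosen start timestep, then denoise from there
latents = scheduler.add_noise(
    init_latents, noise, timesteps=torch.tensor([scheduler.timesteps[start_step]])
)
for i, t in enumerate(scheduler.timesteps):
    if i >= start_step:
        ...  # the usual UNet prediction + scheduler.step, as in the loop sketched earlier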

I’m partway through watching lesson 10; maybe I’ll get a hint from that.

2 Likes

This is pretty strange! I’d expect very little change from step 42 to the end, and definitely not that extra noise. Are you sure the images are ordered correctly? If so, I’ll try to re-create this tomorrow and debug :slight_smile:

1 Like

Yes, I’ve produced around 25 GIFs automatically and noticed this towards the end of all of them. The images are numbered as they are saved, based on the start_step, so there’s not much room for error. I’ll double-check my notebook against your original to make sure I haven’t introduced something odd.

1 Like

And to be clear, this isn’t saving the image at different steps; it is starting from a different start_step each time, and these are the final images, shown as a sequence. Here is a full single set:

Yeah that looks great to me. I won’t be able to approve it myself but that’s what I’d advocate to go in there. Thanks!

1 Like

It was awesome

1 Like

Wow really interesting

I am still obsessed with making amazing pictures, even while trying to keep up with the lessons.

I thought it would be great to be able to regenerate a particularly wonderful picture, rather than lose it forever to the chaos of randomness. Often I only appreciate the picture later on and it’s lost by then. Here’s a way to recreate them.

I was not able to find a way to extract the current random seed from CUDA, without also resetting the seed to “a non-deterministic random number”. (set_rng_state(x) works only for the CPU.) There ought to be a way to get the current seed or RNG state, IMO.

So, before generating an image,

gSeed = torch.seed()     #Sets and returns the seed
gPrompt = prompt
pipe(prompt).images[0]   #Generate the image

.
.
.
to recreate the image,

torch.manual_seed(gSeed)
prompt = gPrompt
pipe(prompt).images[0]

You could save the history of seeds and prompts if you want to get fancier.
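
A minimal sketch of that fancier version (hypothetical helper names; pipe and torch set up as above):

history = []

def generate(prompt):
    seed = torch.seed()             # set and record a fresh seed
    history.append((seed, prompt))
    return pipe(prompt).images[0]

def regenerate(seed, prompt):
    torch.manual_seed(seed)         # restore a recorded seed
    return pipe(prompt).images[0]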

Happy psychedelic astronauts everyone.
:slightly_smiling_face:

1 Like