Based on my understanding so far (I’m new to this too), I think the text/output embeddings are fed in at point B, since point B is where the Unet resides.
Using the “Stable Diffusion Deep Dive” notebook (the “The Autoencoder” section) as an example:
At point A, we convert the image to a latent using the VAE encoder:
def pil_to_latent(input_im):
    # Single image -> single latent in a batch (so size 1, 4, 64, 64)
    with torch.no_grad():
        latent = vae.encode(tfms.ToTensor()(input_im).unsqueeze(0).to(torch_device)*2-1) # Note scaling
    return 0.18215 * latent.latent_dist.sample()
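For instance, assuming input_image is any 512x512 RGB PIL image (the file name below is just a placeholder, not from the notebook), this returns a latent of shape (1, 4, 64, 64):

input_image = Image.open("my_picture.jpg").resize((512, 512))  # placeholder path
latents = pil_to_latent(input_image)
print(latents.shape)  # torch.Size([1, 4, 64, 64])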
This will output the latent. Next, we feed the latent into the Unet residing at B. There the latent goes through a for-loop that denoises it over a certain number of time-steps:
for i, t in tqdm(enumerate(scheduler.timesteps)):
and at each iteration the latent is fed into the Unet:
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]
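The text_embeddings passed in here were produced earlier from the prompt, using the CLIP tokenizer and text encoder. Roughly (a sketch along the lines of the notebook and the diffusers text-to-image example, not copied verbatim; the prompt string is just an example):

prompt = ["A watercolor painting of an otter"]  # any prompt works here
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]  # shape (1, 77, 768) for SD v1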
Inside the Unet, the text embeddings are fed in repeatedly in the down-blocks, mid-block, and up-blocks (refer to the Hugging Face diffusers UNet2DConditionModel code here). I’ve annotated where the text embeddings are fed in using three hashes ###:
# 3. down
down_block_res_samples = (sample,)
for downsample_block in self.down_blocks:
    if hasattr(downsample_block, "attentions") and downsample_block.attentions is not None:
        sample, res_samples = downsample_block(
            hidden_states=sample,
            temb=emb,
            encoder_hidden_states=encoder_hidden_states,  ### text-embedding fed into down-block
        )
    else:
        sample, res_samples = downsample_block(hidden_states=sample, temb=emb)

    down_block_res_samples += res_samples
# 4. mid
sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states)  ### text-embedding fed into mid-block
# 5. up
for i, upsample_block in enumerate(self.up_blocks):
    is_final_block = i == len(self.up_blocks) - 1

    res_samples = down_block_res_samples[-len(upsample_block.resnets) :]
    down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]

    # if we have not reached the final block and need to forward the
    # upsample size, we do it here
    if not is_final_block and forward_upsample_size:
        upsample_size = down_block_res_samples[-1].shape[2:]

    if hasattr(upsample_block, "attentions") and upsample_block.attentions is not None:
        sample = upsample_block(
            hidden_states=sample,
            temb=emb,
            res_hidden_states_tuple=res_samples,
            encoder_hidden_states=encoder_hidden_states,  ### text-embedding fed into up-block
            upsample_size=upsample_size,
        )
    else:
        sample = upsample_block(
            hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
        )
At each step, the Unet outputs a noise prediction (noise_pred above); the scheduler uses it to compute the next, slightly less noisy latent, which is fed back into the Unet on the next iteration until the for-loop ends. Every time a new latent enters the Unet, the text embeddings are fed alongside it again in the down-blocks, mid-block, and up-blocks.
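Putting the pieces together, the sampling loop around that unet call looks roughly like this (a simplified sketch in the spirit of the notebook’s loop, not copied verbatim: guidance_scale is assumed to have been set earlier, e.g. to 7.5, text_embeddings is assumed to already hold the unconditional and prompt embeddings concatenated, and some scheduler-specific scaling details are omitted):

for i, t in tqdm(enumerate(scheduler.timesteps)):
    # two copies of the latent: one for the empty prompt, one for the actual prompt
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)

    # predict the noise residual, with the text embeddings fed in alongside the latent
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings)["sample"]

    # classifier-free guidance: push the prediction towards the prompt-conditioned estimate
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # the scheduler turns the noise prediction into the next (less noisy) latent
    latents = scheduler.step(noise_pred, t, latents).prev_sample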
Once the for-loop ends, we exit B and feed the final latent to the decoder (which sits after point B):
def latents_to_pil(latents):
    # batch of latents -> list of images
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images
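For example, with latents being the final output of the denoising loop above (the variable names here are mine, not from the notebook):

output_images = latents_to_pil(latents)
output_images[0]  # the decoded 512x512 PIL image for the single latent in the batch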
Finally, here we get our unique, fantastic, amazing, one-of-a-kind, new image.