Lesson 9 official topic


The ‘why predict the noise rather than the image’ discussion yesterday is one that is happening in the literature too. The ‘v objective’ here is getting popular as it seems to help stabilize training, and additional framings are being explored too. (This screenshot is from the ‘progressive distillation’ paper, which Jeremy showed as the new work bringing down the required number of sampling steps - explainer video here if you want to hear me waffle through an attempted summary.)
Give it a few days or weeks and it’s a safe bet the diffusers library will have something like a ‘model_prediction_type’ parameter for epsilon or v or x or … to handle these different parametrizations.
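If you’re curious how these parametrizations relate, here’s a toy sketch (my own numbers, following the v definition from the progressive distillation paper):

import torch

# toy schedule coefficients with alpha_t**2 + sigma_t**2 == 1
alpha_t, sigma_t = 0.6, 0.8
x0 = torch.randn(4, 3, 64, 64)         # clean image
eps = torch.randn_like(x0)             # noise
z_t = alpha_t * x0 + sigma_t * eps     # noised latent

v = alpha_t * eps - sigma_t * x0       # the "v" target mixes image and noise
x0_rec = alpha_t * z_t - sigma_t * v   # recovers the image exactly
eps_rec = sigma_t * z_t + alpha_t * v  # recovers the noise exactly
assert torch.allclose(x0_rec, x0, atol=1e-5)
assert torch.allclose(eps_rec, eps, atol=1e-5)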

10 Likes

I was wondering if Jeremy kept the OneNote he used throughout the lecture; if so, it might be nice to share it. It would act as a good aide-mémoire.

2 Likes

If I remember correctly, the HF stable diffusion pipeline has an NSFW/disturbing-images filter on. If you use a prompt that, for any reason, the filter thinks might be inappropriate, it will return a black image (in the original SD it would rickroll you :joy: ). Keep in mind that, like any filter for this kind of thing, it can be overly aggressive, so it is not always obvious what is going on (e.g. last time I ran into this it was filtering out anything related to “dead”, so a cat leaving a dead mouse on the door was blacked out).

5 Likes

I used prompt = “a photograph of an astronaut riding a horse”.
Steps 1 and 2 give a black image, step 3 is a noisy image…

Pretty sure that’s SFW :smiley: Then I have no idea.

1 Like

The safety checker in the diffusers library has a lot of false positives for NSFW detection.
You can disable it if you want:

# replace the checker with a no-op that passes images through unchanged
pipe.safety_checker = lambda images, **kwargs: (images, False)
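Depending on your diffusers version, you may also be able to skip loading the checker entirely; a minimal sketch, assuming from_pretrained accepts safety_checker=None:

from diffusers import StableDiffusionPipeline

# assumption: the installed diffusers version accepts safety_checker=None
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", safety_checker=None
)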
6 Likes

Looking at the screenshot from @miwojc, I don’t see “NSFW” written there. IIRC doesn’t it mention “NSFW” in the result, or am I imagining this? :rofl:

1 Like

Getting rickrolled: a feature, not a bug.

4 Likes

Tried an experiment based on @miwojc’s observation. I also got curious to check what happens with a very large number of steps.

  1. These were the num_inference_steps tried: 1, 2, 3, 32, 64, 128, 256, 512, 768, 1024 (with the same prompt and seed as @miwojc’s attempt).
  2. Also tried changing the seed to see if it has any impact on the noise (used a random seed).
  3. Finally, also tried a different prompt (prompt2 = “Labrador in the style of Vermeer”, seed=1000).

For the second prompt, step 3 also generated a black image.
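A minimal sketch of the sweep, assuming pipe is the StableDiffusionPipeline loaded earlier (diffusers’ generator argument fixes the seed per run):

import torch

# assumed: `pipe` is the StableDiffusionPipeline from earlier in the thread
prompt2 = "Labrador in the style of Vermeer"
for n in [1, 2, 3, 32, 64, 128, 256, 512, 768, 1024]:
    gen = torch.Generator("cuda").manual_seed(1000)  # same seed every run
    image = pipe(prompt2, num_inference_steps=n, generator=gen).images[0]
    image.save(f"labrador_{n:04d}.png")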

5 Likes

Is it possible that it would trigger for “a photograph of an astronaut riding a horse”, so that steps 1 and 2 give a black image and step 3 a noisy image?

Well, can’t argue much here, but that wording might produce some NSFW results. For this kind of thing, I suggest posting (or searching) on the StableDiffusion subreddit.

If anyone is having issues running the Textual Inversion script, apply these changes.

1 Like

lol and I thought stable diffusion was going to be the last lesson. Spent the last 3-6 months trying to get my head around it and JH does it in one hour. (╯°□°)╯︵ ┻━┻

Anyway, here are my notes. J: refers to JH, and my questions are marked S:.

Let me know if there is a better place to post these notes as I plan to do this for the rest of the lectures too.

7 Likes

Turning the safety checker off generates a noisy image instead of a black image at step 2.

8 Likes

Have you had the feeling that increasing the number of steps stops helping after a certain point?

1 Like

Yes. The results seem reasonable maybe up to step 64. The Hugging Face pipeline uses 50 as the default.

For very large step counts, around 512, there was noise similar to step 2. Then at num_steps 768, there was some image again, very similar in appearance to num_steps 256. This happened for all the cases I tried.

I wonder what the mechanism behind this is :slight_smile:

I have the same feeling. I stopped at 100; here are my results:

import torch
from fastcore.all import concat  # fastcore's concat, as used in the course notebooks

torch.manual_seed(1024)
num_rows, num_cols = 5, 5
l = list(range(2, 100, 4))  # 25 step counts: 2, 6, 10, ..., 98
prompt = "a photograph of an astronaut riding a horse"
images = concat(pipe(prompt, num_inference_steps=s, guidance_scale=7).images for s in l)
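The num_rows/num_cols presumably feed a grid helper; a minimal reconstruction (image_grid here is my own sketch, not necessarily the poster’s exact helper):

from PIL import Image

def image_grid(imgs, rows, cols):
    # paste the PIL images into a single rows x cols sheet
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

image_grid(images, num_rows, num_cols)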

1 Like

I think both should work.

But when you are predicting the noise, the target is exactly the noise that was introduced artificially.

OTOH, if you want to draw the digit, it can have multiple possible outlines, and all of them will be correct.

The noise, on one picture, has one correct answer.

The digit, has many.

Edit: when we are talking about one image, the model does have only one digit. But we want variations. I think, if we try to make a model that predicts the digits as opposed to the noise, it will reduce the variation in the generated digits.
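A toy sketch of the two training targets (my own stand-in model and numbers, not the course’s actual training loop):

import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the U-Net
x0 = torch.randn(16, 1, 28, 28)   # "clean digits"
eps = torch.randn_like(x0)        # the artificially injected noise
alpha, sigma = 0.6, 0.8           # schedule coefficients at some timestep
xt = alpha * x0 + sigma * eps     # noised input the model sees

loss_eps = F.mse_loss(model(xt), eps)  # noise target: one exact answer per image
loss_x0 = F.mse_loss(model(xt), x0)    # image target: many plausible digits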

1 Like

Have you tried using wrong spellings to generate images that aren’t permitted?

Some people tried that with Dall-E 2, and they were able to generate images that weren’t officially allowed at the time.

Photorealistic depiction of public figures was not allowed by Dall-E before. But giving a prompt like “Bbarackk Oobana sitting on a bench” returned positive results.

The model is so large and trained on so many images that it learns the mapping between wrong spellings and correct objects as well. The filters, added in an ad-hoc manner, can’t catch these.
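One way to see why this can work: the text encoder’s tokenizer just breaks an unfamiliar spelling into subword pieces it already knows. A minimal sketch with the tokenizer SD v1 uses (assuming the transformers library is installed):

from transformers import CLIPTokenizer

# the tokenizer used by Stable Diffusion v1's text encoder
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tok.tokenize("Bbarackk Oobana sitting on a bench"))
# the misspelled name comes out as several subword pieces, which the
# model can still associate with what it saw during training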

(I sometimes get better Google search results by deliberately using wrong spellings. SEO folks don’t take this into account, I guess.)

4 Likes

I would like to ask, and maybe Jeremy will focus on this question in a later lecture:

What puts the word ‘Diffusion’ in the name ‘Diffusion Model’?

What is diffusion in the context of Deep Learning, and how is it used in Diffusion Models?

2 Likes