Is it possible that it would trigger for “a photograph of an astronaut riding a horse”, and that the output for steps 1 and 2 is black while step 3 is a noisy image?
Well, can’t argue much here, but that wording might produce some NSFW results. For this kind of thing, I suggest posting (or searching) on the StableDiffusion subreddit.
lol and I thought stable diffusion was going to be the last lesson. Spent the last 3-6 months trying to get my head around it and JH does it in one hour. (╯°□°)╯︵ ┻━┻
Anyway, here are my notes. JH: refers to Jeremy Howard, and my own questions are marked S:.
Let me know if there is a better place to post these notes as I plan to do this for the rest of the lectures too.
Yes. The results seem reasonable maybe up to step 64. The Hugging Face pipeline uses 50 as the default.
For very large step counts, around 512, there was noise similar to step 2. Then at num_steps 768 there was some image again, very similar in appearance to num_steps 256. This happened for all the cases I tried.
I have the same feeling. I stopped at 100; here are my results:
import torch
from fastcore.all import concat  # flattens the list of image lists from each call

torch.manual_seed(1024)
num_rows, num_cols = 5, 5
l = list(range(2, 100, 4))  # 25 step counts: 2, 6, ..., 98
prompt = "a photograph of an astronaut riding a horse"
# pipe is the StableDiffusionPipeline created earlier in the notebook
images = concat(pipe(prompt, num_inference_steps=s, guidance_scale=7).images for s in l)
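To lay those out as the 5x5 grid, a small helper along these lines works (a sketch assuming the outputs are equally sized PIL images; image_grid is just a name I made up, not something from the notebook):

from PIL import Image

def image_grid(imgs, rows, cols):
    # paste each PIL image onto one large canvas, row-major order
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, ((i % cols) * w, (i // cols) * h))
    return grid

image_grid(images, num_rows, num_cols)  # display the 25 results in the notebook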
But when you are predicting the noise, it is the noise that was introduced artificially.
OTOH, if you want to draw the digit, it can have multiple possible outlines, and all of them would be correct.
The noise, for one picture, has one correct answer.
The digit has many.
Edit: when we are talking about one image, the model does have only one digit. But we want variations. I think that if we try to make a model that predicts the digit rather than the noise, it will reduce the variation in the generated digits.
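To make the difference concrete, here’s a rough sketch of one training step under the two possible targets; model, x0, alpha_bar, and the predict_noise flag are my own placeholders, not anything from the lesson:

import torch
import torch.nn.functional as F

def training_step(model, x0, alpha_bar, predict_noise=True):
    # pick a random timestep per image and sample the artificially introduced noise
    alpha_bar = alpha_bar.to(x0.device)
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    # forward process: mix the clean digit with Gaussian noise
    xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    pred = model(xt, t)
    # the target is either the exact noise we added (one correct answer per image)
    # or the clean digit itself (many plausible answers once the noise is heavy)
    target = eps if predict_noise else x0
    return F.mse_loss(pred, target)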
Have you tried using wrong spellings to generate images that aren’t permitted?
Some people tried that with DALL-E 2, and they were able to generate images that weren’t officially supported at the time.
Photorealistic depiction of public figures was not allowed by DALL-E before. But giving a prompt like “Bbarackk Oobana sitting on a bench” returned positive results.
The model is so large and trained on so many images that it learns the mapping between wrong spellings and correct objects as well. The filters, put in place in an ad-hoc manner, can’t catch these.
(I sometimes get better Google search results by deliberately using wrong spellings. SEO people don’t take this into account, I guess.)
The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.
Using @johnowhitaker’s excellent example of showing the latent space manifold (Lesson 9A), adding normally distributed noise is equivalent to allowing these latent variables to “diffuse” away from the underlying data distribution and toward random noise.
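A toy version of that forward process might look like this (the beta value and step count are made up for illustration, not taken from the paper):

import torch

def forward_diffusion(x, n_steps=1000, beta=0.02):
    # repeatedly mix in a little Gaussian noise so the data slowly
    # "diffuses" away from its original distribution toward pure noise
    xs = [x]
    for _ in range(n_steps):
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
        xs.append(x)
    return xs  # xs[0] is the data, xs[-1] is close to an isotropic Gaussian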
If anyone has run into CUDA out-of-memory issues on their stable_diffusion.ipynb notebook, I’ve put up a quick and dirty fix as a pull request here: https://github.com/fastai/diffusion-nbs/pull/5. This at least helped my 12GB 3080 Ti GPU. I’ll be looking at other ways to improve on this next (mostly based on this optimization recommendations page: https://huggingface.co/docs/diffusers/optimization/fp16).
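For reference, the kind of changes that optimization page suggests look roughly like the sketch below; I’m assuming the CompVis/stable-diffusion-v1-4 weights from the notebook, and the exact savings will depend on your setup:

import torch
from diffusers import StableDiffusionPipeline

# load the weights in half precision to roughly halve GPU memory use
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# compute attention in slices rather than all at once; a bit slower,
# but keeps peak memory low enough for ~12GB cards
pipe.enable_attention_slicing()

image = pipe("a photograph of an astronaut riding a horse").images[0]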
A negative prompt should ideally give a cluster further away from the mentioned term. It seems to drag other embeddings along as well. Does it have something to do with the artist? XD
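For anyone who wants to experiment, recent diffusers versions accept the negative text directly in the pipeline call (the prompt strings here are just examples):

# guide generation away from whatever the negative prompt embeds to
out = pipe(
    "a photograph of an astronaut riding a horse",
    negative_prompt="blurry, low quality",
    guidance_scale=7.5,
).images[0]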
Great first lecture! I saw in this paper, [2208.09392] Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise (linked from the HN stable diffusion thread), that there is no requirement for the corruption to be Gaussian noise or even random; it can be any operation that degrades the image, paired with one that restores it. Wondering if anyone has ideas on the advantages/disadvantages of the different approaches and whether this could be an interesting direction to explore.
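As a toy illustration of what a non-Gaussian forward operator could be, a time-dependent blur can stand in for adding noise; the schedule and kernel-size choice here are my own, not from the paper:

import torchvision.transforms.functional as TF

def degrade(x, t, n_steps=1000, max_sigma=10.0):
    # Cold-Diffusion-style deterministic degradation: blur more as t grows,
    # instead of mixing in random Gaussian noise
    sigma = max_sigma * (t + 1) / n_steps
    k = 2 * round(3 * sigma) + 1  # odd kernel size covering roughly 3 sigma
    return TF.gaussian_blur(x, kernel_size=k, sigma=sigma)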
Here’s an interesting observation (new to me as a Stable Diffusion n00b, at least): I get different results running on a 3080 Ti GPU versus a 1080 Ti GPU. Everything else is the same: the notebook code and the computer it’s running on.
I think it’s just small numerical differences between the two cards that get compounded over the iterations, but it’s interesting to see the effect.
Edited: For accuracy, clarity. I misread what was going on a bit.