The ‘why predict the noise rather than the image’ discussion yesterday is one that is happening in the literature too. The ‘v objective’ here is getting popular as it seems to help stabilize training, and additional framings are being explored too. (This screenshot is from the ‘progressive distillation’ paper which Jeremy showed as the new work bringing down the required number of sampling steps - explainer video here if you want to hear me waffle through an attempted summary).
Give it a few days or weeks and it’s a safe bet the diffusers library will have something like a ‘model_prediction_type’ parameter for epsilon or v or x or … to handle these different parametrizations.
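For reference, the three prediction targets are related linearly, so they are interchangeable at inference time. Here's a minimal numpy sketch using the alpha/sigma notation from the progressive distillation paper (assuming a variance-preserving schedule, i.e. alpha² + sigma² = 1):

```python
import numpy as np

# One timestep's signal/noise coefficients, with alpha^2 + sigma^2 = 1
alpha, sigma = np.sqrt(0.7), np.sqrt(0.3)

x0  = np.random.randn(4)          # clean image
eps = np.random.randn(4)          # the injected noise
x_t = alpha * x0 + sigma * eps    # noised input the model sees

# v-parametrization from the progressive distillation paper
v = alpha * eps - sigma * x0

# Any one predicted quantity can be converted into the others:
eps_from_v = sigma * x_t + alpha * v
x0_from_v  = alpha * x_t - sigma * v

assert np.allclose(eps_from_v, eps)
assert np.allclose(x0_from_v, x0)
```

So whichever target the model is trained on, the sampler can recover the others with a couple of multiplies, which is why a single `prediction_type`-style switch is enough.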
If I remember correctly the HF stable diffusion pipeline has an NSFW/disturbing-images filter on. If you use a prompt that for any reason the filter thinks might be inappropriate, it will return a black image (in the original SD it would rickroll you). Keep in mind that, like any filter for this kind of thing, it can be overly aggressive, so it is not always obvious what is going on (e.g. last time I ran into this it was filtering out anything related to “dead”, so a cat leaving a dead mouse at the door was blacked out).
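If you want to tell a filtered result apart from a sampling failure, the simplest check is whether the image is literally all-black. A quick sketch (the `nsfw_content_detected` attribute mentioned in the comment is from my memory of the diffusers pipeline output, so treat that part as an assumption):

```python
import numpy as np

def is_blacked_out(img) -> bool:
    """True if every pixel is zero -- the tell-tale sign of the safety filter."""
    return not np.asarray(img).any()

# With the real pipeline you could also inspect the output flag directly, e.g.:
#   out = pipe(prompt)
#   out.nsfw_content_detected   # list of bools, one per image (diffusers API)

black = np.zeros((64, 64, 3), dtype="uint8")                     # filtered result
noisy = (np.random.rand(64, 64, 3) * 255).astype("uint8")        # an actual sample
print(is_blacked_out(black), is_blacked_out(noisy))
```

A genuinely noisy output from too few sampling steps will still have non-zero pixels, so this distinguishes the two cases.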
Is it possible that it would trigger for “a photograph of an astronaut riding a horse”, and that that's why the output for steps 1 and 2 is black while for step 3 it's a noisy image?
Well, I can’t argue much here, but that wording might produce some NSFW results. For this kind of thing, I suggest posting (or searching) on the StableDiffusion subreddit.
lol and I thought stable diffusion was going to be the last lesson. Spent the last 3-6 months trying to get my head around it and JH does it in one hour. (╯°□°)╯︵ ┻━┻
Anyway, here are my notes. J: refers to JH, and my own questions are marked S:.
Let me know if there is a better place to post these notes as I plan to do this for the rest of the lectures too.
Yes. The results seem reasonable maybe up to step 64. The Hugging Face pipeline uses 50 as the default.
For very large step counts, around 512, there was noise similar to step 2. Again at num_steps 768, there was some image, very similar in appearance to num_steps 256. This happened for all the cases I tried.
I have the same feeling. I stopped at 100; here are my results:
# `pipe` is the StableDiffusionPipeline set up earlier in the notebook
torch.manual_seed(1024)
num_rows, num_cols = 5, 5                  # 5x5 grid of results
steps = list(range(2, 100, 4))             # 25 step counts: 2, 6, ..., 98
prompt = "a photograph of an astronaut riding a horse"
# fastcore's concat flattens the per-call image lists into one list
images = concat(pipe(prompt, num_inference_steps=s, guidance_scale=7).images for s in steps)
But when you are predicting the noise, you are predicting the exact noise that was introduced artificially.
OTOH, if you want to draw the digit, it can have multiple possible outlines, and all of them will be correct.
The noise, for one picture, has one correct answer.
The digit has many.
Edit: when we are talking about one image, the model does have only one digit. But we want variations. I think that if we tried to make a model that predicts the digits as opposed to the noise, it would reduce the variation in the generated digits.
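To make the “one correct answer” point concrete: given the noised image, the clean image, and the schedule coefficients, the injected noise is uniquely determined, so the epsilon target is never ambiguous. A tiny numpy sketch with made-up coefficients:

```python
import numpy as np

alpha, sigma = 0.8, 0.6               # made-up schedule coefficients
x0  = np.random.randn(8)              # one clean "digit"
eps = np.random.randn(8)              # the noise we injected
x_t = alpha * x0 + sigma * eps        # what the model is shown

# Invert the forward process: for this x_t and this x0
# there is exactly one eps that could have produced it
eps_recovered = (x_t - alpha * x0) / sigma
assert np.allclose(eps_recovered, eps)
```

The "many correct digits" side has no such inversion: nothing pins down a unique x0 from x_t alone, which is the asymmetry the post is describing.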
Have you tried using wrong spellings to generate images that aren't permitted?
Some people tried that with Dall-E 2 and were able to generate images that weren't officially supported at the time.
Photorealistic depiction of public figures was not allowed by Dall-E before. But giving a prompt like “Bbarackk Oobana sitting on a bench” returned positive results.
The model is so large and trained on so many images that it learns the mapping between wrong spellings and the correct objects as well. The filters, put in place in an ad-hoc manner, can’t catch these.
(I sometimes get better Google search results by deliberately using wrong spellings. SEO folks don’t take this into account, I guess.)