Stable diffusion: resources and discussion

Just read this paper yesterday. The idea of treating the “noise” that you use in diffusion models as a latent vector is a very surprising but powerful concept in my opinion. Lots to explore there…
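
One concrete way to play with that idea (my own sketch, not from the paper): treat the initial Gaussian noise as the latent and interpolate between two draws of it, decoding each interpolant with the same prompt:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

    # Two "latent vectors": just two draws of the initial noise.
    shape = (1, pipe.unet.config.in_channels, 64, 64)
    g = torch.Generator("cuda")
    z0 = torch.randn(shape, generator=g.manual_seed(0), device="cuda")
    z1 = torch.randn(shape, generator=g.manual_seed(1), device="cuda")

    # Linear blend for simplicity (slerp usually behaves better for Gaussians).
    for alpha in (0.0, 0.5, 1.0):
        z = (1 - alpha) * z0 + alpha * z1
        image = pipe("a watercolor fox", latents=z).images[0]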

3 Likes

Another interesting application of diffusion to create different 3D views of an object from a single image + the object’s pose:
https://3d-diffusion.github.io/

2 Likes

It would be nice to generate a mesh from a 3D point-cloud input with SD.

What’s the current SOTA resource for Stable Diffusion with respect to inference speed? Is there a resource somewhere that tries to keep track of “SOTA” versions of SD for things like inference speed, memory usage, etc.?

This Computerphile video is good:

4 Likes

Really good math explanation of diffusion models:

2 Likes

I plan to work on text generation, and I’m glad to find your work as a starting point. I see a list of references in your repo. Which ones do you recommend reading first for the uninitiated?

1 Like

[2205.14217] Diffusion-LM Improves Controllable Text Generation might be a good starting point, but there are a ton of caveats that come with any approach that manipulates a latent space for text generation (as with GANs/VAEs/normalizing flows).

I plan to add a blog/tutorial in the next week or so as well.

Good luck, and I’m looking forward to hearing more about what you find/come up with!

2 Likes

Ha! The video mentions the “enhance” meme. Only yesterday on the drive to work I was thinking: the awful thing about this ML denoising and generative upscaling is that it belies all my previous vocal criticism of TV memes where technology can super-“enhance” poor security footage.

1 Like

Random thought: learning about Stable Diffusion, I’m struck by how closely it parallels Michelangelo’s famous quote: “The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.” It can be paraphrased as: “The image is already complete before I start my work. It is already there, I just have to remove the superfluous noise.” :crazy_face:

6 Likes

There are two new VAEs out from Stability AI. These don’t affect the composition of the image, only the final decode step, where the 64x64 latents are scaled up to a 512x512 image. The new decoders give much better results on details such as small faces, eyes, and lettering.

5 Likes

Replace this line:

    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

with:

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
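
For context, here’s a minimal sketch of wiring the swapped-in VAE into a full pipeline (assuming the diffusers StableDiffusionPipeline; the example prompt is mine):

    from diffusers import AutoencoderKL, StableDiffusionPipeline

    # Load the fine-tuned decoder and pass it in place of the default VAE.
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", vae=vae
    ).to("cuda")

    image = pipe("a photo of a corgi on a beach").images[0]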

4 Likes

Or stabilityai/sd-vae-ft-mse, which has been trained for more steps and seems to give better results.

3 Likes

Some things (e.g. hair) can get over-smoothed by this model.

1 Like

I’m sure some of you have spotted Baidu’s latest in the txt2img space: Feng, Zhida, et al. “ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts.” arXiv, 27 Oct. 2022, arXiv:2210.15257. (Thanks, ZoteroBib.)

They tout quality that surpasses SD, DALL-E 2, and even Google’s titanic Imagen. I can’t yet be too specific about the novel math choices they make (if any), but as I (imperfectly) understand them, the three innovations presented make a lot of intuitive sense!

  1. They add part-of-speech tokens to the text encoder. This bootstraps the encoder’s ability to understand how words relate to each other. In retrospect, it’s a bit absurd to expect CLIP/BERT to learn all of syntax!

  2. They use an off-the-shelf object detector to identify prominent objects in the picture, then up-weight those regions when training the diffusion model (the unet, presumably?). I’m not clear on exactly how, but apparently it involves adding a weighting term to the loss function (see the sketch after this list).

  3. Finally, another practical improvement: instead of asking the unet to understand the whole diffusion process from noise to final image, let’s create a whole family of unets to master sequential stages of the diffusion process. They use n=10 blocks in the paper.
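
Here’s the loss-weighting sketch promised in #2. To be clear, this is my guess at the mechanics, not the paper’s actual code; the mask construction, weight value, and names are all assumptions:

    import torch.nn.functional as F

    def weighted_denoising_loss(noise_pred, noise, object_mask, object_weight=2.0):
        # object_mask: same shape as the latents, 1.0 inside regions the
        # detector flagged as objects, 0.0 elsewhere (e.g. rasterized boxes).
        # Pixels inside objects contribute object_weight times as much.
        weights = 1.0 + (object_weight - 1.0) * object_mask
        per_pixel = F.mse_loss(noise_pred, noise, reduction="none")
        return (weights * per_pixel).mean()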

For us as downstream consumers and manipulators of these models, these techniques aren’t really accessible, as they involve training models from scratch. Maybe some distillation or fine-tuning could achieve #3? Also, when all is said and done, the appendix shows that people still frequently (though <50%) prefer DALL-E or SD output…but it’s hard to disambiguate training data, processing, and algorithm.

Whatever the case, this figure really is exciting – you can see how prompts that are more novel are much more successful with these knowledge-boosting strategies (#1, #2).

9 Likes

I switched to the new VAE and the new 1.5 model from Runway, since those do appear to give better results. I haven’t seen weird faces or limbs since then, but then again, I don’t make a lot of images with people. So I’ll have to wait and see how it goes…

3 Likes

Stable diffusion multiplayer: Stable Diffusion Multiplayer - a Hugging Face Space by huggingface-projects

Everyone generating parts of a big collage - great fun :slight_smile:

3 Likes

Replying to myself, as this is highly relevant… NVIDIA just released their own slice-the-unet-into-staged-experts paper: [2211.01324] eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
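
If it works like the ERNIE-ViLG 2.0 scheme above, the routing itself is conceptually tiny. A minimal sketch, assuming a uniform split of the timestep range (the function and names here are mine, not from either paper):

    def pick_expert(experts, t, num_train_timesteps=1000):
        # experts: a list of UNets, each owning a contiguous slice of the
        # diffusion timesteps (ERNIE-ViLG 2.0 uses 10 such slices).
        idx = min(int(t * len(experts) / num_train_timesteps), len(experts) - 1)
        return experts[idx]

    # At each sampling step, route to the expert that owns timestep t:
    # noise_pred = pick_expert(experts, t)(latents, t, encoder_hidden_states=text_emb).sample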

They also use an ensemble of text encoders, concatenating the embeddings AFAICT! A neat and practical idea to give the model more language signal.
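
If the concatenation reading is right, the plumbing might look roughly like this. A sketch only: the choice of encoders (CLIP + T5) and the projection step are my assumptions, not details confirmed from the paper:

    import torch
    from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

    prompt = "a watercolor painting of a fox"

    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    t5_tok = T5Tokenizer.from_pretrained("t5-small")
    t5_enc = T5EncoderModel.from_pretrained("t5-small")

    clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state  # (1, n1, 768)
    t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state        # (1, n2, 512)

    # The hidden sizes differ, so project one to match before concatenating
    # along the token dimension; cross-attention then sees both token sets.
    proj = torch.nn.Linear(t5_emb.shape[-1], clip_emb.shape[-1])
    text_emb = torch.cat([clip_emb, proj(t5_emb)], dim=1)  # (1, n1 + n2, 768)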

There’s also something about having “spatial control” over where your text embeddings get expressed – I need to dig in more to understand what that’s about.

5 Likes

Has there been any work on stable diffusion for graphs?

Please use NEGATIVE PROMPTS with Stable Diffusion v2.0
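
In diffusers this is just the negative_prompt argument. A minimal sketch (the model ID and the example prompts are my choices):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
    ).to("cuda")

    # Things listed in negative_prompt are steered *away* from during sampling.
    image = pipe(
        "portrait photo of an astronaut",
        negative_prompt="blurry, low quality, deformed hands, extra fingers",
    ).images[0]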