Stable diffusion: resources and discussion

Just read this paper yesterday. The idea of treating the “noise” that you use in diffusion models as a latent vector is a very surprising but powerful concept in my opinion. Lots to explore there…
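
One concrete way to play with that idea (my own sketch, not from the paper): treat the initial Gaussian noise as the latent and interpolate between two draws of it, decoding each interpolant with the same prompt:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

    # Two "latent vectors": just two draws of the initial noise.
    shape = (1, pipe.unet.config.in_channels, 64, 64)
    g = torch.Generator("cuda")
    z0 = torch.randn(shape, generator=g.manual_seed(0), device="cuda")
    z1 = torch.randn(shape, generator=g.manual_seed(1), device="cuda")

    # Linear blend for simplicity (slerp usually behaves better for Gaussians).
    for alpha in (0.0, 0.5, 1.0):
        z = (1 - alpha) * z0 + alpha * z1
        image = pipe("a watercolor fox", latents=z).images[0]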

3 Likes

Another interesting application of diffusion to create different 3D views of an object from a single image + the object’s pose:
https://3d-diffusion.github.io/

2 Likes

It would be nice to generate a mesh from a 3D point-cloud input with SD.

What’s the current SOTA resource for Stable Diffusion with respect to inference speed? Is there a resource somewhere that tries to keep track of “SOTA” versions of SD for things like inference speed, memory usage, etc.?

This Computerphile video is good:

4 Likes

Really good math explanation of diffusion models:

2 Likes

I plan to work on text generation, and I’m glad to find your work as a starting point. I see a list of references in your repo. Which ones do you recommend reading first for the uninitiated?

1 Like

[2205.14217] Diffusion-LM Improves Controllable Text Generation might be a good starting point, but there are a ton of caveats that come with any approach that manipulates a latent space for text generation (as with GANs/VAEs/normalizing flows).

I plan to add a blog/tutorial in the next week or so as well.

Good luck, and I’m looking forward to hearing more about what you find/come up with!

2 Likes

Ha! The video mentions the “enhance” meme. Only yesterday on the drive to work I was thinking: the awful thing about this ML denoising and generative upscaling is that it belies all my previous vocal criticism of TV memes where technology can super-“enhance” poor security footage.

1 Like

Random thought: learning about Stable Diffusion, I’m struck by how closely it parallels Michelangelo’s famous quote: “The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material.” It can be paraphrased as: “The image is already complete before I start my work. It is already there, I just have to remove the superfluous noise.” :crazy_face:

6 Likes

There are two new VAEs out from Stability AI. These don’t affect the composition of the image, only the final decode step, where the 64x64 latents are scaled up to a 512x512 image. The new decoders give much better results on details such as small faces, eyes, and lettering.

5 Likes

Replace this line:

    vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

with:

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
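
For context, here’s a minimal sketch of wiring the swapped-in VAE into a full pipeline (assuming the diffusers StableDiffusionPipeline; the example prompt is mine):

    from diffusers import AutoencoderKL, StableDiffusionPipeline

    # Load the fine-tuned decoder and pass it in place of the default VAE.
    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", vae=vae
    ).to("cuda")

    image = pipe("a photo of a corgi on a beach").images[0]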

4 Likes

Or stabilityai/sd-vae-ft-mse, which has been trained for more steps and seems to give better results.

3 Likes

Some things (e.g. hair) can get over-smoothed by this model.

1 Like

I’m sure some of you have spotted Baidu’s latest in the txt2img space: Feng, Zhida, et al. “ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts.” arXiv, 27 Oct. 2022, arXiv:2210.15257. (Thanks, ZoteroBib.)

They tout quality that surpasses SD, DALL-E 2, and even Google’s titanic Imagen. I can’t yet be too specific about the novel math choices they make (if any), but as I (imperfectly) understand them, the three innovations presented make a lot of intuitive sense!

  1. They add part-of-speech tokens to the text encoder. This bootstraps the encoder’s ability to understand how words relate to each other. In retrospect, it’s a bit absurd to expect CLIP/BERT to learn all of syntax!

  2. They use an off-the-shelf object detector to identify prominent objects in the picture, then up-weight those regions when training the diffusion model (the unet, presumably?). I’m not clear on exactly how, but apparently it involves adding a weighting term to the loss function (see the sketch after this list).

  3. Finally, another practical improvement: instead of asking the unet to understand the whole diffusion process from noise to final image, let’s create a whole family of unets to master sequential stages of the diffusion process. They use n=10 blocks in the paper.
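
Here’s the loss-weighting sketch promised in #2. To be clear, this is my guess at the mechanics, not the paper’s actual code; the mask construction, weight value, and names are all assumptions:

    import torch.nn.functional as F

    def weighted_denoising_loss(noise_pred, noise, object_mask, object_weight=2.0):
        # object_mask: same shape as the latents, 1.0 inside regions the
        # detector flagged as objects, 0.0 elsewhere (e.g. rasterized boxes).
        # Pixels inside objects contribute object_weight times as much.
        weights = 1.0 + (object_weight - 1.0) * object_mask
        per_pixel = F.mse_loss(noise_pred, noise, reduction="none")
        return (weights * per_pixel).mean()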

For us as downstream consumers and manipulators of these models, these techniques aren’t really accessible, as they involve training models from scratch. Maybe some distillation or fine-tuning could achieve #3? Also, when all is said and done, the appendix shows that people still frequently (though <50%) prefer DALL-E or SD output…but it’s hard to disambiguate training data, processing, and algorithm.

Whatever the case, this figure really is exciting – you can see how prompts that are more novel are much more successful with these knowledge-boosting strategies (#1, #2).

9 Likes

I switched to the new VAE and the new 1.5 model from Runway, since those do appear to give better results. I haven’t seen weird faces or limbs since then, but then again, I don’t make a lot of images with people. So I’ll have to wait and see how it goes…

3 Likes

Stable diffusion multiplayer: Stable Diffusion Multiplayer - a Hugging Face Space by huggingface-projects

Everyone generating parts of a big collage - great fun :slight_smile:

3 Likes

Replying to myself, as this is highly relevant… NVIDIA just released their own slice-the-unet-into-staged-experts paper: [2211.01324] eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
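
If it works like the ERNIE-ViLG 2.0 scheme above, the routing itself is conceptually tiny. A minimal sketch, assuming a uniform split of the timestep range (the function and names here are mine, not from either paper):

    def pick_expert(experts, t, num_train_timesteps=1000):
        # experts: a list of UNets, each owning a contiguous slice of the
        # diffusion timesteps (ERNIE-ViLG 2.0 uses 10 such slices).
        idx = min(int(t * len(experts) / num_train_timesteps), len(experts) - 1)
        return experts[idx]

    # At each sampling step, route to the expert that owns timestep t:
    # noise_pred = pick_expert(experts, t)(latents, t, encoder_hidden_states=text_emb).sample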

They also use an ensemble of text encoders, concatenating the embeddings AFAICT! A neat and practical idea to give the model more language signal.
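
If the concatenation reading is right, the plumbing might look roughly like this. A sketch only: the choice of encoders (CLIP + T5) and the projection step are my assumptions, not details confirmed from the paper:

    import torch
    from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

    prompt = "a watercolor painting of a fox"

    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
    t5_tok = T5Tokenizer.from_pretrained("t5-small")
    t5_enc = T5EncoderModel.from_pretrained("t5-small")

    clip_emb = clip_enc(**clip_tok(prompt, return_tensors="pt")).last_hidden_state  # (1, n1, 768)
    t5_emb = t5_enc(**t5_tok(prompt, return_tensors="pt")).last_hidden_state        # (1, n2, 512)

    # The hidden sizes differ, so project one to match before concatenating
    # along the token dimension; cross-attention then sees both token sets.
    proj = torch.nn.Linear(t5_emb.shape[-1], clip_emb.shape[-1])
    text_emb = torch.cat([clip_emb, proj(t5_emb)], dim=1)  # (1, n1 + n2, 768)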

There’s also something about having “spatial control” over where your text embeddings get expressed – I need to dig in more to understand what that’s about.

5 Likes

Has there been any work on stable diffusion for graphs?

Please use NEGATIVE PROMPTS with Stable Diffusion v2.0
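
In diffusers this is just the negative_prompt argument. A minimal sketch (the model ID and the example prompts are my choices):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
    ).to("cuda")

    # Things listed in negative_prompt are steered *away* from during sampling.
    image = pipe(
        "portrait photo of an astronaut",
        negative_prompt="blurry, low quality, deformed hands, extra fingers",
    ).images[0]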