It was mentioned that when getting the VAE latent embeddings, the constant 0.18215 was used to scale the latents in the original paper. Was there a reason this specific number was picked (i.e. does it have some property), or was it more “we tried many values and this one seemed to work best”?
Here is an explanation directly from the lead author/developer of latent diffusion and Stable Diffusion:
We introduced the scale factor in the latent diffusion paper. The goal was to handle different latent spaces (from different autoencoders, which can be scaled quite differently from images) with similar noise schedules. The scale_factor ensures that the initial latent space on which the diffusion model is operating has approximately unit variance. Hope this helps.
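In other words, the scale factor is just the reciprocal of the (empirical) standard deviation of the encoder's latents, so that the diffusion model sees roughly unit-variance inputs. A minimal sketch of the idea, using a random array as a hypothetical stand-in for a batch of VAE latents (in practice you would encode a sample of training images with the autoencoder's encoder):

```python
import numpy as np

# Hypothetical stand-in for a batch of VAE latents; a real batch would come
# from encoding images with the autoencoder. The std of ~5.49 mimics a
# latent space whose reciprocal std is close to SD's 0.18215.
rng = np.random.default_rng(0)
latents = rng.normal(loc=0.0, scale=5.49, size=(64, 4, 32, 32))

# Pick the scale factor so that scaled latents have ~unit variance.
scale_factor = 1.0 / latents.std()
scaled = latents * scale_factor

print(float(scaled.std()))  # ~1.0, by construction
```

The same factor is then divided back out before decoding, so the autoencoder itself is untouched; only the diffusion model's view of the latent space is normalized.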
I like Understanding Diffusion Models: A Unified Perspective, although it took some time (and pain) to go through. The author went through every single line of math with some sort of annotation or explanation, without skipping any steps. That takes away a lot of the guesswork.
100%. This level of detail is not needed to train or run inference with Stable Diffusion. However, it is the perfect resource for people who want to go deep and fully understand the math.
I just added this talk on the 2015 paper by Jascha Sohl-Dickstein (lead author) to the wiki, but wanted to highlight here since I think it’s great and I haven’t seen it mentioned before: