Stable Diffusion Autoencoder encoding for image semantic similarity or other visual tasks

Hi All,

Has anyone tried using the encoding from the Stable Diffusion VAE (vae.encode) for any other tasks? I tried using it for semantically similar image lookup, but the results seem quite random. I tested it on a sample of the mini-ImageNet dataset. I understand that the main job of the autoencoder is image compression. However, since the VAE image encoding is aligned with the CLIP text encoding and acts as the latent representation for the diffusion model, it seemed to me it might be a good starting point for semantic similarity comparisons.

In contrast, when I use the ViT encoding, the results are much better. Would love to hear thoughts/pointers from others.


I have not tried this; however, as you said, the VAE’s primary goal is compression. The latents are typically human-recognizable: you can usually tell what their content is just by looking at them. This suggests to me that the VAE primarily compresses lower-level features like textures rather than higher-level features like a person who is the main subject of an image, since typically at least the outline of that person is still visible in the latents.

I believe the Stable Diffusion VAE was trained independently of the UNet, which is where the CLIP alignment takes place. I think the bottleneck of the UNet would likely contain a better semantic representation than the VAE latents, though I suspect it still won’t be as good as a model trained specifically for classification.


Thank you. An update: I have been able to get good results for image similarity by using both of these encoders:

  1. The CLIP model’s image encoder (which is aligned with the CLIP text encoder, forcing the image encoder to capture a semantic representation of images).
  2. DINO (ViT) from Facebook AI Research.
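A sketch of how I compute embeddings with each, using the Hugging Face transformers library (the checkpoint names are the public base variants; the helper names and the nearest-neighbor function are my own):

```python
# Two image embedders for similarity search, via Hugging Face `transformers`.
# Checkpoint names are the standard public base variants; any size works.
import torch

def load_clip():
    from transformers import CLIPModel, CLIPProcessor
    name = "openai/clip-vit-base-patch32"
    return CLIPModel.from_pretrained(name).eval(), CLIPProcessor.from_pretrained(name)

@torch.no_grad()
def clip_embedding(model, processor, img):
    inputs = processor(images=img, return_tensors="pt")
    return model.get_image_features(**inputs)       # (1, 512)

def load_dino():
    from transformers import AutoImageProcessor, AutoModel
    name = "facebook/dino-vits16"
    return AutoModel.from_pretrained(name).eval(), AutoImageProcessor.from_pretrained(name)

@torch.no_grad()
def dino_embedding(model, processor, img):
    inputs = processor(images=img, return_tensors="pt")
    # Use the [CLS] token of the last layer as the global image descriptor.
    return model(**inputs).last_hidden_state[:, 0]  # (1, 384) for ViT-S/16

def top_k(query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Indices of the k gallery embeddings most cosine-similar to the query."""
    sims = torch.nn.functional.cosine_similarity(query, gallery)
    return sims.topk(k).indices.tolist()
```

The same top_k function works on either encoder’s output, so swapping between CLIP and DINO for a retrieval experiment only changes which embedding function fills the gallery.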

You are right: the VAE just converts between latent space and image space, and is mostly optimized for perceptual compression rather than semantic representation.

Thank you for engaging; this helped me get a better understanding.


No problem. Yes, those are better choices for the semantic representations used in semantic similarity search. According to the DINOv2 paper, DINOv2 outperformed the latest CLIP models on classification tasks, which should translate pretty well to semantic similarity.

I have done a reasonable amount of experimentation with some of the previous-generation CLIP models (circa early-to-mid 2022) and was amazed in some cases but disappointed in others. I don’t believe the domain I was interested in was well represented in the CLIP dataset, and it did not perform that well there. In the DINOv2 paper they talk about being intentional about making their dataset diverse and representative, rather than just throwing everything in, to help the model generalize better. After some limited initial testing with DINOv2, it seemed to do a better job than CLIP on my dataset.
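If anyone wants to try the same swap, DINOv2 exposes the same interface as the other ViT models in transformers (a minimal sketch; the dinov2-small checkpoint name is an assumption, and the larger variants work identically):

```python
# Sketch of embedding images with DINOv2 via Hugging Face `transformers`.
# "facebook/dinov2-small" is the smallest public variant; "-base" and
# "-large" expose the same interface.
import torch
import torch.nn.functional as F

def load_dinov2():
    from transformers import AutoImageProcessor, AutoModel
    name = "facebook/dinov2-small"
    return AutoModel.from_pretrained(name).eval(), AutoImageProcessor.from_pretrained(name)

@torch.no_grad()
def dinov2_embedding(model, processor, img):
    inputs = processor(images=img, return_tensors="pt")
    # [CLS] token of the final layer, L2-normalized so that plain dot
    # products between embeddings are cosine similarities.
    cls = model(**inputs).last_hidden_state[:, 0]
    return F.normalize(cls, dim=-1)
```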