Stable Diffusion Parameter Budget Allocation

gstaff · October 25, 2022, 5:13am

Overview

Does the stable-diffusion model have the “optimal” “shape”?

The stable-diffusion model has 3 main parts (excluding the tokenizer and scheduler):

Text encoder (e.g. fixed, pretrained CLIP ViT-L/14) - maps input text to embeddings
VAE (e.g. sd-vae-ft-ema) - compresses / decompresses pixels to / from latents
UNet - denoises latents

If we can only have so many parameters in a model due to resource constraints, how should we allocate them among these 3 components?

Data

Using our example pipeline from the notebook, I noted these parameter counts for the components:

Component	# of Parameters	Data Size (fp32)	Percent of Total
Text encoder	123,060,480	492 MB	12%
VAE	83,653,863	335 MB	8%
UNet	859,520,964	3.44 GB	80%
Total	1,066,235,307	4.27 GB	100%

So, from a model size perspective, the UNet is the biggest part by a large margin, at roughly 80% of all parameters.
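For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the size and share columns from the raw counts. The helper names (count_params, summarize) are my own, and the 4 bytes per parameter (fp32) and decimal megabytes are assumptions chosen because they match the sizes in the table.

```python
def count_params(module):
    """Total parameter count of a PyTorch-style module.

    Only relies on module.parameters() yielding objects with .numel(),
    which is how torch.nn.Module behaves.
    """
    return sum(p.numel() for p in module.parameters())


def summarize(counts, bytes_per_param=4):
    """Return (rows, total), where each row is
    (name, params, size in decimal MB, percent of total).

    bytes_per_param=4 assumes fp32 weights (an assumption, see above).
    """
    total = sum(counts.values())
    rows = [
        (name, n, n * bytes_per_param / 1e6, 100 * n / total)
        for name, n in counts.items()
    ]
    return rows, total


# Counts copied from the table in this post.
counts = {
    "Text encoder": 123_060_480,
    "VAE": 83_653_863,
    "UNet": 859_520_964,
}

rows, total = summarize(counts)
for name, n, mb, pct in rows:
    print(f"{name:<13}{n:>15,}{mb:>8.0f} MB{pct:>7.1f}%")
print(f"{'Total':<13}{total:>15,}{total * 4 / 1e6:>8.0f} MB")
```

With a loaded diffusers pipeline (call it pipe), the dictionary could instead be filled with count_params(pipe.text_encoder), count_params(pipe.vae) and count_params(pipe.unet). Unrounded, the UNet share comes out a little above 80%.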

Analysis

One of the interesting findings from the Imagen paper was that “increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model”.

Imagen uses T5-XXL as its text encoder rather than CLIP or BERT. Parameter counts for the T5 family are:

Tiny 16M
Mini 31M
Small 60M
Base 220M
Large 738M
XL 3B
XXL 11B

Their figure also shows much more impact per parameter on the text encoder side: compare the jump from T5 Small to T5 Large with the change from the 300M to the 1B param UNet.

This raises the question: if we have a fixed budget of about 1B params (roughly 4 GB) due to memory or other resource constraints, could we get better images by making the UNet smaller and scaling up the text encoder?
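To make the trade-off concrete, here is a rough sketch of how many parameters would be left for the UNet under a 1B budget for each candidate text encoder. It assumes the VAE stays as it is and uses the T5 counts exactly as listed above. If those figures cover the full encoder-decoder model, an encoder-only text encoder would be smaller, so treat the results as a pessimistic estimate. The function name unet_budget is hypothetical.

```python
BUDGET = 1_000_000_000   # fixed parameter budget from the question above
VAE_PARAMS = 83_653_863  # VAE count from the table, assumed unchanged

# Text encoder candidates: current CLIP encoder plus the T5 sizes as listed.
TEXT_ENCODERS = {
    "CLIP ViT-L/14 (current)": 123_060_480,
    "T5 Tiny": 16_000_000,
    "T5 Mini": 31_000_000,
    "T5 Small": 60_000_000,
    "T5 Base": 220_000_000,
    "T5 Large": 738_000_000,
    "T5 XL": 3_000_000_000,
    "T5 XXL": 11_000_000_000,
}


def unet_budget(text_encoder_params, budget=BUDGET, vae_params=VAE_PARAMS):
    """Parameters left for the UNet once the text encoder and VAE are paid for.

    A negative result means the text encoder alone does not fit the budget.
    """
    return budget - vae_params - text_encoder_params


for name, n in TEXT_ENCODERS.items():
    left = unet_budget(n)
    if left > 0:
        print(f"{name:<26}{left / 1e6:>8,.0f}M params left for the UNet")
    else:
        print(f"{name:<26}  does not fit in the budget")
```

Under these assumptions, T5 Large would leave under 200M params for the UNet, and XL and XXL would not fit at all, so the realistic candidates for this experiment are probably Base and Large.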

Thoughts for Discussion

As I understand it, the choice of text encoder for the stable diffusion model was based on what was readily available (e.g. there were already models for CLIP embeddings) and was not necessarily optimized. Should an “optimal” image generation model have a different balance of params between the components? Would it be possible to “transfer learn” the stable diffusion model onto an equivalent with a bigger text encoder?

If anyone has played with different mixes of component sizes, I’d be interested in hearing what you found.