Mode Seeking vs Sampling

I’m a bit confused when it comes to predicting the noise with the UNet during the reverse (denoising) process. Are we effectively sampling from the conditional distribution of the noise, are we taking the mean, or are we trying to find the mode of the distribution?

Expanding my thoughts a bit: If I understand correctly, the UNet estimates the distribution of the noise conditioned on the text embedding, the noised image and the timestep t. This posterior noise distribution is modelled as a normal distribution with a learned mean and a learned or fixed covariance. Now, when we want to “pick” the estimated noise at each step, I guess we sample using the estimated mean and covariance. I assume this gives us a greater variety of outputs than just using the mean. Would we get blurry images if we used the mean, or just conservative (i.e., little variation) outputs? If we model the noise with a normal distribution, I guess the mean would also be the mode of that distribution?

Background: I want to get boring images that have a high likelihood, instead of exotic creations.
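To make the question concrete, here is roughly how I picture one reverse step. This is just a minimal sketch of the standard DDPM update with a fixed variance, not the actual diffusers scheduler internals; eps_model and the schedule tensors (alphas, alphas_cumprod, betas) are placeholder names:

import torch

def reverse_step(eps_model, x_t, t, alphas, alphas_cumprod, betas, take_mean=False):
    # The UNet predicts the noise eps that was added to produce x_t at step t
    eps = eps_model(x_t, t)

    # Mean of the learned reverse distribution p(x_{t-1} | x_t), per DDPM:
    # mean = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps)
    mean = (x_t - betas[t] / torch.sqrt(1 - alphas_cumprod[t]) * eps) / torch.sqrt(alphas[t])

    if take_mean or t == 0:
        # "Mode seeking": for a Gaussian, the mean is also the mode
        return mean

    # Ancestral sampling: add fresh Gaussian noise scaled by the fixed std sqrt(beta_t)
    sigma = torch.sqrt(betas[t])
    return mean + sigma * torch.randn_like(x_t)

In this picture, my question is essentially whether real pipelines always take the stochastic branch (mean plus sigma * noise) at every step, and what would visually change if they just returned the mean.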

Thanks a lot :slight_smile:


Have you tried using two text encoders at once? The Stable Diffusion pipeline can use two to generate a single image, but you must first figure out which encoder’s embedding vectors are closer to your deterministic expectations. As for your actual question, I’m not sure, so I don’t want to mislead you. Sorry for my English, I’m writing from my phone in bed :slight_smile:


Thanks Mike! Do you mean using two different encoders for the same prompt? I thought that only one specific CLIP model is compatible with a specific Stable Diffusion pipeline…

@abrandl

I mean splitting the prompt into two prompts and passing each one to a different text encoder. Treat the second prompt as a supporting prompt. The XL pipeline can also be used with older stabilityai models. Code:

from diffusers import StableDiffusionXLPipeline
import torch

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
).to("cuda")

# prompt is passed to OAI CLIP-ViT/L-14
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# prompt_2 is passed to OpenCLIP-ViT/bigG-14
prompt_2 = "Van Gogh painting"
image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0]
image
This is not negative prompting! As an exercise, try building a small function that wraps this example pipeline to handle it more cleanly.
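For instance, a rough, untested sketch of such a helper could look like this. It reuses the pipeline object created above; the seed handling is only an assumption to make runs repeatable, and negative_prompt is the standard pipeline argument:

import torch

# reuses the `pipeline` object created above
def generate(prompt, support_prompt=None, negative_prompt=None, seed=None):
    # prompt goes to OAI CLIP-ViT/L-14, support_prompt to OpenCLIP-ViT/bigG-14
    generator = torch.Generator("cuda").manual_seed(seed) if seed is not None else None
    return pipeline(
        prompt=prompt,
        prompt_2=support_prompt,
        negative_prompt=negative_prompt,
        generator=generator,
    ).images[0]

image = generate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    support_prompt="Van Gogh painting",
    seed=0,
)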

Have fun trying it out, and happy coding :smiley:

will try, thx!
