Why scale up the image before sending it to the VAE?

In the Deep Dive notebook the image is multiplied by 2 and 1 is subtracted before encoding, and then during decoding the reverse operation is done. Why?

```python
import torch
from torchvision import transforms as tfms
from PIL import Image

# vae and torch_device are defined earlier in the notebook.

def pil_to_latent(input_im):
    # Single image -> single latent in a batch (so size 1, 4, 64, 64)
    with torch.no_grad():
        latent = vae.encode(tfms.ToTensor()(input_im).unsqueeze(0).to(torch_device)*2-1) # Note scaling
    return 0.18215 * latent.latent_dist.sample()

def latents_to_pil(latents):
    # Batch of latents -> list of images
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images
```

I believe @johnowhitaker mentions in the Deep Dive video that the original model was trained using images set up that way and so we have to follow the same procedure to stay consistent with how the model was trained … But I’d have to look up the video to be certain since I’m going by memory :smile:

Update: I’m watching the Lesson 10 video, and Jeremy explains this at around the 46:00 mark. I was wrong about the calculation you pointed to and about what Johno said. (What Johno mentioned was the multiplication and later division by 0.18215.) Apparently the VAE outputs values in the range of -1 to 1, and the calculation you pointed to converts those values to a range of 0 to 1, as PIL (the Python Imaging Library) requires.
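To make the two range conversions concrete, here's a minimal sketch (the values are illustrative; only torch is assumed):

```python
import torch

# ToTensor() gives pixel values in [0, 1]; the VAE expects inputs in [-1, 1].
x = torch.tensor([0.0, 0.5, 1.0])   # illustrative pixel values
vae_input = x * 2 - 1               # [0, 1] -> [-1, 1]: tensor([-1., 0., 1.])
back = vae_input / 2 + 0.5          # [-1, 1] -> [0, 1]: back to tensor([0.0, 0.5, 1.0])
assert torch.allclose(x, back)      # the two operations are exact inverses
```

(The 0.18215 factor Johno mentions is a separate thing: as I understand it, it scales the latents to roughly unit variance, matching what the diffusion model saw during training.)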


Need to watch lesson 10 :slight_smile:

The decoded image is represented as floats between -1 and 1. That line maps them to (0, 1). Then we rearrange the channels, multiply by 255, and turn the values into ints to get them into the format expected by something like PIL. So it's just fluff for converting between ways of representing an image :slight_smile:
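For concreteness, here's a minimal sketch of those steps on a random stand-in for the decoder output (shapes and values are illustrative only, not from the notebook):

```python
import torch
from PIL import Image

# Stand-in for vae.decode(latents).sample: a (batch, channels, height, width)
# float tensor with values in roughly [-1, 1].
decoded = torch.rand(1, 3, 64, 64) * 2 - 1

img = (decoded / 2 + 0.5).clamp(0, 1)                 # [-1, 1] -> [0, 1]
img = img.detach().cpu().permute(0, 2, 3, 1).numpy()  # (B, C, H, W) -> (B, H, W, C)
img = (img * 255).round().astype("uint8")             # floats -> 8-bit ints
pil_image = Image.fromarray(img[0])                   # PIL wants a (H, W, C) uint8 array
```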