I’m desperately curious to see what the model will do if it is confined to a specific region.
The ideal implementation would be to provide a mask of 0’s and 1’s in the shape of the image and let the model draw only within this region. This is not the same as drawing a big image and cutting the shape at the end, as I want the model to consider the mask shape and adapt the image to it.
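For comparison, the trick used by RePaint-style inpainting (and by several diffusers inpainting pipelines) is: at every denoising step, overwrite the region *outside* the mask with the original latents noised to the current noise level, so the model only ever "owns" the inside of the mask. Here is a toy sketch of just that blending step on plain tensors — the linear noising and all the names here are simplified stand-ins, not the real scheduler math:

```python
import torch

def blend_known_region(latents, init_latents, mask, noise, t_frac):
    # RePaint-style blend: where mask == 1 the model may draw;
    # where mask == 0 we keep the original latents, re-noised to the
    # current step (crude linear stand-in for the scheduler's add_noise).
    noised_init = (1 - t_frac) * init_latents + t_frac * noise
    return mask * latents + (1 - mask) * noised_init

# toy example at latent resolution (1, 4, 64, 64)
latents = torch.randn(1, 4, 64, 64)
init_latents = torch.zeros(1, 4, 64, 64)
noise = torch.randn(1, 4, 64, 64)

# circular mask, built directly at 64x64 latent resolution
yy, xx = torch.meshgrid(torch.arange(64), torch.arange(64), indexing="ij")
mask = (((yy - 32) ** 2 + (xx - 32) ** 2) < 12 ** 2).float()[None, None]

blended = blend_known_region(latents, init_latents, mask, noise, t_frac=0.5)
```

The key design point is that the known region is re-noised to match the current timestep before blending; pasting clean latents into a noisy sample would give the unet inconsistent noise levels.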
My humble (and wrong) approach is to do the following:
- create a mask image (a circle).
- encode the mask image to latents.
- run the unet with random latents and the initial mask as inputs (similar to the img2img approach shown in lesson 9).
- every iteration, keep adding the original mask latents to the result, hoping something nice comes out of it.
Here is the code I wrote to do it (only the first step). Credit for `np_to_latent` goes to @wyquek:
```python
import numpy as np
import cv2
import torch
import torchvision.transforms as T

# uint8 so ToTensor() scales to [0, 1] below
init_image = np.zeros(shape=(512, 512, 3), dtype=np.uint8)
# this draws only the circle outline; thickness=-1 would fill it
init_image = cv2.circle(init_image, (256, 256), 100, (255, 255, 255))

def np_to_latent(input_im):
    # Single image -> single latent in a batch (so size 1, 4, 64, 64)
    with torch.no_grad():
        latent = vae.encode(T.ToTensor()(input_im).unsqueeze(0).cuda().float()*2-1)  # Note scaling
    return 0.18215 * latent.latent_dist.sample()

init_latent = np_to_latent(init_image)

scheduler.set_timesteps(num_inference_steps)
step = 0
latents = init_latent
latent_model_input = torch.cat([latents, init_latent])
t = scheduler.timesteps[step]
latent_model_input = scheduler.scale_model_input(latent_model_input, t)
text_embeddings = prep_text(prompt)

# predict the noise residual
with torch.no_grad():
    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

# perform guidance
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

# compute the previous noisy sample x_t -> x_t-1
latents = scheduler.step(noise_pred, t, latents).prev_sample
```
This results after 1 step in the following image:
I then continue running many steps with this:
```python
for step in range(1, 70):
    latent_model_input = torch.cat([latents, init_latent])
    t = scheduler.timesteps[step]
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)
    text_embeddings = prep_text(prompt)
    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
    # show_image(latents)[0]
```
Which results in something that relates to the mask, but not what I was looking for:
I’m going to keep trying, but I thought I’d share these early stages in case someone else wants to try too!!
My next direction will be to convert back to image space (not latent) every step, apply the mask there, and go back to latent. Inefficient, but efficiency is not my goal right now.
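The shape of that decode → mask → re-encode loop can be sketched like this. To keep the sketch self-contained, the `decode`/`encode` lambdas below are crude nearest-neighbour stand-ins (with simplified channel counts), not the real `vae.decode`/`vae.encode` calls:

```python
import torch

# Stand-ins for the VAE round trip so the loop structure is runnable.
# In the notebook these would be vae.decode(latents / 0.18215) and
# 0.18215 * vae.encode(...).latent_dist.sample() (and the image would
# have 3 channels, not 4).
decode = lambda lat: lat.repeat_interleave(8, -1).repeat_interleave(8, -2)  # 64 -> 512
encode = lambda img: img[..., ::8, ::8]                                     # 512 -> 64

def masked_step(latents, init_image, mask):
    # One "mask in pixel space" iteration: decode the latents, paste the
    # original pixels back outside the mask, then re-encode.
    image = decode(latents)
    image = mask * image + (1 - mask) * init_image
    return encode(image)

latents = torch.randn(1, 4, 64, 64)
init_image = torch.zeros(1, 4, 512, 512)

# the same 512x512 circular mask as in the post, radius 100 at the centre
yy, xx = torch.meshgrid(torch.arange(512), torch.arange(512), indexing="ij")
mask = (((yy - 256) ** 2 + (xx - 256) ** 2) < 100 ** 2).float()[None, None]

new_latents = masked_step(latents, init_image, mask)
```

Inside the circle the latents pass through unchanged; outside, they are forced back to the original image every step.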
I apologize if I omitted some definitions or imports required for the code above; here are a few of them:
```python
prompt = ["a photograph of an astronaut riding a horse"]
height = 512
width = 512
num_inference_steps = 70
guidance_scale = 7.5
batch_size = 1
```
Anyhow, I hope the relevant parts are here, and please ask if you have any trouble…