Limiting the model result to a specific shape

I’m desperately curious to see what the model will do if it is confined to a specific region.
The ideal implementation would be to provide a mask of 0’s and 1’s in the shape of the image and let the model draw only within this region. This is not the same as drawing a big image and cutting the shape at the end, as I want the model to consider the mask shape and adapt the image to it.

My humble (and wrong) approach is to do the following:

  1. create a mask image (a circle).
  2. encode the mask image to latents.
  3. run the UNet with random latents and the initial mask as inputs (similar to the img2img approach shown in lesson 9).
  4. every iteration, keep adding the original mask latents to the result, hoping something nice would come out of it.

Here is the code I wrote to do it (only the first step). Credit for np_to_latent goes to @wyquek:

import numpy as np
import cv2

init_image = np.zeros(shape=(512, 512, 3), dtype=np.uint8)  # uint8 so ToTensor scales to [0, 1] below
# note: the default thickness draws only the circle outline; pass thickness=-1 for a filled disc
init_image = cv2.circle(init_image, (256, 256), 100, (255, 255, 255))

import torch
import torchvision.transforms as T


def np_to_latent(input_im):
    # Single image -> single latent in a batch (so size 1, 4, 64, 64)
    with torch.no_grad():
        latent = vae.encode(T.ToTensor()(input_im).unsqueeze(0).cuda().float()*2-1) # Note scaling
    return 0.18215 * latent.latent_dist.sample()

init_latent = np_to_latent(init_image)

step = 0
latents = init_latent  # start from the encoded mask latents
# note: instead of the usual torch.cat([latents] * 2) for classifier-free guidance,
# the second batch element here is the mask latents (part of this experiment)
latent_model_input = torch.cat([latents, init_latent])
t = scheduler.timesteps[step]
latent_model_input = scheduler.scale_model_input(latent_model_input, t)
text_embeddings = prep_text(prompt)
# predict the noise residual
with torch.no_grad():
    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

# perform guidance
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

# compute the previous noisy sample x_t -> x_t-1
latents = scheduler.step(noise_pred, t, latents).prev_sample

This results after 1 step in the following image:

I then continue running many steps with this:

for step in range(1,70):
  latent_model_input = torch.cat([latents, init_latent])
  t = scheduler.timesteps[step]
  latent_model_input = scheduler.scale_model_input(latent_model_input, t)
  text_embeddings = prep_text(prompt)
  # predict the noise residual
  with torch.no_grad():
      noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

  # perform guidance
  noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
  noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

  # compute the previous noisy sample x_t -> x_t-1
  latents = scheduler.step(noise_pred, t, latents).prev_sample
  # show_image(latents)[0]

Which results in something that relates to the mask, but not what I was looking for:

I’m going to keep trying, but thought I’d share these early stages in case someone else wants to try too!!
My next direction will be to convert back to image space (not latent) every step, mask there, and go back to latent. Inefficient, but efficiency is not my goal now :slight_smile:
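
Roughly, what I have in mind for that direction is something like this (a sketch only, assuming vae is loaded and the 512x512 pixel-space mask of 0’s and 1’s from above):

import torch
import torchvision.transforms as T

def apply_mask_in_pixel_space(latents, mask_np):
    # Sketch: decode the latents to pixel space, black out everything outside the
    # mask, and re-encode. mask_np is a (512, 512, 3) array of 0's and 1's.
    with torch.no_grad():
        image = vae.decode(latents / 0.18215).sample       # (1, 3, 512, 512), roughly in [-1, 1]
        mask = T.ToTensor()(mask_np.astype("float32")).unsqueeze(0).to(image.device)
        image = image * mask + (-1.0) * (1 - mask)          # -1 corresponds to black here
        return 0.18215 * vae.encode(image).latent_dist.sample()

The idea would be to call latents = apply_mask_in_pixel_space(latents, mask_np) after each scheduler.step.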

I apologize if I omitted some definitions or imports that are required for the code above; here are a few of them:

prompt = ["a photograph of an astronaut riding a horse"]
height = 512
width = 512
num_inference_steps = 70
guidance_scale = 7.5
batch_size = 1
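
For anyone who wants to run the snippets end to end, here is a sketch of the rest of the setup I’m assuming (model names and prep_text follow the diffusion deep dive notebook; my actual definitions may differ slightly - in particular, prep_text is assumed to return the unconditional and prompt embeddings concatenated along the batch dimension, which is what the chunk(2) above expects):

import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler

# models (as in the deep dive notebook)
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to("cuda")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to("cuda")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda")

# scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                                 beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps)

def prep_text(prompt):
    # returns torch.cat([uncond_embeddings, text_embeddings]) for classifier-free guidance
    text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                           truncation=True, return_tensors="pt")
    uncond_input = tokenizer([""] * len(prompt), padding="max_length",
                             max_length=tokenizer.model_max_length, return_tensors="pt")
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids.to("cuda"))[0]
        uncond_embeddings = text_encoder(uncond_input.input_ids.to("cuda"))[0]
    return torch.cat([uncond_embeddings, text_embeddings])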

Anyhow, I hope the relevant parts are here; please ask if you have any trouble…


Looks interesting, are you trying to do something similar to inpainting?


Nay, unfortunately I didn’t write that; it came from the diffusion deep dive notebook from @johnowhitaker.

My goal is to create an image in an irregular shape, so it’s very similar. Thanks!!
I didn’t know about the inpainting pipeline - it’s very interesting and surely will help me reach my goal.
And in general, great to know the deep dive notebook - it’s a wonderful introductory explanation.


My mistake - didn’t read carefully enough… But anyhow thanks to you I found it!


In my mind this seems to be a special case of the "wolf howling at the moon" example in Jeremy’s lecture notebook: the wolf and moon are essentially specific (arbitrary) shapes, as it were, and the prompt then constrains them to "wolf" and "moon"…

Interesting concept! What would be a use case for something like this?

Hi Mike,

Yes, it’s similar and was actually the first thing I tried. The problem is that when the model reinterprets the initial image, it strays away from the exact shape. For example, I used this triangle:

and the model result (leaving aside prompt and aesthetics for now) was:

which is not the original shape I gave (and I was gentle here with the number of steps and the strength of the model’s change relative to the init_image; letting the model be more prominent led to much more skewed versions, though eventually I want the model to produce much “stronger” output in the designated regions).
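
For reference, this is roughly the kind of call I mean (a sketch only: the pipeline is the standard diffusers img2img one, the triangle path is a placeholder, and the init-image keyword is image= or init_image= depending on the diffusers version):

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("triangle.png").convert("RGB").resize((512, 512))  # placeholder path

# strength controls how far the result may depart from the init image:
# low values keep the triangle shape better, high values let the model stray
result = pipe(prompt="a photograph of an astronaut riding a horse",
              image=init_image, strength=0.5, guidance_scale=7.5).images[0]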

So I’m trying to achieve pixel-level accuracy: designate a zone, have the model draw only within it, and also have the model “consider” the zone’s shape and edge conditions in the resulting image. The motivation? Good question. Several thoughts:

  1. It can be a great assistive tool for artists. Sometimes you want to use the AI to draw only a specific part of an image.
  2. (my motivation) It is a step towards using the AI model to create real, physical art which is not confined to rectangles.
  3. I got hooked on this as a way to get to know the SD model’s mechanisms. Sort of an exercise…

Thanks for the suggestion though, let me know if you have more ideas!


Wow! This is a fascinating concept, and I can totally see now how it could be helpful for digital artists. I dabble somewhat in art myself (mainly drawings), simply because I find the experience liberating, even though what I create may not be considered art. So, in a way, I can appreciate the need to break out of the bounds of rectangles :slight_smile:

The two examples are fascinating. What would have been an ideal output for that constraint, given that there is always going to be a prompt that an SD algorithm needs to map onto its inventory of remembered images (for lack of a better description)? Would it have been acceptable if the shape were exactly as in the control image and not merely similar?

This could also have applications in self-driving, IMO, because the “future happens” at varying rates in different segments of the visual field. The road is the most active part and of most concern, while everything outside of it is of lesser concern (unless something from there moves onto the road |_/ \_| ), so limiting a model’s prediction rates to specific arbitrary shapes could possibly be of use in that area as well.

Please keep us posted on your progress!

The ideal output would be a black background, imagery in the noised white region (according to the prompt), and a black triangle in the center. The noise is there so the model has something random to start with (after I saw that plain white was not so good).
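
Something along these lines is what I mean by that init image (a sketch only; the triangle coordinates are placeholders):

import numpy as np
import cv2

canvas = np.zeros((512, 512, 3), dtype=np.uint8)  # black background

# a triangular band: a large filled triangle with a smaller one carved out of its center
outer = np.array([[256, 60], [60, 440], [452, 440]], dtype=np.int32)
inner = np.array([[256, 180], [150, 400], [362, 400]], dtype=np.int32)
band = np.zeros((512, 512), dtype=np.uint8)
cv2.fillPoly(band, [outer], 255)
cv2.fillPoly(band, [inner], 0)

# fill the band with bright noise so the model has something random to start from
noise = np.random.randint(160, 256, size=(512, 512, 3), dtype=np.uint8)
canvas[band == 255] = noise[band == 255]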

Not just acceptable - this is exactly what I’m asking for.

I’ll post here once I have any relevant update.


@yonatan365 not sure if you have come across this, but I think the code used in this space might be useful for you:


Wow, I think this is quite it!
I’m actually a bit disappointed - I kind of wanted to build it myself, but I’m too curious to keep going on my own, so I’ll peek at the code :slight_smile:

Thanks!


Following @bipin’s reference, I read the Hugging Face inpainting code. It does what I want, but at a lower resolution than the image: it applies the mask in latent space, which is 8x smaller in resolution than the original image.

The code is actually almost identical to the normal SD image-to-image code, with the following changes:

  1. They create a mask in real (pixel) space (a figure of 1’s and 0’s).
  2. They convert the mask to latent space by simply downscaling its width and height by a factor of 8 (I thought I was clever and used the VAE to convert my mask to latent space…). Here are the code lines:
mask = mask.resize((w // 8, h // 8), resample=PIL.Image.NEAREST)
mask = np.array(mask).astype(np.float32) / 255.0
mask = np.tile(mask, (4, 1, 1))
  3. In the UNet latents-update loop, they only allow updates at the locations where the mask is 1; where the mask is 0 they keep the original image’s latent values. Here are the code lines:
# masking
init_latents_proper = self.scheduler.add_noise(init_latents_orig, noise, t)
latents = (init_latents_proper * mask) + (latents * (1 - mask))

This is it!
Simple and elegant.
The only problem is that it works at 8x lower resolution compared to the image. If I want an intricate mask this won’t work well. Maybe there is no choice but to run directly in image space, without the VAE, or to translate back and forth between latent and image space and apply the mask there, if I do want intricate mask resolution.
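
To make the mechanism concrete, here is a minimal sketch of how that blend can sit inside a plain denoising loop (assuming unet, scheduler, text_embeddings, guidance_scale, an initial latents tensor and init_latents_orig exist as in my snippets above, and that mask is the resized/tiled mask converted to a torch tensor on the same device):

import torch

noise = torch.randn_like(init_latents_orig)  # used to re-noise the original latents each step

for t in scheduler.timesteps:
    # standard classifier-free guidance: run the UNet on two copies of the latents
    latent_model_input = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample

    # blend the freshly denoised latents with the re-noised original latents according to
    # the mask (with some schedulers t may need to be wrapped in a 1-element tensor)
    init_latents_proper = scheduler.add_noise(init_latents_orig, noise, t)
    latents = (init_latents_proper * mask) + (latents * (1 - mask))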

Anyhow I’m going to play with it a bit before I try improving the resolution…


This would be a great project to experiment with.


I’m curious how the inpainting.py in the HF space above gets called; I couldn’t see any direct references to it in app.py.