Lesson 11 official topic

I’m trying to implement the DiffEdit paper, and to do so, I need to add noise to the input image.

The following is how I’m doing it.

import torch
from torch import tensor
import torchvision.transforms as T
from PIL import Image
from diffusers import LMSDiscreteScheduler

# Encode the image into a scaled latent with the VAE (vae is loaded earlier in the notebook)
img = Image.open('/content/planet.png').resize((512, 512))
with torch.no_grad():
  lat = vae.encode(T.ToTensor()(img).unsqueeze(0).half().to('cuda')*2-1)
  lat = 0.18215 * lat.latent_dist.sample()

sched = LMSDiscreteScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule='scaled_linear',
    num_train_timesteps=1000
)
sched.set_timesteps(15)

# Add noise to the latent at the 11th of the 15 inference timesteps
noise = torch.randn_like(lat)
ts = tensor([sched.timesteps[10]])
lat = sched.add_noise(lat, noise, timesteps=ts)

However, the last cell outputs the following error and I’m baffled as to why it’s occurring.

RuntimeError: a Tensor with 0 elements cannot be converted to Scalar

Full Traceback
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-8d2efdc445c5> in <cell line: 3>()
      1 noise = torch.randn_like(lat)
      2 ts = 10
----> 3 lat = sched.add_noise(lat, noise, timesteps=tensor([sched.timesteps[ts]]))

1 frames

/usr/local/lib/python3.10/dist-packages/diffusers/schedulers/scheduling_lms_discrete.py in add_noise(self, original_samples, noise, timesteps)
    302         timesteps = timesteps.to(original_samples.device)
    303 
--> 304         step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
    305 
    306         sigma = sigmas[step_indices].flatten()

/usr/local/lib/python3.10/dist-packages/diffusers/schedulers/scheduling_lms_discrete.py in <listcomp>(.0)
    302         timesteps = timesteps.to(original_samples.device)
    303 
--> 304         step_indices = [(schedule_timesteps == t).nonzero().item() for t in timesteps]
    305 
    306         sigma = sigmas[step_indices].flatten()

RuntimeError: a Tensor with 0 elements cannot be converted to Scalar

I’ve thoroughly checked the tensors that are being used and none of them have 0 elements. I’ve also tried directly editing the add_noise method, but any changes I make to it don’t seem to be registering (e.g., adding a print statement causes the same error to be thrown, and the traceback says it’s occurring at that print statement); I’m doing this on Google Colab.

You can view the code on Colab here: Google Colab
The relevant code is under the “Add Noise to Image” header (there are two headers of the same name in the notebook; it’s the first one that is the relevant one).

I’d really appreciate help; I’m baffled.

Figured out what was causing the runtime error when I decided to manually implement the sched.add_noise method.

The scheduler’s timesteps were of type float64. Yet when I tried to obtain the sigma at timestep 10, the scheduler returned, for some reason, a tensor of type float32.

The sched.add_noise method compares the scheduler’s full timestep schedule against the specific timesteps you pass in (only timestep 10 in this case). This produces a boolean tensor that contains True wherever a passed-in timestep matches one in the schedule, and False everywhere else. From that, a tensor of the indices of the matching (True) timesteps is created.

Because the scheduler’s timesteps and the passed-in timestep 10 had different precisions, the boolean tensor consisted purely of False values, despite timestep 10 existing in the scheduler’s timesteps (i.e. 0.00010002 ≠ 0.0001). Hence a tensor containing 0 elements was created.
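Here’s a small standalone sketch of the mismatch (the schedule values below are made up for illustration; they’re not the real scheduler timesteps):

import torch

# Stand-in for the scheduler's float64 timestep schedule (values are illustrative).
schedule_timesteps = torch.tensor([999.0, 927.6428571428571, 285.42857142857144],
                                  dtype=torch.float64)

# Round-tripping one timestep through float32 perturbs its low-order bits...
t = schedule_timesteps[2].to(torch.float32)

# ...so the equality check inside add_noise finds no match, and calling .item()
# on the empty result raises "a Tensor with 0 elements cannot be converted to Scalar".
step_indices = (schedule_timesteps == t).nonzero()
print(step_indices.numel())  # 0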

To fix this, I set the precision of the scheduler’s timesteps to float32.

sched.timesteps = sched.timesteps.to(torch.float32)
ts = tensor([sched.timesteps[10]])

And the error was solved!

Man, it’s satisfying figuring out what’s happening and managing to solve it heh.

2 Likes

I need some help on how to properly apply the generated mask during the denoising phase of DiffEdit.

If I’m understanding the steps of DiffEdit correctly, the mask is applied during each denoising step — meaning the background pixels of the original image are added to the latent at each denoising step.

However, when I try to implement this, I get the following result (I’m replacing a horse with a zebra).

This is what my denoising loop looks like.

# (get_embs, get_lat, denoise, decompress, compress_img, mask, start_step and
#  neg_prompt are defined elsewhere in the notebook)
prompt = ['zebra']
img = Image.open('/content/img.png').resize((512, 512))
embs = get_embs(prompt, neg_prompt)
lat = get_lat(img)
inv_mask = 1 - mask
back = torch.mul(F.to_tensor(img).permute(1, 2, 0), torch.from_numpy(inv_mask))

for i, ts in enumerate(tqdm(sched.timesteps)):
  if i >= start_step:
    lat = denoise(lat, ts)
    back = get_lat(Image.fromarray((back*255).numpy().round().astype(np.uint8)), start_step=i)
    back = decompress(back)
    fore = torch.mul(torch.from_numpy(decompress(lat)), torch.from_numpy(mask))/255
    lat = compress_img(Image.fromarray(((fore+(back/255))*255).numpy().round().astype(np.uint8)))

back is the background pixels, which I obtain by inverting my mask and applying it to the original image.

fore is the pixels comprising the zebra, which I obtain by decompressing the latent and applying the mask to it.

I then obtain the final latent by adding fore + back together, and then compressing it for the next loop.

I’d really appreciate some help. If you need any more information to do so, please do let me know.

1 Like

In this lesson, it’s mentioned that the “Background” section in papers is included to impress the reviewers. Reminds me of the following meme :smile:.

1 Like

This paper implementation might be helpful for you.

1 Like

Ah yes, I read through his implementation. I think the main problem is that I wasn’t calculating the mask in the same latent space as the latents themselves, as @matdmiller mentioned here:

That is, my latents are 64x64x4 whereas my mask is 512x512x3. So to apply the mask, I uncompressed my latents, applied the mask, and then recompressed the latents. Since the compression is lossy, I think that’s why the issue is occurring.
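For reference, here’s a rough sketch of applying the mask directly in latent space instead (this is my own illustration, not the referenced implementation; mask is assumed to be a single-channel 512x512 numpy array, lat the current denoised latent of shape (1, 4, 64, 64), and orig_lat_t a placeholder for the original image’s latent noised to the current timestep):

import torch
import torch.nn.functional as Fnn

# Downsample the pixel-space mask to the latent resolution...
lat_mask = Fnn.interpolate(torch.from_numpy(mask).float()[None, None], size=(64, 64))  # (1, 1, 64, 64)
# ...then blend in latent space: edited foreground + original background, no decode/re-encode needed.
lat = lat_mask * lat + (1 - lat_mask) * orig_lat_t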

There are a few other differences between my implementation and the actual steps in the paper, but I don’t think they should make much of a difference.

I’ve decided to leave my implementation for now (have spent 2 weeks on it heh) and write a post on my current implementation. Perhaps I’ll return to it later on in the course.

1 Like

I just figured out how exactly matrix multiplication works through broadcasting. That was an “Aha!” moment :smile:.

Here’s what’s happening.

Let’s say we have the following tensor of 2 images. Each row is a single image.

t1 = tensor([
    [1, 2, 3],
    [4, 5, 6],
])

Here is our tensor of 3 sets of weights. Each column is one set of weights.

t2 = tensor([
    [9, 6, 3],
    [8, 5, 2],
    [7, 4, 1]
])
t1.shape, t2.shape
(torch.Size([2, 3]), torch.Size([3, 3]))

Let’s get a single image.

t1[0].shape, t2.shape
(torch.Size([3]), torch.Size([3, 3]))

We can obviously do the dot product since the dimensions are compatible (t1[0] acts as a 1x3 tensor). However, this involves using at least 2 for loops, which is slow.

Instead, we can perform the dot product through elementwise multiplication and a single for loop. This is done through broadcasting. (More on the for loop part at the end of this post)

To do this, we need to reshape t1[0] so it becomes a column vector/matrix.

t1[0, :, None].shape, t2.shape
(torch.Size([3, 1]), torch.Size([3, 3]))

The shapes still aren’t identical. However, we can broadcast t1[0] from this…

t1[0, :, None], t2
(tensor([[1],
         [2],
         [3]]),
 tensor([[9, 6, 3],
         [8, 5, 2],
         [7, 4, 1]]))

…to this.

t1[0, :, None].expand_as(t2), t2
(tensor([[1, 1, 1],
         [2, 2, 2],
         [3, 3, 3]]),
 tensor([[9, 6, 3],
         [8, 5, 2],
         [7, 4, 1]]))

Now we can perform elementwise multiplication.

t3 = t1[0, :, None] * t2; t3
tensor([[ 9,  6,  3],
        [16, 10,  4],
        [21, 12,  3]])

Each column in the resulting matrix is the result of multiplying the image with a particular set of weights (here, we’re multiplying a single image with 3 different sets of weights).

\begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \end{bmatrix} \odot \begin{bmatrix} 9 & 6 & 3 \\ 8 & 5 & 2 \\ 7 & 4 & 1 \end{bmatrix} = \begin{bmatrix} 1 \cdot 9 & 1 \cdot 6 & 1 \cdot 3 \\ 2 \cdot 8 & 2 \cdot 5 & 2 \cdot 2 \\ 3 \cdot 7 & 3 \cdot 4 & 3 \cdot 1 \end{bmatrix}

\odot denotes elementwise multiplication (also known as the Hadamard product; fun name heh).

To complete the dot product, we can simply sum each column.

t3.sum(dim=0)
tensor([46, 28, 10])

So the dot product of the first image with the first set of weights is 46, with the second set of weights is 28, and with the third set of weights is 10.

What about the for loop? Since we only applied the weights to a single image, we didn’t need a loop. However, if we have multiple images, we’ll need a single for loop to loop through each image/row in t1. If we didn’t use broadcasting, we’d have to use a for loop for looping through each image/row in t1 and another for loop for each set of weights/column in t2.
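To make this concrete, here’s a minimal sketch of the single-loop matrix multiplication described above (the function name matmul_broadcast is my own):

import torch
from torch import tensor

def matmul_broadcast(a, b):
    # One Python loop over the rows of a; each row's dot products against
    # every column of b are computed in one go via broadcasting.
    out = torch.zeros(a.shape[0], b.shape[1], dtype=a.dtype)
    for i in range(a.shape[0]):
        out[i] = (a[i, :, None] * b).sum(dim=0)
    return out

t1 = tensor([[1, 2, 3],
             [4, 5, 6]])
t2 = tensor([[9, 6, 3],
             [8, 5, 2],
             [7, 4, 1]])

print(matmul_broadcast(t1, t2))  # first row is tensor([46, 28, 10]), matching t1 @ t2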

After writing this out, and if I’m not missing anything, I think it may be easier to make each row in t2 contain the weights, as opposed to the columns, as it would remove the need to transpose t1.

(Un)successfully finished my implementation of DiffEdit; you can read through it here.

While I managed to generate a mask…

…I wasn’t able to apply it.

This is probably because I repeatedly uncompressed and recompressed the latents to apply the mask — VAE compression is lossy. My mask is 512x512x3 whereas my latent is 64x64x4.

In fact, if I uncompress and immediately recompress a latent during each step in a normal stable diffusion implementation, this is the outcome.

Ended up using the Hugging Face Stable Diffusion Inpaint Pipeline.

Trying to implement DiffEdit.

For some reason, my original image is darker:

while my newly generated image is lighter. (Please see my reply; I have to split this into multiple replies due to the forum’s constraint of one embedded media item per new user.)

While my newly generated image is lighter:

This makes masking hard as I end up with something weird (one darker and the other lighter shade).

Has anyone faced this issue before?

I want to know a bit more about the DDIM process mentioned in the background section of the DiffEdit paper. A quick Google search returned a Keras implementation - Denoising Diffusion Implicit Models

Unfortunately I could not find any accessible PyTorch implementation with explanation. Would be grateful if the community could help with the same, thanks in advance!

We cover it in this course.

1 Like

@anirudh15
https://course.fast.ai/Lessons/lesson21.html

2 Likes

Hi,

Does anyone know what Jeremy uses to draw so effortlessly (or at least the effort is not visible in the video) on OneNote or in Jupyter notebooks? Those drawings immensely helped me to understand broadcasting.

Thanks

I use a graphics tablet. OneNote lets you draw directly. For other apps like Jupyter I use this: https://presentify.compzets.com/

1 Like

Thanks for sharing the details

Made a quick implementation of DiffEdit in a notebook using the Stable Diffusion Deep Dive notebook as a base. Any feedback is welcome!

In case it’s helpful to anyone else new to the math, I wanted to share that ChatGPT 4 is pretty effective here. You can take a screenshot of the equation, ask it to be explained to you, and get fairly detailed responses – even without providing it any additional context from the paper:

1 Like

I just watched the part of the lecture where JH implements fast matrix multiplication from scratch using broadcasting.

He does this by iterating over each row in the first matrix and broadcasting that row to the second matrix. I assume this is faster because the broadcasting operations all happen in parallel.

The previous inferior version he implemented used iteration over both dimensions and so nothing happened in parallel. But then he replaced one of the iterations with a broadcast, speeding it up.

My question is: why not take this a step further and replace the remaining iteration with broadcasting as well?

Then all the rows of the first matrix could be broadcast against the second matrix in parallel. Wouldn’t that speed it up even more?
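For what it’s worth, here’s a rough sketch of what that further step could look like (my own example, not the lesson’s code): give the first matrix a trailing unit dimension and the second a leading one, so the whole product is computed with no Python loop at all.

import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)

# One loop replaced by broadcasting (roughly the lesson's approach):
out_loop = torch.stack([(a[i, :, None] * b).sum(dim=0) for i in range(a.shape[0])])

# The remaining loop replaced by broadcasting too:
# a[:, :, None] is (2, 3, 1), b[None] is (1, 3, 4); their product broadcasts
# to (2, 3, 4), and summing over the shared middle dimension gives the (2, 4) result.
out_bcast = (a[:, :, None] * b[None]).sum(dim=1)

print(torch.allclose(out_loop, a @ b), torch.allclose(out_bcast, a @ b))  # True True

Note that the fully broadcast version materialises an (n, k, m) intermediate, so it trades extra memory for dropping the Python loop.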

I know I am late to the party on DiffEdit implementation, but here is my take and results on it if anyone is interested in 2024:

Notebook
20240405_diffedit_consolidated.ipynb

Summary

  1. I used LMSDiscreteScheduler, just like in the lesson 9 notebook.
  2. I removed extreme values in the noise prediction difference by clipping the bottom and top 10% of values. This is a hyper-parameter that can be changed, but the default value of 10% works well. This step is also mentioned in the paper in Section 3.2, Step 1.
  3. I averaged the noise difference over 10 random runs. This is also mentioned in the paper in Section 3.2, Step 1. The combination of steps 2 and 3 gives high-quality and stable masks (a rough sketch of these two steps follows this list).
  4. Image generation given the mask is very similar to the Image2Image example discussed in lesson 9, with a slight modification towards the end of the loop, where the mask is used to replace background latent pixels with the corresponding latent pixels from the forward diffusion process.
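Here is a rough sketch of the mask-building part (steps 2 and 3); the names noise_diffs, clip_pct and threshold are my own placeholders, not from the notebook:

import torch

def build_mask(noise_diffs, clip_pct=0.10, threshold=0.5):
    # noise_diffs: (n_runs, 4, 64, 64) stack of noise-prediction differences,
    # one entry per random run (shape assumed for illustration).
    d = noise_diffs.mean(dim=0).mean(dim=0)                  # average over runs, then channels -> (64, 64)
    lo, hi = torch.quantile(d, clip_pct), torch.quantile(d, 1 - clip_pct)
    d = d.clamp(lo.item(), hi.item())                        # clip extreme values (step 2)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)           # rescale to [0, 1]
    return (d > threshold).float()                           # binarize into the edit mask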

Results



2 Likes