Lesson 11 official topic

Your work on the mask helped me get a handle on creating a mask at my end. Thank you!

My notebook is here if you’d like to see how things are progressing …

2 Likes

I simply set up a break point for the timesteps loop so that, instead of going directly to a specific step, the loop would iterate over each step until it got to that step and then break. Or, for the second stage where you want to start from some noise, skip over some steps and then start looping … Seems to work, but I didn't use the DDIM scheduler - just the LMSD one we've been using so far …
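In case it's useful, here's a rough sketch of what I mean (the loop shape is the one from the lesson notebooks; stop_at, start_from and denoise_step are just illustrative names, not the actual code in my notebook):

# Stage 1: denoise normally, but break out once we reach a chosen step
for i, t in enumerate(scheduler.timesteps):
    latents = denoise_step(latents, t)   # the usual UNet + scheduler.step() call
    if i >= stop_at:
        break

# Stage 2: start from noised latents and skip the early steps entirely
for i, t in enumerate(scheduler.timesteps):
    if i < start_from:
        continue
    latents = denoise_step(latents, t)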

My notebook is here if you’d like to take a look …

4 Likes

Really impressed by how far some of you are getting on DiffEdit.
I am still trying to do step 1, which is to create the mask.
I'm trying not to look at anyone's code yet and see how far I can get on my own.
I'm just using the same old LMSDiscreteScheduler we used in the lesson 9 deep dive notebook, adding a little bit of noise and getting the two noise predictions for the two text prompts.
The input image looks like this:

and the difference of the noise predictions (noise1 - noise2) looks like this:

I know I need to normalize this or something, binarize it, and turn it into a black-and-white mask. Ha, sounds so simple, but I've been stuck on that part lol.

5 Likes

I know how you feel :slight_smile: I got to that exact point and was stuck since I couldn't figure out how to binarize the image to get a black-and-white mask … Then @aayushmnit posted his masking code (post above), and taking a look at that helped me see how it could be done.
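Roughly, the idea (a sketch, not his exact code - diff here stands for the channels-first noise difference, and the 0.5 threshold is just a starting point to tune) is to normalize the difference and then threshold it:

import numpy as np

mask = np.abs(diff).mean(axis=0)                         # collapse the latent channels
mask = (mask - mask.min()) / (mask.max() - mask.min())   # min-max normalize to [0, 1]
mask = (mask > 0.5).astype(np.float32)                   # binarize: white where the prompts disagree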

But I did have another thought about a different approach today (haven't tried it out yet though …): what if I used image segmentation to identify the horse and simply made the mask that way? It's not quite the same method as DiffEdit, but it might be easier to get a better mask for the whole horse that way.

1 Like

I have pushed some updates to my notebook. Here are some interim results -

I think the only tricky part in implementing the paper is how to apply the mask over the VAE-encoded images. Maybe someone will find my notebook helpful - diffusion_playground/4_DiffEdit.ipynb at main · aayushmnit/diffusion_playground · GitHub
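For anyone wondering about that step, the gist is something like this (a sketch with made-up names, not the notebook code - the mask lives in pixel space while the latents are 64x64, so it has to be resized before blending, and the background gets pinned back to the noised original inside the denoising loop):

import torch
import torch.nn.functional as F

# resize the pixel-space mask (H x W, values in [0, 1]) down to the latent resolution
latent_mask = F.interpolate(mask[None, None], size=(64, 64), mode="nearest")

# inside the denoising loop: keep the unmasked background pinned to the noised original
noise = torch.randn_like(orig_latents)
noised_orig = scheduler.add_noise(orig_latents, noise, torch.tensor([t]))
latents = latent_mask * latents + (1 - latent_mask) * noised_orig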

4 Likes

For everyone exploring this, it is worth noting that there is now a version of Stable Diffusion trained specifically for inpainting (filling in within a mask) that might work better: runwayml/stable-diffusion-inpainting · Hugging Face
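Usage with the diffusers inpainting pipeline looks roughly like this (init_image and mask_image are placeholders for your own PIL images; white mask pixels are the region that gets repainted):

from diffusers import StableDiffusionInpaintPipeline
import torch

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

result = pipe(prompt="a zebra", image=init_image, mask_image=mask_image).images[0]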

12 Likes

To construct the mask, we take the difference between the denoised latents using the reference prompt and the denoised latents using the query prompt:

(reference_noisy - query_noisy)

What if we introduce a hyperparameter (e.g., alpha) to amplify the differences we care about:

alpha * (reference_noisy - query_noisy)

In my experiments, it tends to work well. You can experiment with setting alpha > 1 and see what happens. Then, play with the threshold parameter when you binarize the mask.
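Something along these lines (a sketch - I'm assuming reference_noisy and query_noisy are channels-first arrays, and the abs/mean step before thresholding is my own choice rather than anything from the paper):

import numpy as np

diff = alpha * (reference_noisy - query_noisy)           # alpha > 1 amplifies the disagreement
mask = np.abs(diff).mean(axis=0)                         # collapse the channels
mask = (mask - mask.min()) / (mask.max() - mask.min())   # normalize to [0, 1]
binary_mask = (mask > threshold).astype(np.float32)      # then play with the threshold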

Here is the binarized mask I extract when I try to replace the dog in the picture with a cat (still not great, but I’m working on it):


2 Likes

When Jeremy defines the numba-decorated code for calculating the dot product, one of the lines has a . after the zero:

from numba import njit

@njit
def dot_product(a, b):
    result = 0.
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

What is the dot after the zero doing? Is this a numba thing? Or are we specifying that we want the result to be a float? (How can zero be a float, though…?)

I also noticed that the mega speed bump (from 450 ms down to 45 microseconds) that we saw in the lecture works well when you have trivial 2x2 arrays, but when you do it with our 5 sample images multiplied by the random weights you basically get no speed bump at all. Why is that the case? Am I running up against some kind of memory or throughput constraint that Numba is powerless to help with?

I think it's just initializing result as a float so that the a[i] * b[i] products get accumulated as floats.
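You can see the difference in plain Python, no Numba needed - 0. is just shorthand for 0.0, and (as far as I understand) Numba infers the accumulator's type from that initial value, so the whole loop stays in floating point:

type(0)     # <class 'int'>
type(0.)    # <class 'float'>  -- same thing as 0.0

result = 0.           # float accumulator from the very first iteration
result += 3 * 0.5     # stays a float as the products are added in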

1 Like

Thanks Fahim, I’m travelling today but will have a look when I’m home, much appreciated

1 Like

I played with this a bit and cannot confirm this. While the differences get smaller (as a percentage) with increased computation requirements, the Numba-compiled functions are consistently faster.

Here is my comparison between a pure Python function, the first run of the Numba version of that function, and the second run of it. The vectors have a size of 100 million. The Numba version is still ~30 times faster than the Python version. Would be interesting to check that systematically for different kinds of functions and data sizes :slight_smile:
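The comparison was set up roughly like this (a sketch, not my exact notebook code - it times the pure Python function, the first Numba call including compilation, and a second Numba call):

import time
import numpy as np
from numba import njit

def dot_py(a, b):
    result = 0.
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

dot_nb = njit(dot_py)   # the same function, Numba-compiled

a = np.random.rand(100_000_000)
b = np.random.rand(100_000_000)

for name, fn in [("pure Python", dot_py),
                 ("Numba, 1st run (incl. compile)", dot_nb),
                 ("Numba, 2nd run", dot_nb)]:
    start = time.perf_counter()
    fn(a, b)
    print(f"{name}: {time.perf_counter() - start:.2f}s")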

3 Likes

Are you running on a machine with a GPU? I tried this on an older-gen MacBook (CPU-only) and in a GPU-enabled cloud environment, and it took more or less exactly 4 seconds in both.

No, I forgot to mention - this is CPU only (Windows, AMD CPU)

I think I saw the speedup when just calling dot on its own, as in your example. I seem to lose the speedup when calling dot as part of the bigger matrix multiplication over many values.

I tried this (CPU only again) with the matrix multiplication for the 10,000 validation images - for me, there is no difference between the first and the following Numba runs, but a large difference from the non-compiled version.

1 Like

Fascinating. Thank you for sharing those. No idea how to explain the discrepancy except maybe just different hardware.

For anyone interested in semantic in-painting, CLIPSeg + the Stable Diffusion in-painting model shared by @johnowhitaker earlier today work pretty well (for basic cases at least).
CLIPSeg automatically creates the segmentation mask from a source image and a prompt (e.g. horse).
Then you simply pass the source image, mask image and insertion prompt (e.g. zebra) to the in-painting model.
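The flow is roughly this (a sketch assuming the CLIPSeg port in transformers - the notebook has the actual code, and the 0.35 threshold is just a guess to tune):

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from diffusers import StableDiffusionInpaintPipeline

# 1. Segmentation mask from CLIPSeg for the source prompt (e.g. "horse")
processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
seg_model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

inputs = processor(text=["horse"], images=[src_image], return_tensors="pt")
with torch.no_grad():
    logits = seg_model(**inputs).logits                  # low-res relevance heatmap
mask = torch.sigmoid(logits).squeeze() > 0.35
mask_image = Image.fromarray((mask.numpy() * 255).astype("uint8")).resize(src_image.size)

# 2. Source image + mask + insertion prompt into the in-painting model
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")
out = pipe(prompt="a zebra", image=src_image, mask_image=mask_image).images[0]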

Here are some examples:

Here’s a notebook if you want to try it out.

14 Likes

Thanks @johnowhitaker for pointing to the Stable Diffusion in-painting pipeline and @tommyc for sharing your code.

I tried two approaches -

  1. Approach 1 - Building on my last shared notebook, I improved the masking with an OpenCV trick and then generated a new image using the img2img pipeline. Then I mixed the original image with the new image using the mask (the blending step is sketched after the results below). Like below -

  2. Approach 2 - Notebook. I used the mask generated by DiffEdit+OpenCV and then used the inpaint pipeline.


Here are the comparative results. I think the in-paint pipeline results look better.
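For reference, the blending in Approach 1 boils down to something like this (a sketch - mask, new_image and orig_image are placeholders, and the OpenCV trick itself is in the notebook):

import numpy as np

mask3 = mask[..., None]   # H x W mask in [0, 1], broadcast over the colour channels
mixed = mask3 * np.asarray(new_image) + (1 - mask3) * np.asarray(orig_image)
mixed = mixed.astype(np.uint8)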

6 Likes

Thanks for sharing the code:
orig_noisy - target_noisy gives the following image:


and swapping that around to target_noisy - orig_noisy yields:

We see they are kinda complementary on the "horse",
so I took an element-wise max - np.maximum(mask1.mean(axis=-1), mask2.mean(axis=-1)) - which yields a better mask
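In code that's roughly (assuming mask1/mask2 are just the two signed differences from above):

import numpy as np

mask1 = orig_noisy - target_noisy
mask2 = target_noisy - orig_noisy
combined = np.maximum(mask1.mean(axis=-1), mask2.mean(axis=-1))
# if the two really are raw negatives of each other, this equals
# np.abs(mask1.mean(axis=-1)); if they've been clipped or rescaled first,
# the max genuinely combines the complementary regions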

5 Likes

You're welcome, John :slight_smile: And if you're interested, there's a separate thread where we (at least try to) collaborate on the DiffEdit paper. We are hoping to collaborate on future papers too so that everybody can learn from each other's efforts …

1 Like