Lesson 11 official topic

I know how you feel :slight_smile: I got to that exact point and was stuck since I couldn’t figure out how to binarize the image to get a black and white mask … Then @aayushmnit posted his masking code (post above), and taking a look at that helped me see how it could be done.

But I did have another thought about a different approach today (haven’t tried it out yet though): what if I used image segmentation to identify the horse and simply made the mask that way? It’s not quite the same method as DiffEdit, but it might be easier to get a better mask for the whole horse that way?

1 Like

I have pushed some updates to my notebook. Here are some interim results -

I think the only tricky part in implementing the paper is how to apply the mask over the VAE-encoded images. Maybe someone will find my notebook helpful - diffusion_playground/4_DiffEdit.ipynb at main · aayushmnit/diffusion_playground · GitHub

4 Likes

For everyone exploring this, it is worth noting there is now a version of stable diffusion trained specifically for inpainting (filling in within a mask), which might work better: runwayml/stable-diffusion-inpainting · Hugging Face
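For reference, loading and calling it with diffusers looks roughly like this (a sketch; the file paths and prompt are placeholders, and the mask should be white where you want the model to repaint):

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the inpainting-specific checkpoint mentioned above
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("horse.png").resize((512, 512))   # source picture (placeholder path)
mask_image = Image.open("mask.png").resize((512, 512))    # white = region to repaint (placeholder path)

result = pipe(
    prompt="a zebra standing in a field",
    image=init_image,
    mask_image=mask_image,
).images[0]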

12 Likes

To construct the mask, we take the difference between the denoised latents using the reference prompt and the denoised latents using the query prompt:

(reference_noisy - query_noisy)

What if we introduce a hyperparameter (e.g., alpha) to amplify the differences we care about:

alpha * (reference_noisy - query_noisy)

In my experiments, it tends to work well. You can experiment with setting alpha > 1 and see what happens. Then, play with the threshold parameter when you binarize the mask.
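A rough sketch of the whole mask step (alpha, the threshold value, and taking the absolute value are all knobs to experiment with; the inputs are assumed to be the denoised latent arrays with channels first):

import numpy as np

def binarize_mask(reference_noisy, query_noisy, alpha=2.0, threshold=0.5):
    # Amplified difference between the two denoised latents
    diff = alpha * np.abs(reference_noisy - query_noisy)
    # Collapse the latent channels into a single 2D map, e.g. (4, 64, 64) -> (64, 64)
    diff = diff.mean(axis=0)
    # Binarize: 1 where the prompts disagree enough, 0 elsewhere
    return (diff > threshold).astype(np.float32)

With alpha > 1, the same threshold picks up fainter differences, which is the amplification effect described above.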

Here is the binarized mask I extract when I try to replace the dog in the picture with a cat (still not great, but I’m working on it):


2 Likes

When Jeremy defines the numba-decorated code for calculating the dot product, one of the lines has a . after the zero:

from numba import njit

@njit
def dot_product(a, b):
    result = 0.
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

What is the dot after the zero doing? Is this a numba thing? Or are we specifying that we want the result to be a float? (How can zero be a float, though…?)

I also noticed that the mega speed bump we saw in the lecture (from 450 ms down to 45 microseconds, etc.) works well when you have trivial arrays of 2x2 or so, but when you do it with our 5 sample images multiplied by the random weights, you get basically no speed bump at all. Why is that the case? Am I running up against some kind of memory or throughput constraint that Numba is powerless to help with?

I think it’s just saying result should be a float so that the a[i] * b[i] products can be accumulated as floats.
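A quick check in plain Python shows there is nothing Numba-specific about it:

print(type(0))     # <class 'int'>
print(type(0.))    # <class 'float'> - '0.' is just shorthand for 0.0
print(0. == 0.0)   # True

So result starts out as a float, matching the float products it accumulates.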

1 Like

Thanks Fahim, I’m travelling today but will have a look when I’m home, much appreciated

1 Like

I played with this a bit and cannot confirm this. While the differences get smaller (as a percentage) with increased computation requirements, the Numba compiled functions are consistently faster.

Here is my comparison between a pure Python function, the first run of the Numba version of that function, and the second run of it. The vectors have a size of 100 million. The Numba version is still ~30 times faster than the Python version. Would be interesting to check that systematically for different kinds of functions and data sizes :slight_smile:
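Roughly, the comparison can be reproduced with something like this sketch (the 100-million element size matches the test above; exact timings will depend on your hardware):

import time
import numpy as np
from numba import njit

def dot_py(a, b):
    result = 0.
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

dot_nb = njit(dot_py)  # the same function, compiled by Numba

a = np.random.rand(100_000_000)
b = np.random.rand(100_000_000)

for label, fn in [("pure Python", dot_py),
                  ("Numba, 1st run (includes compilation)", dot_nb),
                  ("Numba, 2nd run", dot_nb)]:
    start = time.perf_counter()
    fn(a, b)
    print(f"{label}: {time.perf_counter() - start:.2f} s")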

3 Likes

Are you running on a machine with a GPU? I tried this on an older-gen MacBook (CPU only) and in a GPU-enabled cloud environment, and it took more or less exactly 4 seconds in both.

No, I forgot to mention - this is CPU only (Windows, AMD CPU)

I think I saw the speedup when just calling dot directly, as in your example. I seem to lose it when calling dot as part of the bigger matrix multiplication over many values.

I tried this (CPU only again) with the matrix multiplication for the 10,000 validation images - for me, there is no difference between the first and the following Numba runs, but a large difference from the non-compiled version.
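For context, a sketch of that setup (shapes assume the 10,000 x 784 validation images and a 784 x 10 weight matrix; random data stands in for the real arrays):

import numpy as np
from numba import njit

@njit
def dot(a, b):
    result = 0.
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = np.zeros((ar, bc))
    for i in range(ar):
        for j in range(bc):
            c[i, j] = dot(a[i, :], b[:, j])
    return c

x_valid = np.random.rand(10_000, 784)  # stand-in for the validation images
weights = np.random.rand(784, 10)
out = matmul(x_valid, weights)         # the first call triggers the one-off compilation of dot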

1 Like

Fascinating. Thank you for sharing those. No idea how to explain the discrepancy except maybe just different hardware.

For anyone interested in semantic inpainting, ClipSeg plus the stable diffusion inpainting model shared by @johnowhitaker earlier today works pretty well (for basic cases at least).
ClipSeg automatically creates the segmentation mask from a source image and a prompt (e.g. horse).
Then you simply pass the source image, mask image, and insertion prompt (e.g. zebra) to the inpainting model.
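For a rough idea of what the ClipSeg step looks like, here is a sketch using the Hugging Face transformers port (the CIDAS/clipseg-rd64-refined checkpoint, the placeholder paths, and the 0.4 threshold are assumptions on my part, not necessarily what the notebook uses):

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

src = Image.open("horse.png").convert("RGB")   # placeholder path for the source image
inputs = processor(text=["a horse"], images=[src], return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits.squeeze()  # low-resolution (352 x 352) mask logits

probs = torch.sigmoid(logits)
mask = (probs > 0.4).numpy().astype("uint8") * 255   # binarize; 0.4 is a tunable threshold
mask_image = Image.fromarray(mask).resize(src.size)  # upsample back to the source size

# mask_image + src + an insertion prompt (e.g. "a zebra") then go to the inpainting pipeline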

Here are some examples:

Here’s a notebook if you want to try it out.

14 Likes

Thanks @johnowhitaker for pointing to the stable diffusion inpainting pipeline and @tommyc for sharing your code.

I tried two approaches -

  1. Approach 1 - Building on my last shared notebook, I improved the masking by adding an OpenCV trick and then generated a new image using the img2img pipeline. Then I blended the original image with the new image using the mask (see the sketch after this list). Like below -

  2. Approach 2 - Notebook. I used the mask generated by DiffEdit+OpenCV and then used the inpainting pipeline.
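The blending step in approach 1 amounts to roughly the following sketch (here mask is assumed to be the binarized DiffEdit+OpenCV mask as a 0/1 array with the same height and width as the images):

import numpy as np
from PIL import Image

def blend_with_mask(original, generated, mask):
    # Keep the original pixels outside the mask and the img2img pixels inside it
    orig = np.asarray(original, dtype=np.float32)
    gen = np.asarray(generated, dtype=np.float32)
    m = mask[..., None]                      # broadcast the single-channel mask over RGB
    out = orig * (1 - m) + gen * m
    return Image.fromarray(out.astype(np.uint8))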


Here are the comparative results. I think the inpainting pipeline results look better to me.

6 Likes

Thanks for sharing the code:
orig_noisy - target_noisy gives the following image:


and swapping that around to target_noisy - orig_noisy yields:

We see they are kind of complementary in the “horse”,
so I took the element-wise max: np.maximum(mask1.mean(axis=-1), mask2.mean(axis=-1)) yields a better mask.
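In code, that combination is roughly the following (random arrays stand in for the two decoded images; the clipping of negative values is my assumption about why the two subtraction orders look complementary):

import numpy as np

# Placeholders for the two decoded images (H x W x C, values in [0, 1])
orig_noisy = np.random.rand(512, 512, 3)
target_noisy = np.random.rand(512, 512, 3)

# Negative values get clipped when each difference is viewed as an image,
# so each subtraction order picks up a different part of the horse
mask1 = np.clip(orig_noisy - target_noisy, 0, 1)
mask2 = np.clip(target_noisy - orig_noisy, 0, 1)

# Per-pixel maximum of the channel-averaged differences
combined = np.maximum(mask1.mean(axis=-1), mask2.mean(axis=-1))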

5 Likes

You’re welcome, John :slight_smile: And if you’re interested, there’s a separate thread where we (at least try to) collaborate on the DiffEdit paper. We are hoping to collaborate on future papers too, so that everybody can learn from each other’s efforts …

1 Like

That looks good! I got distracted from DiffEdit by switching over to the 1.5 model, but I hope to get back to it today. I was going to do exactly what you did for approach 1, till @johnowhitaker mentioned the inpainting pipeline yesterday and I thought, “Doh, how could I have forgotten that?”, since I spent a lot of time creating a method for easy masking in a GUI I built so that the image can be used with the inpainting pipeline. I guess tunnel vision can really get you :stuck_out_tongue:

But I do want to try out both approaches. If I do get to it, I’ll post the results.

1 Like

Hey everyone!! Been staying off this forum for a few days to make sure some kickass implementation doesn’t discourage me from making my own implementation of the DiffEdit paper :sweat_smile:
I’ve written up a Jupyter notebook that documents my attempts at implementing step one - making a mask based on the query text (I’ll allow you to be the judge!). I thought I’d put it on GitHub and render it on nbviewer, but it’s 93MB in size and GitHub has a 25MB size limit. If anyone has any ideas on how I can show my work without having to cut it down, please let me know :slight_smile:

2 Likes

Git LFS? If it’s just hitting the per-file size limit, that’d be one way. Or could you host the file somewhere in cloud storage and have people download it to their own machines?
What is the part that is taking up all that space? Images?

1 Like