Lesson 11 official topic

Did anyone give textual inversion training a go? I see textual_inversion_train.py in the diffusion-nbs repo but am getting a few errors when trying to step through it… any good starting points that others have tried?

Thanks Much
John

@KevinB How did you get your mask object in latent space? When you calculate a mask in 512x512 dimension and do a vae encode on it, does it retain 1/0 properties correctly?

My mask object is in latent space because it is the output of the unet. So as long as I don’t run the vae.decode step, it will still be in latent space. Let me know if that doesn’t make sense and I can try to explain it again.
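
Roughly the shape of it (a simplified sketch, not my exact code; `unet`, `latents`, the timestep `t` and the prompt embeddings `ref_emb` / `edit_emb` are whatever you already have in your sampling loop, and the 0.5 threshold is just something to tune):

import torch

with torch.no_grad():
    noise_ref = unet(latents, t, encoder_hidden_states=ref_emb).sample
    noise_edit = unet(latents, t, encoder_hidden_states=edit_emb).sample

# difference of the two noise predictions, still in latent space (4x64x64),
# so no vae.decode is needed to work with it
diff = (noise_ref - noise_edit).abs().squeeze(0)

# normalise each of the 4 latent channels to [0, 1] independently
mins = diff.amin(dim=(1, 2), keepdim=True)
maxs = diff.amax(dim=(1, 2), keepdim=True)
diff = (diff - mins) / (maxs - mins)

mask = (diff.mean(dim=0) > 0.5).float()  # binary 64x64 latent-space mask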

1 Like

Great discussion. @KevinB, I was similarly confused as to whether the mask was generated from the generated noise or from the final image, and the discussion has helped with that. As I have seen others say, since the mask generation is the core of the paper, its description is rather vague and open to different interpretations; it’s a shame there is no code example. Overall I agree with your interpretation, but as others have shown it can also work well when taking the difference of the denoised images (though I suspect that way of doing it is more dependent on colour differences between the two images).

I also noted that in the paper they take a blend of the initial image latents and a random noise array of the same shape, which is different from the scheduler.add_noise approach, which simply adds (scaled) noise to the image latents. However, I haven’t tried varying things to explore this as a starting point yet.
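
For reference, this is roughly the difference I mean (a sketch only; `latents`, `scheduler` and `timesteps` are whatever you already have, and the strength value is made up):

import torch

noise = torch.randn_like(latents)

# scheduler.add_noise: scales the noise for the chosen timestep and adds it
# to the image latents (the exact formula depends on the scheduler)
noised = scheduler.add_noise(latents, noise, timesteps)

# paper-style blend (as I read it): a plain interpolation between the image
# latents and pure noise, with a strength you choose yourself
strength = 0.5  # made-up value
blended = (1 - strength) * latents + strength * noise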

The other thing I am finding is that the choice of start_step makes a big difference: moving by even a couple of places can generate very different denoised images (start too early and the image typically lacks the content of the original and is very idealised; too late and the image retains too much noise).
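
In case it helps anyone experimenting, this is the part I mean (a sketch; `init_latents` are the VAE-encoded image latents, the start_step value is arbitrary, and the exact add_noise call depends on your diffusers version):

import torch

num_inference_steps = 50
start_step = 10  # try nudging this by one or two and compare the outputs

scheduler.set_timesteps(num_inference_steps)

# noise the encoded image latents up to the level of timesteps[start_step],
# then only run the denoising loop from that step onwards
noise = torch.randn_like(init_latents)
latents = scheduler.add_noise(init_latents, noise,
                              scheduler.timesteps[start_step:start_step + 1])

for t in scheduler.timesteps[start_step:]:
    ...  # usual guided denoising step (unet prediction + scheduler.step)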

It would also be interesting to see how much difference the two types of scheduler make. I have used the LMS one so far but would like to see how well it works with the DDIM one.
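
If anyone wants to try the comparison, swapping schedulers is only a couple of lines (roughly; the exact pattern depends on your diffusers version, and `pipe` is an already-loaded pipeline):

from diffusers import DDIMScheduler, LMSDiscreteScheduler

# build both schedulers from the same config so the noise schedule matches
lms = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
ddim = DDIMScheduler.from_config(pipe.scheduler.config)

pipe.scheduler = ddim  # or lms, then rerun the same prompt and seed to compare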

This is surely something to consider/talk about. After a ton of playing around I had to start Googling “dark horse with no tail” images to try and get better mask generation :joy: (the tail was morphing into appendages, for anyone wondering).

Here are my results after working the whole week on the DiffEdit paper. Thanks also to @aayushmnit, @KevinB, @_lucas and @Fahim for sharing their progress. I will post the code later this week. The only different approach in my implementation was to apply Gaussian noise directly to the input image latents, without using the scheduler. The inpainting is done using the StableDiffusionInpaintPipeline.
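
The noising step is basically just this (a sketch; `vae` and `image_tensor` are the usual VAE and preprocessed input, and the strength value is hand-picked rather than anything principled):

import torch

# encode the input image with the VAE as usual ...
latents = vae.encode(image_tensor).latent_dist.sample() * 0.18215

# ... then noise the latents directly with Gaussian noise instead of
# going through scheduler.add_noise
noise = torch.randn_like(latents)
strength = 0.5  # hand-picked mix between image latents and pure noise
noisy_latents = (1 - strength) * latents + strength * noise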

First edit is a dog running to a lion running


Then the usual horse to zebra


A bit more difficult one, Kangaroo with sunglasses to Zebra with sunglasses


15 Likes

Mask Generation using noise
I have tried the approach @KevinB followed for an image downloaded from the web. The original image, denoised original, and denoised with the zebra prompt are below:

It almost doesn’t need a mask! Using Kevin’s approach to masking I get four normalised difference channels as follows:

Applying Kevin’s extract channel mask step then gives:

It just needs a bit of fine-tuning to get rid of the zone above the zebra’s back. I will apply the mask tomorrow and see how smooth the transition from masked to unmasked areas is.

8 Likes

I gave it a try sometime back, might have been a couple of months ago? (So many things have happened with Stable Diffusion that I’m losing track of time …) I think I followed a guide based on a Medium article; let me see if I can dig it up …

https://towardsdatascience.com/how-to-fine-tune-stable-diffusion-using-textual-inversion-b995d7ecc095

There might be better guides around now but that’s the one I used back then. My use-case was rather specific (I was trying to generate Great A’Tuin from the Discworld, since SD never seems to get that right …) and so it didn’t yield great results. But I do see a lot of people using textual inversion to great effect these days, so I probably should give it a go again :slight_smile:

Edit: @johnrobinsn I just found this as well, and since the Hugging Face docs are generally pretty useful and to the point, adding it here:

1 Like

Is anybody having trouble using "runwayml/stable-diffusion-inpainting" pipeline?
When I try to generate an image it gives back __call__() missing 1 required positional argument: 'init_image'

image = pipe(
    prompt=["a zebra image"], 
    image=load_image(p), 
    mask_image=load_image('./img/mask.jpeg'), 
    generator=torch.Generator("cuda").manual_seed(100),

).images[0]

It looks as if that version of the pipeline expects the input image under the name init_image rather than image (that’s the missing argument the error is complaining about). So if you pass it as init_image=load_image(p) instead of image=load_image(p), it should (probably) work … (or update diffusers, since newer versions renamed it to image).

Something like this:

image = pipe(
    prompt=["a zebra image"], 
    init_image=load_image(p), 
    mask_image=load_image('./img/mask.jpeg'), 
    generator=torch.Generator("cuda").manual_seed(100),
).images[0]

Need some closure for DiffEdit so gonna share some last minute results before the next lesson.
I know it could be improved by all the amazing ideas discussed on this thread. I was inspired by many of them. I’m just running out of steam and maybe will take a look in the future again.

I never used the cv2 tricks or anything but I know that would improve the results if I had a bit more time/energy. I really enjoyed all the tips/discussions from everyone above. I probably will return to this project in the future after trying something else for a bit.
My code is in this repo but it will look different from many of the other notebooks shared here because a lot of the SD functionality is in another class. So it’s a little trickier to reproduce by just opening a notebook, but refactoring the code that way helps me learn better.

5 Likes

DiffEdit vs Inpainting

The main contribution of the DiffEdit paper is the automatic mask generation.
Once this mask is generated, we can either continue with the DiffEdit algorithm of mixing the latents or use the mask for inpainting.
The two yield different results.

With the DiffEdit algorithm of mixing latents, we see that the final zebra still retains the brown color of the horse (the input latents from Step 2 are being mixed in).
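
Roughly, the mixing at each step looks like this (a sketch only; `mask`, `latents_edit` and `latents_orig_t` are my own names for the binary latent-space mask and the two sets of latents):

# at each step t of the edit-prompt denoising loop:
#   latents_edit   - current latents being denoised towards the edit prompt
#   latents_orig_t - the original image latents noised to the same timestep t
#   mask           - binary latent-space mask (1 = region to edit)
# outside the mask the latents are pinned back to the (noised) original,
# which is why some of the horse's brown can bleed through into the zebra
latents_edit = mask * latents_edit + (1 - mask) * latents_orig_t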

In the inpainting case we do not mix the latents. The part that is masked out is never seen by the model; it is discarded.
Hence it generates cleaner white and black stripes.

So is inpainting the better algorithm?
Well, in this case inpainting does seem to produce the cleaner black and white stripes.

But as the paper points out, “(i) inpainting discards information about the input image that should be used in image editing (e.g. changing a dog into a cat should not modify the animal’s color and pose)”, so in instances where we want the color and pose to be retained, DiffEdit should perform much better than inpainting.
PS: I was curious why Figure 5 in the DiffEdit paper has a brownish zebra

5 Likes

Good eye! I didn’t even notice that till you pointed it out, but it makes sense … I mean if you use the mask to replace selectively rather than to inpaint …

Great work. I was also facing similar issues with masks, as shown in the last example with the person. I created a simple function which fills an area of the mask with 1s if it is between two 1s. Example of application below.
Link to gist here
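
The idea is roughly this (a simplified sketch of my own, not the exact gist code; the function name is just illustrative):

import numpy as np

def fill_between_ones(mask: np.ndarray) -> np.ndarray:
    """Row-wise: set everything between the first and last 1 to 1,
    which closes holes inside the masked object."""
    filled = mask.copy()
    for row in filled:  # each row is a view into `filled`
        ones = np.flatnonzero(row)
        if len(ones) > 1:
            row[ones[0]:ones[-1] + 1] = 1
    return filled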

4 Likes

I think the mask is being used similarly in both cases, but one is in image space and the other in latent space.
If you look at johno’s notebook converting a parrot to a nat geo dancer, we see something similar. We were operating in latent space.
It was harder for me to grasp in that notebook with the prompt “A colorful dancer, nat geo photo”, so I replaced that with “a panda sitting”.


In the very first image you can see some color from the parrot in the panda.
This is what I think is happening :smiley:

3 Likes

I am not absolutely sure that the mask is used the same way :slight_smile: I believe the example from johno’s notebook looks like that because the images are progressively sort of mixed in as they go through the diffusion process. And I believe that’s what happens when you do the non-inpainting version of DiffEdit. Or rather, this is what I think is happening: you take only the parts of the image (in latent space) that are not masked and continue the diffusion process for that part with the edited prompt. The masked part can either be diffused using the original prompt and/or replaced by the final latent result from the original prompt, depending on how you do things. (I’m just phrasing this as I see how it might work; the paper probably has a specific way to do it but I don’t recall …)

For the inpainting, my intuition is that it completely replaces the editable portion of the masked image with the results of the edit prompt instead of mixing the edit prompt result with the existing image. Which is why you get completely white stripes on the zebra with this approach, for example …

But I do acknowledge that I might have this totally wrong :slight_smile:

1 Like

Ah, I agree with you about johno’s notebook not having masks, but what I was trying to say was: when we are in latent space and mask things out, that doesn’t translate directly to image space. In the inpainting case the masked-out parts are lost, but in latent space the parts that are masked are not lost, since the neighboring pixels in the latent space carry information about them (due to compression).

PS: are we both saying the same thing :grimacing:?

1 Like

For folks looking to improve their masks, you might find morphological operations useful:

https://docs.opencv.org/3.4/d9/d61/tutorial_py_morphological_ops.html
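For example, a closing (dilate then erode) fills small holes in the mask and an opening removes small specks; something like this (assuming an 8-bit 0/255 mask image, filenames and kernel size are just placeholders):

import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)  # 0/255 binary mask
kernel = np.ones((5, 5), np.uint8)

closed = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)   # fill small holes
opened = cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)  # remove small specks
cv2.imwrite("mask_clean.png", opened)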

12 Likes

I think we are saying the same thing :smile:

1 Like