Share your work here ✅ (Part 2 2022)

Has anyone tried scaling the noise prediction? I tried applying small pred = pred * scaler multipliers on a linear scheduler, and it has quite an impact; I guess larger values de-noise the image more aggressively, making the result extremely stylised.
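In case it helps anyone reproduce this, here is a minimal sketch of where the multiplier goes, assuming a standard diffusers text-to-image loop (the model id, prompt, step count and the 1.2 value below are illustrative, not my exact settings):

import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
tokenizer, text_encoder, unet, scheduler = pipe.tokenizer, pipe.text_encoder, pipe.unet, pipe.scheduler

def embed(texts):
    tok = tokenizer(texts, padding="max_length", max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tok.input_ids.to(device))[0]

prompt = ["a watercolour landscape"]  # illustrative
scaler = 1.2                          # the noise prediction "exaggeration"
guidance_scale = 7.5
text_emb = torch.cat([embed([""]), embed(prompt)])

scheduler.set_timesteps(50)
latents = torch.randn(1, unet.config.in_channels, 64, 64, device=device) * scheduler.init_noise_sigma

for t in scheduler.timesteps:
    latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    with torch.no_grad():
        noise_pred = unet(latent_in, t, encoder_hidden_states=text_emb).sample
    uncond, cond = noise_pred.chunk(2)
    noise_pred = uncond + guidance_scale * (cond - uncond)
    noise_pred = noise_pred * scaler  # scale the noise prediction before the scheduler step
    latents = scheduler.step(noise_pred, t, latents).prev_sample

with torch.no_grad():
    image = pipe.vae.decode(latents / 0.18215).sample  # decoded image tensor in [-1, 1]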

All of these were generated using the same parameters except the noise prediction “exaggerations”.

10 Likes

Bring paintings back to life.

I thought I would try generating a photo of Vincent van Gogh by starting with a photo of myself. What really helped here was the negative prompt; without it, we end up with a painting.

prompt = ["photo, portrait, van gogh, detailed, sharp, focus, young, adult"]
negative_prompt = ["vincent, artistic, painting, drawing, style, blurry, old"]
start_step = 19
num_inference_steps = 17+start_step
guidance_scale = 7
generator = torch.manual_seed(33569313)

With and without negative prompt…
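For anyone who wants to try something similar without a custom loop, here is a rough high-level sketch using the diffusers img2img pipeline (my results above use start_step in a deep-dive-style loop; below, strength stands in for it, roughly (num_inference_steps - start_step) / num_inference_steps, and the input photo path is a placeholder):

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

init_image = Image.open("my_photo.jpg").convert("RGB").resize((512, 512))  # placeholder input photo

image = pipe(
    prompt="photo, portrait, van gogh, detailed, sharp, focus, young, adult",
    negative_prompt="vincent, artistic, painting, drawing, style, blurry, old",
    image=init_image,
    strength=0.47,  # roughly (36 - 19) / 36, i.e. skip the first 19 of 36 steps
    num_inference_steps=36,
    guidance_scale=7,
    generator=torch.manual_seed(33569313),
).images[0]

The negative prompt simply replaces the empty-string unconditional embedding in classifier-free guidance, which is why it can push the result away from “painting” and “old” so effectively.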

More fun this time mixing celebrities.

prompt = ["photo, portrait, elon musk, bradley cooper, detailed, sharp, focus, young, adult"]
negative_prompt = ["artistic, obama, facial hair, painting, blurry, old"]
start_step = 19
num_inference_steps = 85+start_step
guidance_scale = 7
generator = torch.manual_seed(33569313)

8 Likes

Happy Halloween! Generated using code adapted from this notebook.

prompt = ["Cartoon of a dog celebrating Halloween"]
negative_prompt = ["Diwali"]
num_inference_steps = 30
guidance_scale = 7.5

3 Likes

Here’s something I’ve been working on: a project about extending Stable Diffusion prompts with suitable style cues using a text generation model.

Here’s an example of how it works:

You can play with it on its Hugging Face Space.

For this, I trained a new tokenizer (the pre-trained one butchered artist names) on a dataset of Stable Diffusion prompts, and then trained a GPT-2 model on the same data.

Here’s the GitHub repo; it contains all the notebooks for training as well as the Gradio app. I’ve also uploaded the model and the tokenizer to the Hugging Face Hub.
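If you just want to try a model like this from the Hub, inference is only a few lines with the transformers text-generation pipeline (the model id below is a placeholder, not my actual repo; swap in the one linked above):

from transformers import pipeline

# placeholder id; point this at the actual model/tokenizer on the Hub
prompt_extender = pipeline("text-generation", model="your-username/sd-prompt-extender")

outputs = prompt_extender("a portrait of a wizard", max_length=60, num_return_sequences=3, do_sample=True)
for out in outputs:
    print(out["generated_text"])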

I’d love to know what folks think of this, as well as any feedback or suggestions anyone might have.

6 Likes

Here is my attempted implementation of DiffEdit: DiffEdit Paper Implementation | The Problem Solvers Guild

5 Likes

Hi everyone -

I have been wondering whether diffusion models can be used as general image classifiers.
Why might this be interesting?

  • Diffusion models may be able to classify images outside of their training data e.g. “astronaut on zebra” vs “astronaut on horse”
  • DMs might be able to classify labels that are not in an existing dataset (e.g. in dog vs cat dataset, “striped cat” vs “plain cat”)
  • DMs might be more resilient to out-of-distribution inputs (different lighting, backgrounds, etc.)
  • DMs may be able to explain their predictions well (see Lesson 11 official topic - #89)

If anyone has ideas for how to improve classification accuracy, or thinks that there is a limit to how well DMs can classify, please let me know! Any feedback would be much appreciated.

I tried some experiments in this notebook:

Setup:

  • dogs vs cats dataset
  • using pretrained Stable Diffusion, no additional training on dataset [0]
  • accuracy shown for 200 images [1] for a variety of loss functions that I tried

Results:

The best accuracy was with “latent loss” at 90%. There is a lot of room for improvement! (The Kaggle leader is 99% accurate.)

Generated images

Correct predictions: in this case the generated image for the correct class was “closer” to the input image than the generated image for the incorrect class

Incorrect predictions: in this case the generated image for the incorrect class was “closer” to the input image than the generated image for the correct class

Method:

  1. A noised version of the image to be classified is used as the starting point.
  2. Run the diffusion model with two prompts based on the class labels, e.g. “cat” and “dog”.
  3. The diffusion process is started at an intermediate step (e.g. 30 out of 50).
  4. The diffusion process is guided by the input image with an additional loss function which is applied every 5 steps (similar to the “make image blue” loss function in the SD deep dive notebook).
  5. After the two images for each prompt are generated, measure the “distance” between the original image and the two generated images. The prompt which generated the closer image is taken as the predicted class for the input image.
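Here is a minimal sketch of those five steps, assuming diffusers components (the embed helper, the latent-space guidance step and the hyper-parameters are simplifications for illustration, not the exact code in the notebook):

import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
tokenizer, text_encoder, unet, vae, scheduler = (
    pipe.tokenizer, pipe.text_encoder, pipe.unet, pipe.vae, pipe.scheduler)

def embed(texts):
    tok = tokenizer(texts, padding="max_length", max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tok.input_ids.to(device))[0]

def classify(image_tensor, prompts=("a photo of a cat", "a photo of a dog"),
             start_step=30, num_steps=50, guidance_scale=7.5, guide_every=5, guide_scale=0.1):
    # image_tensor: (1, 3, 512, 512) in [-1, 1]; encode it to its latent x0
    with torch.no_grad():
        x0 = vae.encode(image_tensor.to(device)).latent_dist.mean * 0.18215
    scheduler.set_timesteps(num_steps)

    dists = []
    for prompt in prompts:
        text_emb = torch.cat([embed([""]), embed([prompt])])
        # 1. + 3. start from a noised version of x0 at an intermediate step
        latents = scheduler.add_noise(x0, torch.randn_like(x0), scheduler.timesteps[start_step])
        # 2. run the remaining diffusion steps for this class prompt
        for i, t in enumerate(scheduler.timesteps[start_step:]):
            latent_in = scheduler.scale_model_input(torch.cat([latents] * 2), t)
            with torch.no_grad():
                noise_pred = unet(latent_in, t, encoder_hidden_states=text_emb).sample
            uncond, cond = noise_pred.chunk(2)
            noise_pred = uncond + guidance_scale * (cond - uncond)
            # 4. every few steps, nudge the latents back towards the input image
            if i % guide_every == 0:
                latents = latents.detach().requires_grad_()
                grad = torch.autograd.grad(F.mse_loss(latents, x0), latents)[0]
                latents = (latents - guide_scale * grad).detach()
            latents = scheduler.step(noise_pred, t, latents).prev_sample
        # 5. "distance" between the generated latent and the original latent
        dists.append(F.mse_loss(latents, x0).item())

    # the prompt whose generation stayed closest to the input image is the prediction
    return prompts[dists.index(min(dists))]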

How it works:
Take the image to be classified and run it through the VAE to get its latent encoding x0.
This encoding lies on the “manifold of real images” in the space of latents.
Adding noise moves the latent off the manifold.
Running the diffusion process with prompts corresponding to class labels (c1, c2) results in two new latents (x1, x2) (which should be close to the manifold of real images). The guidance in step 4) makes sure that x1 and x2 are not too far away from x0.
By looking at how close x1 and x2 are to x0, we can get an idea of whether the c1 and c2 prompts moved the latents away from x0 or towards it.
The idea is that a prompt which matches the original image should result in a latent which is closer to x0, and so we predict that this prompt is the label of the image.

Here are the losses I have tried so far to measure the distance:
“latent loss” is the MSE between the final latent of the diffusion process and the latent of the original image
“noise loss” is the sum of latent losses across the whole diffusion process
“original loss” is the sum of the losses of the guidance function across the diffusion process
“style loss” is the standard VGG based style loss
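In code, the first two are just MSE comparisons against x0, either on the final latent only or summed over latents saved at every step (a sketch with assumed names, not the notebook’s exact functions):

import torch
import torch.nn.functional as F

def latent_loss(final_latent: torch.Tensor, x0: torch.Tensor) -> float:
    # MSE between the final latent of the diffusion process and the original image's latent
    return F.mse_loss(final_latent, x0).item()

def noise_loss(per_step_latents: list, x0: torch.Tensor) -> float:
    # sum of the latent losses across the whole diffusion process
    return sum(F.mse_loss(latent, x0).item() for latent in per_step_latents)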

Heavily based on the lesson 9 Stable Diffusion Deep Dive notebook

[0] It’s possible that the dogs vs cats dataset was included in the Stable Diffusion training data. There might be overfitting if so.
[1] Images taken from the training set, as labels are not available for the test set without submitting to Kaggle. This shouldn’t be an issue, as no training was done on the dataset.

21 Likes

Got AKed! Awesome!

7 Likes

Yes!!! It’s really awesome!

How the hell did he even figure this out? I wonder if he has a stream processor hooked into arXiv, Reddit ML, and a few other streams.

Then he sets up some keyword filters, reviews things quickly, and posts.

This is crazy (in a good way) lol

A bit late, but I was able to get my implementation of DiffEdit working (now I will sleep better). There is still work I need to do to make the masking more robust and general, but overall this was so much fun! I feel like polar bears (my favorite animals) are under-represented on this thread, so here's one that was dreamed up and brought to life using DiffEdit on a dreamed-up image of a horse.

diffedit

Looking for ideas on what to implement next; I would love recommendations if some of you are already working on something.
Thanks @jeremy for this challenge. On to the next one.

11 Likes

That’s great! How about sharing a blog post and/or your code explaining how this works?

4 Likes

Definitely. Coming soon :test_tube: :scientist:

2 Likes

Speaking of blogs, I am slowly writing my Stable Diffusion series -

Part 2 took my weekday free hours and all of today (Saturday); I hope you will find it useful. Here is my plan for this series -

  • Part 3 - Explaining the diffusion process in detail
  • Part 4 - Img2Img pipeline deep dive
  • Part 5 - DiffEdit intro and masking implementation (based on my notebook)
  • Part 6 - DiffEdit steps 2 and 3: a purist implementation, then replacing it with the in-paint pipeline.
13 Likes

Looking forward to this, Aayush! Thanks for sharing this :mechanical_arm:

1 Like


Added part 3 on Sunday -

8 Likes

I simplified this script to build a Jupyter Notebook for training textual inversion. I will post the code once I have the notebook cleaned up. It does not use Accelerator etc.; it is a single Jupyter Notebook in the style of the Stable Diffusion Deep Dive notebook.
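The core of the method is small: add a placeholder token to the tokenizer, then optimise only that token's embedding row with the usual noise-prediction MSE loss. Here is a rough sketch of that core, assuming diffusers components (data loading, the image-to-latent step and prompt tokenisation are omitted; this is not my notebook's exact code):

import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to(device)
tokenizer, text_encoder, unet, vae, scheduler = (
    pipe.tokenizer, pipe.text_encoder, pipe.unet, pipe.vae, pipe.scheduler)

# 1. register the new token and initialise it from a related word
placeholder, init_word = "<kali-dog>", "dog"
tokenizer.add_tokens(placeholder)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(placeholder)
init_id = tokenizer.encode(init_word, add_special_tokens=False)[0]
emb = text_encoder.get_input_embeddings()
emb.weight.data[token_id] = emb.weight.data[init_id].clone()

# 2. freeze everything except the token embedding matrix
for p in [*unet.parameters(), *vae.parameters(), *text_encoder.parameters()]:
    p.requires_grad_(False)
emb.weight.requires_grad_(True)
optimizer = torch.optim.AdamW([emb.weight], lr=5e-4, weight_decay=0.0)
other_rows = torch.arange(emb.weight.shape[0], device=device) != token_id

# 3. one training step: latents are a VAE-encoded (0.18215-scaled) training image,
#    input_ids a tokenised prompt such as "a photo of <kali-dog>"
def train_step(latents, input_ids):
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=device)
    noisy = scheduler.add_noise(latents, noise, t)
    text_emb = text_encoder(input_ids.to(device))[0]
    pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, noise)
    loss.backward()
    emb.weight.grad[other_rows] = 0  # only the new token's row gets updated
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()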

I trained it on my dog Kali.

A charcoal sketch of <kali-dog>

10 Likes

Using Quarto, I wrote a blog post about the DiffEdit paper.

6 Likes

I still need to clean up the code A LOT as I dig into more details and internalize the concepts. However, I committed the v1 of the notebook, which is somewhat readable. I also added the Conda environment file I am using for this.

Using gradient accumulation, this takes ~11GB of video RAM.
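For anyone unfamiliar with the pattern, gradient accumulation just sums the gradients of several small batches before a single optimizer step, so peak memory is set by the micro-batch size rather than the effective batch size. A toy, self-contained illustration of the pattern (not the notebook's code):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=4)

accum_steps = 4  # effective batch size = 4 micro-batches x 4 samples = 16
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so the summed grads match one big batch
    loss.backward()  # gradients accumulate in .grad until zero_grad()
    if (i + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps micro-batches
        optimizer.zero_grad()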

2 Likes

Playing around with a DreamBooth implementation. One thing I noticed is just how important prompt engineering is.

For example, here I use a much more detailed prompt and get some pretty good results.

In this, I use “sksdre person riding a bike” and it actually returns the original image I used during fine-tuning (and also a random person riding a bike).

One of the things I’d love to figure out is how to generate images that are photo-realistic (vs. cartoonish).

I saw that the Lexica founder had some success here, so I just gotta figure out how to replicate that and combine it with DreamBooth.

2 Likes