Share your work here ✅ (Part 2 2022)

I did a quick test with my text2img notebook (link).

And got this result with slightly tweaked prompts.

prompt: ‘Close-up photography of the face of a 21 years old girl, by Alyssa Monks, by Joseph Lorusso, by Lilia Alvarado, beautiful lighting, sharp focus, 8k, high res, pores, sweaty, Masterpiece, Nikon Z9, Award - winning photograph’

negative prompt: ‘lowres, signs, memes, labels, text, food, text, error, mutant, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, made by children, caricature, ugly, boring, sketch, lacklustre, repetitive, cropped, (long neck), facebook, youtube, body horror, out of frame, mutilated, tiled, frame, border, porcelain skin, doll like, doll’

Just FYI, it looks like the original prompt from the subreddit uses some tricks, like prompt weighting, that aren't available in the default diffusers pipeline. You can either use the AUTOMATIC1111 GUI or one of the custom diffusers pipelines from the diffusers GitHub repository (link).
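For example, one of those community pipelines, lpw_stable_diffusion, accepts Automatic1111-style prompt weights. A minimal sketch (the model id and the (word:1.3) weights here are just illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# load SD with the "lpw_stable_diffusion" community pipeline, which accepts
# Automatic1111-style weights such as (sharp focus:1.3) inside the prompt
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="close-up photography of the face of a girl, (sharp focus:1.3), 8k",
    negative_prompt="lowres, blurry, watermark",
    num_inference_steps=50,
).images[0]
```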

1 Like

Regarding dreambooth training, I noticed in my own experiments that you need to be really careful about overfitting.

Any tips to avoid overfitting?

I would first see where training for fewer steps gets you. I have not made a personal dreambooth training notebook for my notes repository yet and still need to settle on generally good training parameters. I can let you know when I do though.

Are you using the default parameters from the diffusers training script?

I used the instructions here and they seemed to work well for me. I haven't tested them thoroughly, though.
tldr:

OPTION 1: They're not looking like you at all! (Train longer, or get better training images.)
OPTION 2: They're looking like you, but they all look like your training images. (Train for fewer steps, get better training images, or fix it with prompting.)
OPTION 3: They're looking like you, but not when you try different styles. (Train longer, or get better training images.)
1 Like

I am still baffled by these results. I have posted an example where very similar phrases in Portuguese yield very different results. There are no Portuguese words in the tokenizer vocabulary, yet the model understands some words and not others.

In the linked example, I used the word “montando”, which means “riding”, and the model didn't understand it. But it does understand the less precise “andando a cavalo”, which also means riding a horse but would be literally translated as “walking by horse”. The model also understands that “astronauta” is “astronaut” and “cavalo” is “horse”.
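One quick way to investigate this is to look at how the CLIP tokenizer used by Stable Diffusion v1 (the openai/clip-vit-large-patch14 tokenizer) splits these words; a small sketch:

```python
from transformers import CLIPTokenizer

# inspect how SD v1's text encoder tokenizes the Portuguese words in question
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for word in ["montando", "andando", "cavalo", "astronauta", "riding", "horse"]:
    print(word, "->", tokenizer.tokenize(word))
# words that aren't in the vocabulary as a whole get split into several
# sub-word pieces, which may be part of why some words are "understood"
# and others aren't
```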

1 Like

Cool, thanks! I actually went through this guide; I'm going to read through the docs on the main DreamBooth repo to see if I can optimize it.

1 Like

I tried playing with interpolating seeds by interpolating the initial latent noises that seeds generate.

Linearly interpolating them doesn't produce good results. Averaging two sets of normally distributed random numbers doesn't give you a similarly distributed random sample: the average of two independent unit-variance samples has standard deviation 1/√2, not 1, and the more samples you average, the closer you get to the mean. Interpolating between the two torch.randn-generated matrices therefore makes for a latent that's closer to the mean (0), and the diffusion model doesn't understand that kind of noise, giving you uninteresting results. Here's what it looks like.

You can see that at 4.5s, in the middle, there’s hardly any image at all.
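A quick numerical check of the effect (a small sketch, not from the notebook):

```python
import torch

# averaging two independent unit-normal latents shrinks the spread:
# Var(0.5*a + 0.5*b) = 0.25 + 0.25 = 0.5, so the std drops to ~0.71
a = torch.randn(4, 64, 64)
b = torch.randn(4, 64, 64)
mid = 0.5 * a + 0.5 * b
print(a.std().item(), mid.std().item())  # ~1.00 vs ~0.71
```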

So in lieu of interpolating the noise, I used a third noise to act as a mask. Instead of interpolating from one seed’s noise to the second’s, I thresholded the mask noise from zero to one. (I ended up doing a kind of soft-threshold, below.)

The mask looks like this, then, as we go from t_seed1 to t_seed2:

(There is a bit of interpolation in the “soft” part of the threshold.)

Here’s what it looks like using the above mask transition from one seed’s initial noise to the next.
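For concreteness, this is one way such a soft-thresholded transition could be written (my guess at the approach; the names and the exact shape of the soft step are assumptions, not the notebook's actual code):

```python
import torch

def masked_transition(noise1, noise2, mask_noise, t, softness=0.2):
    # as t goes 0 -> 1, an increasing fraction of latent positions switches
    # from seed1's noise to seed2's; positions near the moving threshold
    # get a partial blend (the "soft" part of the threshold)
    thresh = torch.quantile(mask_noise.flatten(), 1.0 - t)
    mask = ((mask_noise - thresh) / softness).clamp(0.0, 1.0)
    return (1.0 - mask) * noise1 + mask * noise2
```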

I wish there were some way to geometrically interpolate high-dimensional vectors. Can one slerp between 4096-dimensional vectors?
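(For reference, slerp itself works on vectors of any dimension. A rough sketch of the standard formula applied to flattened latents, which also keeps the result at roughly the same scale as the endpoints:)

```python
import torch

def slerp(t, v0, v1):
    # spherical interpolation between two latents treated as flat vectors;
    # the result stays at roughly the same norm as the endpoints, so the
    # model still sees "noise-like" input at intermediate t
    v0f, v1f = v0.flatten(), v1.flatten()
    dot = torch.dot(v0f / v0f.norm(), v1f / v1f.norm()).clamp(-1.0, 1.0)
    theta = torch.acos(dot)
    return (torch.sin((1.0 - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)
```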

edit: another improvement: blurring the mask quite significantly (gaussian blur with sigma (3,5))

Another idea: use coherent noise instead of randn, but still generate noise with the characteristics of randn. Then you might be able to evolve the noise smoothly…?

12 Likes

Added Part 4 - Aayush Agrawal - Stable diffusion using :hugs: Hugging Face - Variations of Stable Diffusion (aayushmnit.com)

3 Likes

This is interesting! Do you have a notebook you’d be willing to share to show how you explored this? A friend was interested in understanding how to think about the seed in terms of image output and I think he’d enjoy this.

Sure, Michael. I cleaned things up a bit and then as usual got distracted adding more and more ideas…so LMK if you find anything that’s terribly broken.

There's a Colab form, so it should be pretty straightforward to experiment with even if folks aren't keen on getting deep into the code.

(Also somehow it’s 2022 and this is my first public Github repo! :tada:)

6 Likes

Thank you!! I’m going to share this with my friend. He’ll be psyched. I’ll report back with any learnings!

1 Like

Hi everyone - please take a look at this interesting behaviour of repeatedly applying diffusion:

Stationary and Stable points in the diffusion process

Idea:
If we take an image and apply diffusion to it, we can use the new image as a starting point for another diffusion application.
If we repeat this many times, we end up with a series of related images.
What do these look like?

Method:

  1. Take any image as a starting point, encode it into the VAE latent space to get lat_0
  2. Add noise to lat_0, apply diffusion with a prompt to get latent lat_1
  3. Repeat step 2) with lat_1 to get lat_2
  4. Continue applying diffusion in a loop to get a sequence of latents [lat_0, …, lat_n]
  5. Convert these latents back to images (a rough code sketch of this loop follows)
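Here is roughly how that loop could look with the standard diffusers img2img pipeline (an assumed setup for illustration; the actual notebook linked at the end of the post may differ):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

# rough sketch of the repeated-diffusion loop
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = Image.open("start_image.png").convert("RGB").resize((512, 512))  # hypothetical starting image
frames = [image]

for i in range(60):
    # case 1 below: re-seed with the same value every loop so the added noise is identical;
    # case 2 below: seed differently per loop, e.g. manual_seed(i)
    generator = torch.Generator("cuda").manual_seed(0)
    image = pipe(
        prompt="an astronaut on a horse, photo",
        image=image,
        strength=1.0,  # fully re-noise each time, roughly "diffusion start step: 0"
        generator=generator,
    ).images[0]
    frames.append(image)
```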

Case 1:

Apply the same noise every time in step 2) - i.e. use a fixed random seed in all diffusion processes.
In this case, the images converge very quickly to a single image!

[starting image: “mount fuji in spring”, diffusion prompt: “an astronaut on a horse, photo”, diffusion start step: 0, 60 diffusion loops]

This means that we have found a “fixed point” in the diffusion process: adding noise to this image and diffusing it outputs the same image again.
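In other words (my notation): writing one noise-and-denoise iteration as a map f on latents, the loop appears to converge to an approximate fixed point lat* of f:

```latex
f(\mathrm{lat}) = \mathrm{denoise}\big(\mathrm{lat} + \epsilon_{\mathrm{seed}},\ \mathrm{prompt}\big),
\qquad
f(\mathrm{lat}^{*}) \approx \mathrm{lat}^{*}
```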

Here is what the latents look like after PCA reduction. The points seem to converge in latent space as well:

Question: does anyone have an intuition about why the images converge? I have yet to find a starting point/prompt that doesn't converge.

Case 2:

Apply different noise in step 2) - i.e. change the random seed for every loop of the diffusion step.

Because the noise is fixed in case 1), perhaps it is not so surprising that a fixed point can be found.
Actually, if you apply a different noise to the fixed-point image found in case 1) and then diffuse, the output is a completely different image. So the fixed point is specific to the noise used.

But are there fixed points if a different noise is applied on every diffusion run? It looks like there might be (try playing these videos at the same time):

[starting image: “london bus”, diffusion prompt: “cat”, diffusion start step: 10, 200 diffusion loops]

[starting image: “mount fuji in spring”, diffusion prompt: “cat”, diffusion start step: 10, 200 diffusion loops]

These two diffusion loops start from different images (London bus vs Mount Fuji) and use different seeds throughout the diffusion process.
However, they both end up stabilising at very similar images!


[Step 200 images for two different diffusion loops]

They also both reach this stable point via a long period of black and white photos of a cat:


[Step 50 images for two different diffusion loops]

This pattern seems to be repeated for many different starting points/seeds for the diffusion process: initial image → black and white cat → flat block colour cat


[ Middle right cluster: initial images, Bottom left cluster: photo of black and white cat, Top left cluster: flat block colour cat. Transitions between them in latent space]

Here are some stable points for other prompts:

Dog:

Car:

Astronaut on a horse:

Future ideas:

There are lots of different variables to experiment with here. In particular, it would be interesting to vary the start step of the diffusion process and to try more diverse prompts.
I wonder if these fixed points could also give some insight into the structure of the diffusion process somehow.
See the notebook for more details.

Thanks @johnrobinsn for your tree diffusion notebook which was a help in this.

16 Likes

Hi,

I trained a textual inversion concept in the style of classic smurf cartoons.

Some nice results:

  1. Prompt: “san francisco in the style of < smurfy>”

  2. Prompt: “new york city in the style of < smurfy>”

  3. Prompt: “Paris in the style of < smurfy>”

It's interesting that when I used a specific place or person in the prompt, I tended to get results that looked like photos and lost the smurfy cartoon style.

For instance, “Paris opera house in the style of < smurfy>” resulted in the following unsmurfy images:

Similar photo-like results, with some blue elements, came from “Sydney opera house in the style of < smurfy>”.

The model is here if anyone would like to play with it.
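If anyone wants to try it from diffusers, here's a rough sketch (the concept repo id below is a placeholder for the linked model, and load_textual_inversion needs a recent diffusers version):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# load the learned <smurfy> embedding into the tokenizer/text encoder
# ("sd-concepts-library/smurfy" is a hypothetical repo id; use the real one from the link above)
pipe.load_textual_inversion("sd-concepts-library/smurfy")

image = pipe("san francisco in the style of <smurfy>").images[0]
image.save("smurfy_sf.png")
```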

11 Likes

Hi @charlie - these look like nice interpolation results.

I wonder if this could also help for interpolation:

In the videos, you can see that there is quite good frame-to-frame consistency (though over time there is a lot of drift).

Some problems that need to be solved to get it working nicely for interpolation:

  1. The first few frames in the videos “jump around” a lot until they settle. Maybe your noise interpolation could help here?
  2. The process is not guided at the moment, so you end up at a stable point which looks quite different from the original image.

One potential solution for 2) is to run the diffusion loop on the interpolation start image and the final image separately. They will (hopefully) converge to a similar-looking stable point (e.g. the black cat). If you then play the frames of the first diffusion loop followed by the frames of the second in reverse, you get a path in latent space between the two images (via the stable point).

I trained textual inversion on ~50 images of my dog. I trained on a 3070 with 8 GB of VRAM, so it took a bit of effort to get it running without memory errors:

10 Likes

I dusted off and revisited my DiffEdit implementation. After another pass I realized a few things that I think I got wrong originally; fixing them greatly improved my results, so I'm sharing…

  1. From reading the paper, I think I got hung up on the phrase “taking the difference of the noise estimates”, which I originally read as simple subtraction (with variants like subtracting and then taking the absolute value, etc.). While that does work to some degree, I found that treating each 4-channel “per pixel” latent value as a vector and taking the euclidean distance between the two noise estimates gave much better results. It is also a much better way to aggregate the channel information and retains much more of it; earlier I was trying things like taking the mean or the max across the channels. (A rough sketch follows this list.)

  2. In my earlier attempts I had also pulled the ripcord too early and was trying to do a lot of my noise-difference math in “image” (512x512) space. This is a big mistake, because there every pixel only represents color data, whereas if you do the mask differencing in latent space, each “4-channel pixel” represents not only color data but also spatial data (an 8x8 patch).
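The latent-space distance from point 1 could look roughly like this (tensor and function names are mine for illustration, not the code from the blog post):

```python
import torch

def diffedit_mask(noise_ref, noise_query, threshold=0.5):
    # noise_ref, noise_query: (4, 64, 64) noise estimates for the two prompts
    # treat each latent "pixel" as a 4-dim vector and take the euclidean
    # distance between the two noise estimates at every position
    diff = torch.linalg.norm(noise_ref - noise_query, dim=0)       # (64, 64)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)  # normalise to [0, 1]
    return (diff > threshold).float()                              # binary mask, still in latent space
```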

DiffEdit overall is still a little finicky, so there's still a lot of room for improvement. But I wrote up a blog article and tweeted it out here, if you'd like to give me a like :slight_smile:

17 Likes

Interesting, will check out your implementation for this. I really like the example oranges image.

1 Like

You need to adjust the standard deviations of the interpolated latents so that the result has the correct distribution (i.e. sigma should be scheduler.init_noise_sigma). I managed to get something working in this notebook, in case you want to check it out.
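Roughly, the adjustment looks like this (a sketch with names I've made up; see the notebook for the actual implementation):

```python
import torch

def interpolate_latents(lat_a, lat_b, t, init_noise_sigma):
    # linearly interpolate, then rescale so the result has the standard
    # deviation the scheduler expects (scheduler.init_noise_sigma)
    mixed = (1.0 - t) * lat_a + t * lat_b
    return mixed * (init_noise_sigma / mixed.std())
```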

Here are some results:

Thanks for sharing the idea! I had also encountered the problem when playing after lesson 1, and only thought of the solution when reading your post here :slight_smile: (collaborative thinking)

edit: I was thinking… it would be cool to have some other curve to interpolate between the two seeds, one that enforces some property, e.g. making the “jumps” between frames less abrupt. I'm not even sure what this would mean… but there must be more interesting curves through latent space than a straight line. How to find them, though… any ideas?

10 Likes

You need to adjust the standard deviations of the interpolated latents so that the result has the correct distribution (i.e sigma should be scheduler.init_noise_sigma )

Hah, now that you say it, it makes perfect sense. Great insight! If the distribution isn't what you need… make it into what you need!

2 Likes