Share your work here ✅ (Part 2 2022)

I tried playing with interpolating seeds by interpolating the initial latent noises that seeds generate.

Linearly interpolating them doesn’t produce good results. I’m not sure how to precisely characterize the statistics, but averaging two sets of normally distributed random samples doesn’t give you a similarly distributed sample. Intuitively, if you average lots of samples from a normal distribution, you get closer and closer to the mean. Interpolating between the two torch.randn-generated matrices produces a latent that’s closer to the mean (0), and the diffusion model doesn’t understand that kind of noise, giving you uninteresting results. Here’s what it looks like.
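(A quick sketch of the effect, if it helps: the standard deviation of a linearly interpolated latent dips well below 1 in the middle of the interpolation.)

import torch

a, b = torch.randn(4, 64, 64), torch.randn(4, 64, 64)
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    lerped = (1 - t) * a + t * b
    # std dips to roughly 0.71 at t=0.5 instead of staying near 1.0
    print(t, lerped.std().item())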

You can see that at 4.5s, in the middle, there’s hardly any image at all.

So in lieu of interpolating the noise, I used a third noise to act as a mask. Instead of interpolating from one seed’s noise to the second’s, I swept a threshold over the mask noise from zero to one. (I ended up using a kind of soft threshold, shown below.)

The mask looks like this, then, as we go from t_seed1 to t_seed2:

(There is a bit of interpolation in the “soft” part of the threshold.)
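A minimal sketch of the idea as I understand it, assuming noise1, noise2 and mask_noise are the three torch.randn latents, t sweeps from 0 to 1, and softness is an illustrative parameter (not necessarily what was used above):

import torch

def masked_blend(noise1, noise2, mask_noise, t, softness=0.2):
    # the threshold sweeps through the quantiles of the mask noise as t goes 0 -> 1
    thresh = torch.quantile(mask_noise, t)
    # soft threshold: a sigmoid around the threshold gives a narrow transition band
    mask = torch.sigmoid((mask_noise - thresh) / softness)
    # mask ~1 where mask_noise is above the threshold -> keep noise1 there,
    # mask ~0 below the threshold -> take noise2 there
    return mask * noise1 + (1 - mask) * noise2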

Here’s what it looks like using the above mask transition from one seed’s initial noise to the next.

I wish there were some way to geometrically interpolate high-dimensional vectors. Can one slerp between 4096-dimensional vectors?

edit: another improvement: blurring the mask quite significantly (Gaussian blur with sigma (3, 5))
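(Something like torchvision’s functional gaussian_blur could do this; the kernel size below is just an assumption, and mask is the mask latent from the sketch above:)

from torchvision.transforms.functional import gaussian_blur

# blur the (4, 64, 64) mask; kernel_size must be odd
mask = gaussian_blur(mask, kernel_size=[15, 15], sigma=[3.0, 5.0])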

Another idea: use coherent noise instead of randn, but still generate noise with the characteristics of randn. Then you might be able to evolve the noise smoothly…?

12 Likes

Added Part 4 - Aayush Agrawal - Stable diffusion using :hugs: Hugging Face - Variations of Stable Diffusion (aayushmnit.com)

3 Likes

This is interesting! Do you have a notebook you’d be willing to share to show how you explored this? A friend was interested in understanding how to think about the seed in terms of image output and I think he’d enjoy this.

Sure, Michael. I cleaned things up a bit and then as usual got distracted adding more and more ideas…so LMK if you find anything that’s terribly broken.

There’s a Colab form, so it should be pretty straightforward to experiment with even if folks aren’t keen on getting deep into the code.

(Also, somehow it’s 2022 and this is my first public GitHub repo! :tada:)

6 Likes

Thank you!! I’m going to share this with my friend. He’ll be psyched. I’ll report back with any learnings!

1 Like

Hi everyone - please take a look at this interesting behaviour of repeatedly applying diffusion:

Stationary and Stable points in the diffusion process

Idea:
If we take an image and apply diffusion to it, we can use the new image as a starting point for another application of diffusion.
If we repeat this many times, we end up with a series of related images.
What do these look like?

Method:

  1. Take any image as a starting point, encode it into the VAE latent space to get lat_0
  2. Add noise to lat_0, apply diffusion with a prompt to get latent lat_1
  3. Repeat step 2) with lat_1 to get lat_2
  4. Continue applying diffusion in a loop to get a sequence of latents [lat_0, …, lat_n]
  5. Convert these latents back to images (a minimal code sketch of this loop follows the list)
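A minimal sketch of the loop, assuming the diffusers img2img pipeline (which handles the encode / add-noise / denoise steps internally); the model id, strength and loop count are illustrative, and in older diffusers versions the image argument is named init_image:

import torch
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "an astronaut on a horse, photo"
image = init_image                     # any starting PIL image, e.g. "mount fuji in spring"
frames = [image]

for i in range(60):                    # 60 diffusion loops
    # fixed seed -> case 1 below; use manual_seed(i) instead for case 2
    gen = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt=prompt, image=image, strength=0.5, generator=gen).images[0]
    frames.append(image)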

Case 1:

Apply the same noise every time in step 2) - i.e. use a fixed random seed in all diffusion processes.
In this case, the images converge very quickly to a single image!

[starting image: “mount fuji in spring”, diffusion prompt: “an astronaut on a horse, photo”, diffusion start step: 0, 60 diffusion loops]

This means that we have found a “fixed point” in the diffusion process: adding noise to this image and diffusing it outputs the same image again.

Here is what the latents look like after PCA reduction. The points seem to converge in latent space as well:
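(For reference, a minimal sketch of this kind of projection, assuming the per-step latents are collected in a list called lats:)

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X = np.stack([l.detach().cpu().numpy().ravel() for l in lats])  # (n_steps, 4*64*64)
pts = PCA(n_components=2).fit_transform(X)
plt.plot(pts[:, 0], pts[:, 1], "-o")   # each point is one diffusion loop's latent
plt.show()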

Question: does anyone have an intuition about why the images converge? I have yet to find a starting point/prompt that doesn’t converge.

Case 2:

Apply different noise in step 2) - i.e. change the random seed for every loop of the diffusion step.

Because the noise is fixed in case 1), perhaps it is not so surprising that a fixed point can be found.
Actually, if you apply a different noise to the fixed point image found in case 1), then diffuse, the output will be a completely different image. So the fixed point is unique to the noise used.

But are there fixed points if a different noise is applied on every diffusion run? It looks like there might be (try playing these videos at the same time):

[starting image: “london bus”, diffusion prompt: “cat”, diffusion start step: 10, 200 diffusion loops]

[starting image: “mount fuji in spring”, diffusion prompt: “cat”, diffusion start step: 10, 200 diffusion loops]

These two diffusion loops both start at different points (london bus vs mount fuji), and use different seeds throughout the diffusion processes.
However, they both end up stabilising at very similar images!:


[Step 200 images for two different diffusion loops]

They also both reach this stable point via a long period of black and white photos of a cat:


[Step 50 images for two different diffusion loops]

This pattern seems to be repeated for many different starting points/seeds for the diffusion process: initial image → black and white cat → flat block colour cat


[ Middle right cluster: initial images, Bottom left cluster: photo of black and white cat, Top left cluster: flat block colour cat. Transitions between them in latent space]

Here are some stable points for other prompts:

Dog:

Car:

Astronaut on a horse:

Future ideas:

There are lots of different variables to experiment with here. In particular, it would be interesting to vary the start step of the diffusion process, and to try more diverse prompts.
I wonder if these fixed points could also give some insight into the structure of the diffusion process somehow.
See the notebook for more details.

Thanks @johnrobinsn for your tree diffusion notebook which was a help in this.

16 Likes

Hi,

I trained a textual inversion concept in the style of classic smurf cartoons.

Some nice results:

  1. Prompt: “san francisco in the style of < smurfy>”

  2. Prompt: “new york city in the style of < smurfy>”

  3. Prompt: “Paris in the style of < smurfy>”

It’s interesting that if I used a specific place or person as a prompt, I tended to get results that look like photos and lost the smurfy cartoon style.

For instance, “Paris opera house in the style of < smurfy>” resulted in the following unsmurfy images:

I got similar photo-like results, with some blue elements, from “Sidney opera house in the style of < smurfy>”.

The model is here if anyone would like to play with it.

11 Likes

Hi @charlie - these look like nice interpolation results.

I wonder if this could also help for interpolation:

In the videos, you can see that there is quite good frame-to-frame consistency (though over time there is a lot of drift).

Some problems that need to be solved to get it working nicely for interpolation:

  1. The first few frames in the videos “jump around” a lot until they settle. Maybe your noise interpolation could help here?
  2. The process is not guided at the moment, so you end up at a stable point which looks quite different from the original image.

One potential solution for 2) is to run the diffusion loop on both the interpolation start image and the final image separately. They will (hopefully) converge to a similar-looking stable point (e.g. the black cat). If you then play the frames of the first diffusion loop, followed by the frames of the second in reverse, you get a path in latent space between the two images (via the stable point).

I trained textual inversion on ~50 images of my dog. I trained on a 3070 with 8 GB of VRAM, so it took a bit of effort to get it running without memory errors:

10 Likes

I dusted off and revisited my DiffEdit implementation. After another pass I realized a few things that I think I got wrong originally that greatly improved my results so sharing…

  1. From reading the paper, I got hung up on the phrase “taking the difference of the noise estimates”, which I originally read as simple subtraction (with variants like subtract-then-absolute-value, etc.). While that works to some degree, I found that treating the 4-channel “per-pixel” latent value as a vector and taking the Euclidean distance between the noise estimates gave me much better results. It’s also a much better way to aggregate the channel information and retains much more of it; earlier I was trying things like taking the mean or the max across the channels. (A small sketch of this follows the list.)

  2. Also, in my earlier attempts I had pulled the ripcord too early and was doing a lot of the noise-difference math in “image” (512x512) space. This is a big mistake: every image pixel only represents color data, whereas if you do the mask differencing in latent space, each “4-channel pixel” represents not only color but also spatial data (an 8x8 patch).
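A minimal sketch of the channel-wise distance idea, with illustrative tensor names (noise_a and noise_b stand for the two noise estimates, shape (1, 4, 64, 64)):

import torch

# treat each 4-channel latent "pixel" as a vector and take the L2 norm of the
# difference, giving a single-channel 64x64 map instead of a mean/max over channels
diff = noise_a - noise_b                  # (1, 4, 64, 64)
dist = torch.linalg.norm(diff, dim=1)     # (1, 64, 64)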

DiffEdit overall is still a little finicky, so there’s still a lot of room for improvement. But I wrote up a blog article and tweeted it out here, if you’d like to give me a like :slight_smile:

17 Likes

Interesting, will check out your implementation for this. I really like the example oranges image.

1 Like

You need to adjust the standard deviations of the interpolated latents so that the result has the correct distribution (i.e. sigma should be scheduler.init_noise_sigma). I managed to get something working in this notebook, in case you want to check it out.
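(A sketch of that adjustment, assuming interp is the interpolated latent and scheduler is the pipeline’s scheduler:)

# rescale so the interpolated latent has the std the scheduler expects
interp = interp / interp.std() * scheduler.init_noise_sigma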

Here are some results:

Thanks for sharing the idea! I had also encountered the problem when playing after lesson 1, and only thought of the solution when reading your post here :slight_smile: (collaborative thinking)

edit: I was thinking… it would be cool to have some other curve to interpolate between the two seeds, one that enforces some property, e.g. making the “jumps” between frames less abrupt. I’m not even sure what this would mean… but there must exist more interesting curves through latent space than a straight line. How to find them, though… any ideas?

10 Likes

You need to adjust the standard deviations of the interpolated latents so that the result has the correct distribution (i.e. sigma should be scheduler.init_noise_sigma)

Hah, now that you say it, it makes perfect sense. Great insight! If the distribution isn’t what you need… make it into what you need!

2 Likes

How about this curve interpolating between prompt “cat” for two different seeds?
I think it has fairly good frame-to-frame consistency…

6 Likes

What curve did you use for that?

Here are some additional interpolations I did, using other prompts:
prompt:
“solarpunk cute farmer robot. pixar style. octane render. 8k. palette: white, gold, dark green”


prompt:
“blue amanita muscaria. digital painting. artstation trending”

I have a feeling there should be other interesting ways to explore the latent space more intelligently to better control image generation results. Maybe I’m not understanding latent vectors correctly, though. I’ll run some experiments and report back.

2 Likes

Hi all!

I’m also playing with some latent space interpolation.

In case it helps others, I put together a PyTorch-friendly version of SLERP. It is based on code I found in the PyTorch forums together with a popular NumPy SLERP implementation. The code here avoids casting tensors to/from NumPy, but still has the appropriate threshold check for when vectors are too close to parallel.

import torch

def slerp(v1, v2, t, DOT_THR=0.9995):
    """SLERP for pytorch tensors interpolating `v1` to `v2` with scale of `t`.

    Reference: https://splines.readthedocs.io/en/latest/rotation/slerp.html
    """
    # take the dot product between normalized vectors
    dot = torch.mul(v1/torch.linalg.norm(v1), v2/torch.linalg.norm(v2)).sum()
    
    # if the vectors are too close, return a simple linear interpolation
    if torch.abs(dot) > DOT_THR:
        res = (1 - t) * v1 + t * v2    
    
    # else, apply SLERP
    else:
        # compute the angle terms we need
        theta   = torch.acos(dot)
        theta_t = theta * t
        sin_theta   = torch.sin(theta)
        sin_theta_t = torch.sin(theta_t)
        
        # compute the sin() scaling terms for the vectors
        s1 = torch.sin(theta - theta_t) / sin_theta
        s2 = sin_theta_t / sin_theta
        
        # interpolate the vectors
        res = (s1 * v1) + (s2 * v2)
    return res

I’ll have some more examples and functions in a blog post, hopefully soon

6 Likes

It is a combination of two curves:

  1. There is a “natural” curve in latent space for all latents: Take a latent, diffuse it, then map the resulting image back to latent space (and repeat). This tends to take the images into a “cartoon” region of the latent space. See this for more details:
    Share your work here ✅ (Part 2 2022) - #156

  2. Interpolation: once both images are in cartoon space, interpolating between the images seems to work better.

Here are some other ideas that may improve interpolation:

  • Instead of interpolating along a straight line between fixed latents, update the latents after each diffusion run (as in 1)) and interpolate between the updated latents. I think this helps keep frame-to-frame consistency.
  • Instead of interpolating from img1 latents to img2 latent directly, take turns updating the latents and reverse the direction of interpolation at each step. One update step for img1 → img2, then one for img2 → img1.
  • During this process, gradually relax the guidance scale down to 0. This allows a nice, abstract transition of the image without it having to follow any prompts.
  • Eventually the two diffusion processes converge on the same point in latent space (this is the middle of the video - usually the most abstract part).
  • Concat the diffusions together, reversing one of them.

Here is one result using all of the above:

10 Likes

Hey Folks,

I just finished writing the DiffEdit blog - Stable diffusion using :hugs: Hugging Face - DiffEdit paper implementation. Here are the highlights: I first do a purist implementation and then propose a FastDiffEdit method, which significantly reduces the mask creation time (a 5x improvement).

A few learnings from doing the purist implementation, which I got wrong the first time around -

  • When we iterate 10 times to get the difference, we get ten 4x64x64 latents. Instead of directly averaging them, averaging their absolute values gives much better results.
  • Everybody is using a min-max scaler for normalization. I realized this scaling technique is not great: it just brings the values between 0 and 1 and does not change the distribution of the values itself, so there is no guarantee that a 0.5 threshold for binarization will hold. Instead, z/standard scaling guarantees the final values are distributed around zero, and we can safely take a >0 threshold for the binarization step. (A small sketch of both steps follows this list.)
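A minimal sketch of those two steps, with illustrative names (noise_diffs stands for the list of ten 4x64x64 difference latents; the channel aggregation here is just an assumption):

import torch

diffs = torch.stack(noise_diffs)              # (10, 4, 64, 64)
m = diffs.abs().mean(dim=0)                   # average of absolute values -> (4, 64, 64)
m = m.mean(dim=0)                             # aggregate channels -> (64, 64)

# z/standard scaling instead of min-max, then a >0 threshold for the binary mask
m = (m - m.mean()) / m.std()
mask = (m > 0).float()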

The above two changes are producing much better masks than any implementation I did previously.

Next, I propose a new method called FastDiffEdit, after realizing that the masking process is extremely slow (it takes ~50 sec on my machine). The reason is that we are running a diffusion loop (25 steps for each iteration, for a total of 250 steps). My take is that we don’t need to run a full diffusion loop to denoise the image; we can just use the U-Net’s one-shot prediction of the original sample and increase the repetitions to 20. In this case, we can reduce the computation from 10*25 = 250 steps to 20 steps (12x fewer U-Net calls), and the mask produced is more or less the same. Here is an example:
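A sketch of that one-shot estimate, assuming an epsilon-predicting U-Net and a scheduler that exposes alphas_cumprod (names like noisy_latents and noise_pred are illustrative, not necessarily what the blog uses):

# noisy_latents = scheduler.add_noise(latents, noise, t) as usual, then one U-Net call
# gives noise_pred; estimate the original latent directly instead of a 25-step loop:
alpha_bar = scheduler.alphas_cumprod[t]
pred_x0 = (noisy_latents - (1 - alpha_bar) ** 0.5 * noise_pred) / alpha_bar ** 0.5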

Hope you will enjoy reading this blog.

19 Likes

That is actually false.


tokenizer = pipe.tokenizer
text_input = tokenizer(
    "uma foto de um astronauta montando um cavalo",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)

text_input.input_ids

yields…
tensor([[49406, 9256, 12823, 654, 1008, 7982, 627, 1397, 5186, 5986, 1008, 4942, 6183, 49407, 49407, ..., 49407]])

and tokenizer.decode([7982, 627, 1397]) results in astronauta (7982: “astron”, 627: “au”, 1397: “ta”). In English there is a specific token for astronaut, token 18376.

The text_encoder will generate similar embeddings for prompts in different languages.

I had a similar thought about speeding this up (as in, why run the loop for each step instead of taking the prediction for the step we are interested in?), but never followed through on experimenting with it :slight_smile: Looking forward to reading your article over the weekend and doing some experimenting…