Hi, I’ve tried playing with Stable Diffusion’s image-to-image generation at different guidance scales, but I haven’t been able to get good results that preserve the original image’s features while also picking up new style characteristics from the prompt.
Methods like DreamBooth are one way to solve this, but I wanted to know if there’s a better way.
Is adding a perceptual loss during the diffusion steps a good approach for this?
In my personal experiments (unrelated to the above video), I have not had much trouble maintaining the source image features with image-to-image when I keep the strength parameter around 0.5-0.6. It does alter some features though, like the collar in the example below.
Prompt: ‘Tiny cute 3D felt fiber cat, made from Felt fibers, a 3D render, trending on cgsociety, rendered in maya, rendered in cinema4d, made of yarn, square image’
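For context on why strength around 0.5-0.6 preserves the source: image-to-image noises the init image partway into the schedule and only denoises from there, so lower strength leaves more of the original signal intact. Here's a minimal numpy sketch of that mapping (the linear-beta schedule and helper names are my own illustration, not diffusers internals; exact indexing varies by scheduler):

```python
import numpy as np

def noise_init_image(x0, strength, num_steps=50, beta_start=1e-4,
                     beta_end=0.02, train_steps=1000, seed=0):
    """Noise an init image to the timestep implied by `strength`,
    roughly as img2img pipelines do before denoising (sketch)."""
    # strength=1.0 -> start from (almost) pure noise; strength=0.0 -> keep the image
    start_step = min(int(num_steps * strength), num_steps)
    # corresponding timestep in the full training schedule
    t = int(train_steps / num_steps * start_step)
    betas = np.linspace(beta_start, beta_end, train_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    a = alpha_bar[t - 1] if t > 0 else 1.0
    noise = np.random.default_rng(seed).standard_normal(x0.shape)
    # standard forward-diffusion noising: x_t = sqrt(a)*x0 + sqrt(1-a)*eps
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return xt, start_step
```

At strength 0.5 with 50 steps, denoising only runs the last 25 steps, so the large-scale structure of the source survives while the prompt restyles the details.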
I have not had much success using VGG model features like typical style transfer methods to control the Stable Diffusion output. There’s likely something wrong with my implementation, but it does not seem promising.
With the last one (encoding the style image and using that to calculate the loss), if you multiply your loss by a factor of 3 or 4 it seems to give some more interesting results - using your notebook with the mosaic example and the campfire prompt:
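The scale factor multiplies straight through into the guidance gradient, since the nudge applied to the latents each step is the gradient of the (scaled) loss. A toy numpy sketch with a plain MSE standing in for the real feature loss (all names here are illustrative):

```python
import numpy as np

def guidance_grad(latents, target_features, loss_scale):
    """Gradient of loss_scale * MSE(latents, target) w.r.t. the latents,
    i.e. the quantity subtracted from the latents at each guided step."""
    diff = latents - target_features
    loss = loss_scale * np.mean(diff ** 2)
    grad = loss_scale * 2.0 * diff / diff.size  # analytic d(loss)/d(latents)
    return loss, grad
```

So a 4x loss scale gives exactly a 4x larger per-step pull toward the style target, which is why the effect becomes visible at 3-4x when the unscaled loss was too weak.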
Yep, and I also have not tried implementing gradient checkpointing so that I could enable gradients through the unet as well. That is likely a big limiting factor, but enabling them requires a lot more memory.
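For anyone curious, PyTorch's built-in checkpointing is enough to sketch the idea: activations inside each checkpointed block are recomputed during backward instead of stored, trading compute for the memory needed to backprop a guidance loss through the model. A minimal example with a small stack of linear blocks standing in for the unet (the model here is a toy, not the actual Stable Diffusion unet):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy stand-in for the unet: four small blocks we checkpoint individually.
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
     for _ in range(4)]
)

def forward_checkpointed(x):
    # Each block's intermediate activations are dropped after the forward
    # pass and recomputed on demand during backward.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(2, 16, requires_grad=True)
loss = forward_checkpointed(x).pow(2).mean()
loss.backward()  # activations are recomputed block-by-block here
```

The output is identical to a plain forward pass; only the memory/compute trade-off changes, so it should slot into the guided-sampling loop without altering results.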
I’ve been experimenting more with the additional feature/perceptual-loss guidance, scaling the loss differently for each of the relevant VGG layers’ outputs, similar to the content and style feature losses in the 2018 lesson 7 course. I think the results are interesting so far.
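The per-layer weighting from that lesson boils down to a Gram-matrix style loss per layer plus a content loss, each with its own scale. A small numpy sketch of that combination (the feature maps and weights here are placeholders, not actual VGG activations):

```python
import numpy as np

def gram(features):
    """Gram matrix of a (channels, height*width) feature map."""
    c, n = features.shape
    return features @ features.T / n

def weighted_feature_loss(gen_feats, style_feats, content_feats,
                          style_weights, content_weight):
    """Per-layer weighted style (Gram) losses plus a content loss on the
    deepest layer, in the spirit of the lesson 7 style-transfer notebooks."""
    style_loss = sum(
        w * np.mean((gram(g) - gram(s)) ** 2)
        for g, s, w in zip(gen_feats, style_feats, style_weights)
    )
    content_loss = content_weight * np.mean((gen_feats[-1] - content_feats) ** 2)
    return style_loss + content_loss
```

Down-weighting the deeper style layers relative to the shallow ones is the usual knob here: shallow layers carry texture and color, deeper ones carry larger structures, so the per-layer scales control which scale of the style dominates the guidance.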