How to achieve style transfer with Stable Diffusion Image to Image

Hi, I’ve been experimenting with Stable Diffusion’s image-to-image generation at different guidance scales, but I haven’t been able to get good results where the original image’s features are preserved while the output also picks up the style characteristics described by the prompt.

Methods like DreamBooth are one way to solve this, but I wanted to know if there’s a better approach.

Is adding a perceptual loss during the diffusion steps a good way for this?

You might find this video interesting.

In my personal experiments (unrelated to the above video), I have not had much trouble maintaining the source image features with image-to-image when I keep the strength parameter around 0.5-0.6. It does alter some features though, like the collar in the example below.
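A minimal sketch of that setup with the diffusers `StableDiffusionImg2ImgPipeline` (the model id, image path, and default strength here are illustrative placeholders, not the exact values used above; the heavy imports are kept inside the function because the pipeline downloads weights on first use):

```python
def stylize(image_path, prompt, strength=0.55, guidance_scale=7.5):
    """Run SD image-to-image; lower `strength` preserves more of the source."""
    # Heavy imports kept local: the pipeline downloads model weights on first use.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # placeholder model id
        torch_dtype=torch.float16,
    ).to("cuda")
    init = Image.open(image_path).convert("RGB").resize((512, 512))
    result = pipe(prompt=prompt, image=init,
                  strength=strength, guidance_scale=guidance_scale)
    return result.images[0]
```

With strength near 0.5 roughly half of the denoising steps start from the noised source latents, which is why the composition survives while the style changes.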

I have not had much success using VGG model features like typical style transfer methods to control the Stable Diffusion output. There’s likely something wrong with my implementation, but it does not seem promising.
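For reference, the classic style-transfer loss compares Gram matrices of VGG feature maps. A self-contained sketch of that loss, operating on feature tensors you would extract from a VGG (the network itself is assumed rather than loaded here):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats):
    # feats: (batch, channels, h, w) feature map from one VGG layer
    b, c, h, w = feats.shape
    flat = feats.view(b, c, h * w)
    # Normalized channel-by-channel correlation of the feature map
    return flat @ flat.transpose(1, 2) / (c * h * w)

def vgg_style_loss(gen_feats, style_feats):
    # Sum of MSEs between Gram matrices across the chosen layers
    return sum(F.mse_loss(gram_matrix(g), gram_matrix(s))
               for g, s in zip(gen_feats, style_feats))
```

To use this during sampling you would decode the latents to pixel space each step before passing them through the VGG, which is one place such an implementation can easily go wrong.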

I also tried encoding the style image and using that to calculate the loss. That had some interesting results, but not what I wanted either.
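A sketch of one guidance step under that scheme, assuming `style_latents` came from the VAE encoder; the `scale` parameter multiplies the loss before the gradient step (the function name is illustrative, not from the notebook):

```python
import torch
import torch.nn.functional as F

def latent_guidance_step(latents, style_latents, scale=1.0):
    # Nudge the current denoising latents toward the VAE-encoded style
    # latents along the gradient of a scaled MSE loss.
    latents = latents.detach().requires_grad_(True)
    loss = scale * F.mse_loss(latents, style_latents)
    grad, = torch.autograd.grad(loss, latents)
    return (latents - grad).detach(), loss.item()
```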


With the last one (encoding the style image and using that to calculate the loss), if you multiply the loss by a factor of 3 or 4 it seems to produce some more interesting results. Using your notebook with the mosaic example and the campfire prompt:




Yep, and I also have not tried implementing gradient checkpointing so that I could enable gradients for the UNet as well. That is likely a big limiting factor, but enabling them requires a lot more memory.
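In diffusers, turning that on would look roughly like this (a sketch, with `pipe` assumed to be an already-loaded pipeline):

```python
def prepare_unet_for_guidance(pipe):
    # Recompute activations during backward instead of storing them,
    # trading extra compute for lower memory.
    pipe.unet.enable_gradient_checkpointing()
    # In diffusers, checkpointing is only active when the module is in
    # train mode, so switch it on (this does not update any weights).
    pipe.unet.train()
    # Guidance needs gradients w.r.t. the latents, not the weights.
    for p in pipe.unet.parameters():
        p.requires_grad_(False)
    return pipe
```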

Very cool!



VAE encoding + VGG feature loss


I’ve been experimenting more with the feature/perceptual loss as additional guidance, scaling the loss differently for each of the relevant VGG layers’ outputs, similar to the content and style feature losses in the 2018 lesson 7 course. I think the results are interesting so far.
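A sketch of that per-layer weighting, in the spirit of the lesson 7 losses (the layer weights and content-layer index here are illustrative, not the values used in the experiments above):

```python
import torch
import torch.nn.functional as F

def gram_matrix(feats):
    # feats: (batch, channels, h, w) feature map from one VGG layer
    b, c, h, w = feats.shape
    flat = feats.view(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)

def content_style_loss(gen_feats, content_feats, style_feats,
                       style_weights=(0.25, 0.5, 1.0, 2.0),
                       content_weight=1.0):
    # Content term: MSE on the raw features of one mid-level layer
    # (index 2 here is an illustrative choice).
    content = F.mse_loss(gen_feats[2], content_feats[2])
    # Style term: Gram-matrix MSE per layer, each with its own scale.
    style = sum(w * F.mse_loss(gram_matrix(g), gram_matrix(s))
                for w, g, s in zip(style_weights, gen_feats, style_feats))
    return content_weight * content + style
```

Weighting later layers more heavily tends to emphasize larger-scale style structure, while earlier layers carry finer texture.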