Hi, I’ve tried playing with Stable Diffusion’s image-to-image generation at different guidance scales, but I haven’t been able to get good results that preserve the original image’s features while also picking up new style characteristics from the prompt.
Methods like DreamBooth are one way to solve this, but I wanted to know if there’s a better way.
Is adding a perceptual loss during the diffusion steps a good approach for this?
In my personal experiments (unrelated to the above video), I have not had much trouble maintaining the source image features with image-to-image when I keep the strength parameter around 0.5-0.6. It does alter some features though, like the collar in the example below.
Prompt: ‘Tiny cute 3D felt fiber cat, made from Felt fibers, a 3D render, trending on cgsociety, rendered in maya, rendered in cinema4d, made of yarn, square image’
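For context on why strength around 0.5-0.6 preserves the source: image-to-image noises the init image partway into the schedule and only denoises from there, so lower strength leaves more of the original signal intact. Here's a minimal numpy sketch of that mapping (the linear-beta schedule and helper names are my own illustration, not diffusers internals; exact indexing varies by scheduler):

```python
import numpy as np

def noise_init_image(x0, strength, num_steps=50, beta_start=1e-4,
                     beta_end=0.02, train_steps=1000, seed=0):
    """Noise an init image to the timestep implied by `strength`,
    roughly as img2img pipelines do before denoising (sketch)."""
    # strength=1.0 -> start from (almost) pure noise; strength=0.0 -> keep the image
    start_step = min(int(num_steps * strength), num_steps)
    # corresponding timestep in the full training schedule
    t = int(train_steps / num_steps * start_step)
    betas = np.linspace(beta_start, beta_end, train_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    a = alpha_bar[t - 1] if t > 0 else 1.0
    noise = np.random.default_rng(seed).standard_normal(x0.shape)
    # standard forward-diffusion noising: x_t = sqrt(a)*x0 + sqrt(1-a)*eps
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise
    return xt, start_step
```

At strength 0.5 with 50 steps, denoising only runs the last 25 steps, so the large-scale structure of the source survives while the prompt restyles the details.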
I have not had much success using VGG model features like typical style transfer methods to control the Stable Diffusion output. There’s likely something wrong with my implementation, but it does not seem promising.
With the last one (encoding the style image and using that to calculate the loss), if you multiply your loss by a factor of 3 or 4 it seems to give some more interesting results - using your notebook with the mosaic example and the campfire prompt:
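The scale factor multiplies straight through into the guidance gradient, since the nudge applied to the latents each step is the gradient of the (scaled) loss. A toy numpy sketch with a plain MSE standing in for the real feature loss (all names here are illustrative):

```python
import numpy as np

def guidance_grad(latents, target_features, loss_scale):
    """Gradient of loss_scale * MSE(latents, target) w.r.t. the latents,
    i.e. the quantity subtracted from the latents at each guided step."""
    diff = latents - target_features
    loss = loss_scale * np.mean(diff ** 2)
    grad = loss_scale * 2.0 * diff / diff.size  # analytic d(loss)/d(latents)
    return loss, grad
```

So a 4x loss scale gives exactly a 4x larger per-step pull toward the style target, which is why the effect becomes visible at 3-4x when the unscaled loss was too weak.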
Yep, and I also have not tried implementing gradient checkpointing so that I could enable gradients through the unet as well. That is likely a big limiting factor, but enabling them requires a lot more memory.
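For anyone curious, PyTorch's built-in checkpointing is enough to sketch the idea: activations inside each checkpointed block are recomputed during backward instead of stored, trading compute for the memory needed to backprop a guidance loss through the model. A minimal example with a small stack of linear blocks standing in for the unet (the model here is a toy, not the actual Stable Diffusion unet):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy stand-in for the unet: four small blocks we checkpoint individually.
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.GELU())
     for _ in range(4)]
)

def forward_checkpointed(x):
    # Each block's intermediate activations are dropped after the forward
    # pass and recomputed on demand during backward.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(2, 16, requires_grad=True)
loss = forward_checkpointed(x).pow(2).mean()
loss.backward()  # activations are recomputed block-by-block here
```

The output is identical to a plain forward pass; only the memory/compute trade-off changes, so it should slot into the guided-sampling loop without altering results.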
I’ve been experimenting more with the additional feature/perceptual-loss guidance, scaling the loss differently for each of the relevant VGG layers’ outputs, similar to the content and style feature losses in the 2018 lesson 7 course. I think the results are interesting so far.
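The per-layer weighting from that lesson boils down to a Gram-matrix style loss per layer plus a content loss, each with its own scale. A small numpy sketch of that combination (the feature maps and weights here are placeholders, not actual VGG activations):

```python
import numpy as np

def gram(features):
    """Gram matrix of a (channels, height*width) feature map."""
    c, n = features.shape
    return features @ features.T / n

def weighted_feature_loss(gen_feats, style_feats, content_feats,
                          style_weights, content_weight):
    """Per-layer weighted style (Gram) losses plus a content loss on the
    deepest layer, in the spirit of the lesson 7 style-transfer notebooks."""
    style_loss = sum(
        w * np.mean((gram(g) - gram(s)) ** 2)
        for g, s, w in zip(gen_feats, style_feats, style_weights)
    )
    content_loss = content_weight * np.mean((gen_feats[-1] - content_feats) ** 2)
    return style_loss + content_loss
```

Down-weighting the deeper style layers relative to the shallow ones is the usual knob here: shallow layers carry texture and color, deeper ones carry larger structures, so the per-layer scales control which scale of the style dominates the guidance.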