Questions on Finetuning Stable Diffusion

I would like to understand how to build datasets for finetuning Stable Diffusion. I have the following questions and would appreciate any help with them:

  1. If I understand correctly, we need pairs of images and text captions to finetune a Stable Diffusion model. Say I would like to finetune an SD model to generate only high-quality faces, and I can collect a dataset of 1000+ faces. Wouldn’t I also need captions for them to finetune the model? If so, how can I get captions? Otherwise, can I simply use a standard caption like ‘A photo of a man’ or ‘A photo of a woman’?
  2. In general, how do people finetune Stable Diffusion? Are there repositories that guide people through it?
  3. Could you please share any resources on efficiently finetuning Stable Diffusion?

Thank you very much for your help!

I saw this yesterday…


Thank you, Ben. Let me check this out.

I recommend the Hugging Face diffusers repo, as it has many well-documented example scripts for finetuning Stable Diffusion, among other tasks.
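
On the captions question, here is a minimal sketch of one way to pair a folder of face images with captions. It assumes the `metadata.jsonl` layout used by the Hugging Face datasets image-folder loader (a `file_name` field plus extra columns), and it assumes the caption column is called `text`, which I believe is what the diffusers text-to-image example expects by default; please check the script you end up using. The function name `write_caption_metadata` and the default caption are made up for illustration.

```python
import json
from pathlib import Path

# Assumption: these are the image types in your folder; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def write_caption_metadata(image_dir, default_caption="a photo of a person", captions=None):
    """Write metadata.jsonl pairing every image in image_dir with a caption.

    captions is an optional dict {file name: caption}; any image without
    an entry falls back to default_caption.
    """
    image_dir = Path(image_dir)
    captions = captions or {}
    records = []
    for path in sorted(image_dir.iterdir()):
        # Skip anything that is not an image (including metadata.jsonl itself).
        if path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        records.append(
            {"file_name": path.name, "text": captions.get(path.name, default_caption)}
        )

    # One JSON object per line, which is the JSON Lines format.
    with (image_dir / "metadata.jsonl").open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return records
```

Calling `write_caption_metadata("faces/", captions={"0001.jpg": "a photo of a woman"})` (with `faces/` being your own folder) would caption that one file specifically and give everything else the default. A single fixed caption is probably fine as a starting point, but if you want more varied captions, people often generate them with an image captioning model and pass them in through the `captions` dict.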