I would like to understand how to build datasets for fine-tuning Stable Diffusion. I have the following questions and would appreciate any help with them:
If I understand correctly, we need pairs of images and text captions to fine-tune a Stable Diffusion model. Say I want to fine-tune an SD model to generate only high-quality faces. I can collect a dataset of 1000+ face images, but wouldn’t I also need a caption for each image to fine-tune the model? If so, how can I get those captions? Or can I simply use one standard caption for every image, such as ‘A photo of a man’ or ‘A photo of a woman’?
In general, how do people fine-tune Stable Diffusion? Are there any repositories or guides that walk through the process?
Could you please share any resources on fine-tuning Stable Diffusion efficiently?
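
To make the first question concrete, here is a minimal sketch of the dataset layout I have in mind. It rests on several assumptions of mine: that the images sit in one folder, that a `metadata.jsonl` file with a `file_name` column and a `text` caption column is an acceptable format (this is my understanding of the image-folder convention in the Hugging Face `datasets` library, so please correct me if the training scripts expect something else), and that an optional sidecar `.txt` file holds a per-image caption. The function name `write_metadata` and the placeholder caption are hypothetical, and nothing here is taken from an official recipe.

```python
import json
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the dataset contains.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_metadata(image_dir, default_caption="a photo of a person"):
    """Write metadata.jsonl into image_dir, one JSON record per image.

    Each record has the form {"file_name": ..., "text": ...}. If a sidecar
    file with the same stem and a .txt suffix exists (e.g. face_001.txt for
    face_001.jpg), its contents are used as the caption; otherwise the
    placeholder default_caption is used.
    """
    image_dir = Path(image_dir)
    images = sorted(
        p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )

    records = []
    for path in images:
        sidecar = path.with_suffix(".txt")
        if sidecar.exists():
            caption = sidecar.read_text(encoding="utf-8").strip()
        else:
            caption = default_caption
        records.append({"file_name": path.name, "text": caption})

    # JSON Lines: one JSON object per line.
    with open(image_dir / "metadata.jsonl", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    return records
```

Calling `write_metadata("faces/")` on a folder of face images would give every image without a sidecar file the same placeholder caption, which is exactly the "standard caption" option I am asking about. Whether that is good enough for fine-tuning, or whether per-image captions are needed, is the part I am unsure of.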