Has anyone tried fine-tuning diffusion models? If so, what tips/learnings can you share?

Hi, community,

I am having a lot of fun exploring Lesson 10, 10a, and 10b. Thank you for putting them together!

stable_diffusion.ipynb uses a fine-tuned dreambooth model and mentions that the fine-tuning process is tricky, but does not go into much detail beyond that.

Has anyone attempted to fine-tune a dreambooth or other diffusion models to recognize new vocabulary terms? If so, how did it go? What hyperparameters did you use? What worked well and what did not? Did you happen to learn something interesting?

Here’s what I’ve tried so far:

  1. I tried achieving an effect similar to that of a fine-tuned dreambooth by setting the starting image to that of the target person (i.e. my new vocabulary term of interest). I expected the model to preserve the main human features and apply the style on top of them. This did not work: the model either overwrote the human features completely or did not apply any style features. Here’s the result of 50 inference steps with strength varying from 0.0 to 1.0 in steps of 0.1 (same strength for all images on the same row):
  2. I ran textual_inversion.py with the recommended arguments shown below (replacing the data with my own data and the initializer token with “man”):
accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0
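
A quick sanity check on what those arguments add up to. This is only a sketch, assuming that `--scale_lr` multiplies the base learning rate by batch size x gradient accumulation steps x number of processes (my understanding of the example script, worth checking against the version you run). The `num_processes=1` default (a single GPU) and the helper name `effective_settings` are my own assumptions, not part of the script.

```python
def effective_settings(learning_rate, train_batch_size,
                       gradient_accumulation_steps,
                       num_processes=1, scale_lr=True):
    """Return (effective batch size, learning rate actually used).

    Assumption: with scale_lr enabled, the base learning rate is multiplied
    by the effective batch size. num_processes=1 assumes a single GPU.
    """
    effective_batch = (train_batch_size
                       * gradient_accumulation_steps
                       * num_processes)
    lr = learning_rate * effective_batch if scale_lr else learning_rate
    return effective_batch, lr


# Values taken from the command above.
batch, lr = effective_settings(learning_rate=5.0e-04,
                               train_batch_size=1,
                               gradient_accumulation_steps=4)
print(f"effective batch size: {batch}, learning rate used: {lr:g}")
```

If that assumption holds, the run above updates the embedding with a larger learning rate than the 5.0e-04 typed on the command line, which is worth keeping in mind when comparing hyperparameters with other people.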

Further work

Here’s what I’m planning to try next:

  1. Use dreambooth instead of textual-inversion. Just writing this up made me realise that I was running a different model (textual-inversion instead of dreambooth) from what the Lesson 10 notebook suggested.
  2. Try using Flax instead of PyTorch, as the README suggests a 70% reduction in training time.

I’m trying this: “How to fine tune stable diffusion: how we made the text-to-pokemon model at Lambda”. Will post any results.


What did your results look like? Any tips/tricks to share?

If you’ve tried and it doesn’t work, let me know. I’ll try to clean things up and write a blog post outlining the steps, since some of it was trial and error. My results matched the Lambda Labs ones.


I’ve been obsessively fine tuning models over the past 2 or so months and I’m still hardly an expert, though I’ve found various tricks through reading what the “experts” have tried and then tinkering with their setups and parameters.

There are 3 types of models that I’ve tried to train: Dreambooth, Textual Inversion, and Hypernetworks. I’ve spent a good deal of the recent past on Dreambooth and just got started with Textual Inversion. I’ve tried training some Hypernetworks as well and although they seem to be improving my images, I’m not sure if I’m even training them correctly.

I’m not sure where to begin but I’ll post the most helpful resources I’ve found as well as any findings and/or further adjustments that made the training go more smoothly. Some of these are recent, some may be a few months old and probably outdated but they still seem to work for me.



Dreambooth:

  • Use the following formula:
    • Number of subject images (instance) = N
    • Number of class images (regularization) = N x 12
    • Maximum number of Steps = N x 80 (this is what I’m tweaking right now but between 80 and 100 should be enough)
    • Learning rate = 1e-6
    • Learning rate schedule = polynomial
    • Learning rate warmup steps = Steps / 10
  • Dreambooth is probably the easiest and fastest way to train SD to generate one type of image, and only that type, very well, but it tends to overfit and “forget” other parts of the network. It sounds counterintuitive, but less is more when it comes to the number of images you train it on; pick high quality images that represent a wide range of styles, settings, etc. that you expect your subject to be in. Rumor has it that F222, the most popular NSFW model, was trained in Dreambooth using only 11 images!
  • Regularization images help mitigate some of the undesirable side effects of Dreambooth including overfitting and “forgetting” how to generate related images.
    • By default, regularization images will be generated using your SD class prompt if you didn’t provide enough of them, at least if you’re using Shivam’s implementation from the Reddit post.
    • They should represent the general case of what you’re trying to train. For example, if you’re trying to train your face, use similar faces as your regularization images, preferably from people from your gender, race, age group, hair style, etc.
    • If you’re gonna generate the regularization images yourself, make sure you only use the high quality ones and toss the ones with obvious mistakes like people with weird limbs.
    • Although it’s not the intended use for regularization images, I found they can be a way to sneak a new style into your model. Include regularization images that emphasize a particular style, even if they weren’t generated by SD, and your final model will likely produce images of that style. I don’t recommend doing this in the beginning, though; it’s just something I discovered by accident.
  • Instance/Class labels
    • The instance prompt is what you’ll train the model to look for when you want to use the model to generate images similar to what you trained it on. You might use something like “zwx cat” for a particular cat that you intend to train your model on.
    • The class label is the general case of what you’re training. Simply “cat” may be fine in your “zwx cat” example although if zwx cat is a black cat, you can try to be more specific like “black cat”.
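
The formula at the top of this list is easy to get wrong by hand, so here is a small sketch that derives the settings from N. The function name and the dictionary keys are just labels I picked for illustration (they are not guaranteed to match the flags of any particular Dreambooth script), and `steps_per_image=80` is the lower end of the 80 to 100 range mentioned above.

```python
def dreambooth_settings(num_instance_images, steps_per_image=80):
    """Derive Dreambooth settings from N subject images.

    Follows the rule of thumb from this post:
      class images = N x 12, max steps = N x 80 (up to 100),
      learning rate = 1e-6, polynomial schedule, warmup = steps / 10.
    Key names are illustrative, not tied to a specific training script.
    """
    if num_instance_images < 1:
        raise ValueError("need at least one subject image")
    max_steps = num_instance_images * steps_per_image
    return {
        "num_class_images": num_instance_images * 12,
        "max_train_steps": max_steps,
        "learning_rate": 1e-6,
        "lr_scheduler": "polynomial",
        "lr_warmup_steps": max_steps // 10,  # rounded down to a whole step
    }


# Example: 20 photos of the subject.
for name, value in dreambooth_settings(20).items():
    print(f"{name}: {value}")
```

For 20 subject images this works out to 240 class images, 1600 steps, and 160 warmup steps; pass `steps_per_image=100` to try the upper end of the range.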

Textual Inversion:



  • You can train directly from Automatic1111 instead of using a separate notebook.
  • It’s much easier (but still a lot of work) to train, say, 100 Textual Inversion models for 100 different faces than to train a Dreambooth model for the same 100 faces.
  • Textual inversion can only generate the things that the base model you’re using is trained on, but it makes it easier to do so. Dreambooth can generate things that the base model has never seen before.
  • In theory, Textual Inversion should only work on the model it was trained on. In practice, it also works, though sometimes in a limited capacity, on other models that share the same base model (e.g. SD 1.5) as the one it was trained on.
  • The files are much smaller and you can combine multiple inversions at once whereas with Dreambooth, you have to keep swapping models.
  • You need to specify a vector size when you first train your model. Keep in mind that these vectors count against the 75 max vectors allotted to your prompt. If you’re using a vector size of 16 for your TI class “zwx”, then mentioning “zwx” in your prompt will essentially use up the same # of vectors as, say, 16 normal words. In practice, try to keep vectors to a minimum, usually up to 16.
  • Some tips for labeling images:
    • Put [name] somewhere in the label so that the model will associate your image with your trigger word.
    • It sounds counter-intuitive but DO NOT describe the things that you want the model to generalize! For example, if you’re training on images of cats wearing hats, DO NOT label a particular image as [name] cat in a hat. Simply use [name] cat (or even just [name]). It’s not the end of the world if you described the image in detail, but if you labeled a bunch of cats wearing hats as [name] cat in a hat, you’ll need to prompt for [name] cat in a hat instead of simply [name] when you want to use the Textual Inversion.
  • Before training, you’ll want to have xformers installed, enable cross attention optimizations for training in the settings, and have it move the VAE to RAM, to maximize the amount of GPU RAM you have available for training.
  • See the Reddit post for the optimal settings, BUT one thing that seems to make a huge difference is gradient accumulation. You also want to use the maximum batch size you can get away with (with Google Colab on a 16 GB GPU, it’s probably around 11 or 12). If you run out of GPU RAM while trying to train, interrupt the execution in your notebook and restart the runtime to clear out the GPU RAM. Also, keep in mind that (batch size) * (gradient accumulation) <= (your # of images). Maximize your batch size first, then solve for the largest gradient accumulation that still respects that inequality.
  • You’ll find that even if you follow the recommended path set by the Reddit post above, your training still might not work. Unfortunately, there’s no one size fits all approach so it’s important to check your results and adapt. Is there a checkpoint somewhere in the middle that gave half decent results? If so, resume training from that one with a lower learning rate.
  • I also seem to be having some success using a cyclical learning rate (that Jeremy has been harping about constantly in Fastai) though it can be a bit tedious to implement. You can try the following “cyclical” schedule in the learning rate field if you’ll be training for 2000 steps:
5e-2:10, 5e-3:150, 5e-4:200, 5e-2:210, 5e-4:300, 5e-2:310, 5e-4:400, 5e-2:410, 5e-4:500, 5e-2:510, 5e-4:600, 5e-2:610, 5e-4:700, 5e-2:710, 5e-4:800, 5e-2:810, 5e-4:900, 5e-2:910, 5e-4:1000, 5e-3:1010, 5e-5:1100, 5e-3:1110, 5e-5:1200, 5e-3:1210, 5e-5:1300, 5e-3:1310, 5e-5:1400, 5e-3:1410, 5e-5:1500, 5e-3:1510, 5e-5:1600, 5e-3:1610, 5e-5:1700, 5e-3:1710, 5e-5:1800, 5e-3:1810, 5e-5:1900, 5e-3:1910, 5e-5:2000
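
To make the last two bullets a bit more concrete, here is a rough sketch of how I read them. It assumes each `rate:step` pair means "use this rate until that step" (piecewise constant, no interpolation); whether the boundary step itself belongs to the old or the new rate is my assumption. The helper names (`parse_schedule`, `rate_at`, `max_grad_accum`) are made up for this example and are not part of Automatic1111.

```python
from typing import List, Optional, Tuple

Schedule = List[Tuple[float, Optional[int]]]


def parse_schedule(text: str) -> Schedule:
    """Parse 'rate:step, rate:step, ...' into (rate, last_step) pairs."""
    pairs: Schedule = []
    for chunk in text.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        rate, sep, last_step = chunk.partition(":")
        # A bare rate with no ':step' applies to all remaining steps.
        pairs.append((float(rate), int(last_step) if sep else None))
    return pairs


def rate_at(pairs: Schedule, step: int) -> float:
    """Learning rate at a 1-indexed step (assumed piecewise constant)."""
    if not pairs:
        raise ValueError("empty schedule")
    for rate, last_step in pairs:
        if last_step is None or step <= last_step:
            return rate
    return pairs[-1][0]  # past the end: keep the last rate (assumption)


def max_grad_accum(num_images: int, batch_size: int) -> int:
    """Largest accumulation with batch_size * accumulation <= num_images."""
    if not 1 <= batch_size <= num_images:
        raise ValueError("batch size must be between 1 and the image count")
    return num_images // batch_size


# First few entries of the schedule above.
schedule = parse_schedule("5e-2:10, 5e-3:150, 5e-4:200, 5e-2:210")
for step in (5, 100, 180, 205):
    print(step, rate_at(schedule, step))

# Hypothetical example: 30 training images, batch size 11.
print(max_grad_accum(num_images=30, batch_size=11))
```

Printing the rate at a handful of steps like this is a cheap way to catch typos in a long schedule string before committing to a 2000-step run.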