Image Dataset for Finetuning

Hi Everyone,

I’ve been working with images from the free stock photos website Pexels in my experiments finetuning Stable Diffusion.

I have a large dataset of images and image attributes (e.g., descriptions, tags, etc.) and thought I would share it with the community. Specifically, the dataset contains 109,971 images (down from 110,715 after removing a few panoramic images). Each image has at least a title/prompt.

Kaggle Datasets

Sample Image with Depth Map

Sample Entry

img_id 3186010
title Pink and White Ice Cream Neon Signage
aspect_ratio 0.749809
main_color [128, 38, 77]
colors [#000000, #a52a2a, #bc8f8f, #c71585, #d02090, #d8bfd8]
tags [bright, chocolate, close-up, cold, cream, creamy, cup, dairy product, delicious, design, dessert, electricity, epicure, flavors, fluorescent, food, food photography, goody, hand, ice cream, icecream, illuminated, indulgence, light pink background, neon, neon lights, neon sign, pastry, pink background, pink wallpaper, scoop, sweet, sweets, tasty]
adult very_unlikely
aperture 1.8
camera iPhone X
focal_length 4.0
google_place_id ChIJkUjxJ7it1y0R4qOVTbWHlR4
iso 40.0
latitude -7.746914
longitude 113.226906
manufacturer Apple
medical very_unlikely
orientation 0.0
racy unlikely
software 13.1.3
spoof very_unlikely
violence very_unlikely
location Kecamatan Mayangan, Jawa Timur, Indonesia

I created two versions of the image dataset: one with a minimum dimension of 512 and the other with a minimum dimension of 768. In both versions, the maximum dimensions are multiples of 32.
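As an illustration of that sizing scheme, here is a minimal sketch. The exact resizing procedure is my assumption, not necessarily what the dataset uses: scale so the short side reaches the target minimum, then snap each side down to the nearest multiple of 32.

```python
# Hypothetical sketch of the resizing scheme: scale so the short side
# reaches the target minimum (e.g., 512), then round each dimension
# down to the nearest multiple of 32.

def target_size(width: int, height: int, min_dim: int = 512) -> tuple[int, int]:
    """Scale (width, height) so the short side is min_dim, then round
    each side down to a multiple of 32."""
    scale = min_dim / min(width, height)
    new_w = int(width * scale) // 32 * 32
    new_h = int(height * scale) // 32 * 32
    return new_w, new_h

print(target_size(4032, 3024))  # -> (672, 512) for a 4:3 iPhone photo
```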

I already uploaded the Pandas DataFrame containing the image attributes to a GitHub Repository. I also included notebooks to explore the dataset and download source files from Pexels. Keep in mind that Pexels limits each IP address to about 500 downloads. You can use a VPN to get around this if needed. Also, the source files are rather large.
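For anyone who wants to poke at the attributes before downloading anything, a quick sketch of filtering the DataFrame with pandas. The column names follow the sample entry above; the inline records are stand-ins for the real file in the repo:

```python
import pandas as pd

# Stand-in records mirroring the attribute columns from the sample
# entry; the actual DataFrame in the GitHub repo is much larger.
df = pd.DataFrame([
    {"img_id": 3186010,
     "title": "Pink and White Ice Cream Neon Signage",
     "aspect_ratio": 0.749809,
     "tags": ["neon", "ice cream", "dessert"]},
    {"img_id": 1234567,
     "title": "Example Landscape",
     "aspect_ratio": 1.5,
     "tags": ["mountain", "sky"]},
])

# Keep portrait images whose tags mention "neon".
portrait_neon = df[(df["aspect_ratio"] < 1.0)
                   & df["tags"].apply(lambda t: "neon" in t)]
print(portrait_neon["title"].tolist())  # ['Pink and White Ice Cream Neon Signage']
```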

I plan to upload the two versions of the dataset to Kaggle as the Pexels license is quite generous regarding usage. However, I was hoping someone here would double-check the license’s wording to verify it’s allowed.


The 512p version is now available on Kaggle. The 768p version is about a third of the way uploaded.


The 768p version finally finished processing on Kaggle.


I just finished my first attempt at fine-tuning the depth2img v2.0 model on a thousand images from the Pexels dataset. I am currently uploading the depth images to Kaggle. Initial results seem OK, so I'll test using DreamBooth on the fine-tuned model to add a new style.


Here are the depth images for the 512p dataset.

Ok, this shows some promise.


The current depth maps often lose far backgrounds, so I'll need to investigate ways to address that.

My goal for this project is to generate datasets of stylized images that I can then use to train a real-time style transfer model as an update to my old in-game style transfer tutorial.


Thanks for sharing this

I concluded that the initial depth maps I generated suck, so I am currently making higher-quality ones. Hopefully, it will take less than a day to get through the Pexels dataset.

The new depth maps seem to work better.

Sample image with new depth map

Result with old depth map

Result with new depth map


Thank you for sharing these datasets. Are the new depth maps updated in the links? Is the difference due to MiDaS vs. BMD?

I am starting to fine-tune SD models as well! How far along are you with your project?

They are newer, but I'm still not 100% satisfied with them.

There is another depth estimation model I came across that performs way better with most images but falls apart when there is too much blur. I have some ideas to remedy that, but have not had time to explore them yet.

I plan to upload my fine-tuning notebooks to GitHub in the next few days.


I want to fine-tune depth2img too. Since I couldn't find any existing notebook, I was planning to adapt this DreamBooth notebook to kohya-trainer (for training images with non-square aspect ratios).

Looking forward to your notebooks. Happy to be an early tester and to help contribute.

I still need to make a tutorial post, but you can check out some of the notebooks I use at the links below.

I have not had a chance to try newer fine-tuning methods like LoRA (Low-Rank Adaptation).

I have only used a batch size of 1 due to memory constraints, so I have not needed to do anything special for non-uniform aspect ratios.
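For anyone who does need batch sizes above 1, trainers like kohya-trainer handle non-uniform images with aspect-ratio bucketing: images are grouped into buckets that share a resolution so each batch is uniform. A rough sketch of the grouping step (the bucket list here is made up for illustration, not taken from any trainer's defaults):

```python
from collections import defaultdict

# Hypothetical bucket resolutions for illustration; real trainers
# generate these from a target pixel budget.
BUCKETS = [(512, 512), (576, 448), (448, 576), (640, 384), (384, 640)]

def nearest_bucket(width: int, height: int) -> tuple[int, int]:
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))

def group_by_bucket(sizes):
    """Map each (width, height) pair to its nearest bucket."""
    groups = defaultdict(list)
    for w, h in sizes:
        groups[nearest_bucket(w, h)].append((w, h))
    return dict(groups)

print(nearest_bucket(3024, 4032))  # portrait 3:4 -> (448, 576)
```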

Here is a screenshot of the gradio annotation interface.

I forgot to answer your question about how I made the depth images. The current versions on Kaggle still use MiDaS, just at higher resolutions. I have also tried LeReS and Boosting Monocular Depth (BMD). BMD is neat, but I need to find an image size that provides superior results while not taking an age to get through the dataset. LeReS is the one that works great for many images but falls apart with too much blur.
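For reference, MiDaS-style models output relative inverse depth (larger values mean closer), so saving a depth map as an image involves normalizing it to an 8-bit range. This sketch shows the common post-processing step; the exact steps used for the Kaggle depth images are my assumption:

```python
import numpy as np

# Normalize a raw relative-depth array to 0-255 so it can be saved as
# an 8-bit grayscale image. Guard against a flat map to avoid dividing
# by zero.
def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    d_min, d_max = depth.min(), depth.max()
    if d_max - d_min < 1e-8:
        return np.zeros_like(depth, dtype=np.uint8)
    norm = (depth - d_min) / (d_max - d_min)
    return (norm * 255.0).round().astype(np.uint8)

raw = np.array([[0.1, 0.5], [0.9, 0.5]])
print(depth_to_uint8(raw))  # [[0, 128], [255, 128]]
```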

Also, use the latest CUDA version included with the newest PyTorch release. Inference with the Stable Diffusion U-Net is twice as fast for me, and training gets a smaller (but still significant) speedup.


Thank you for sharing! I will try them out 🙂

BTW, did you see this? Very interesting addition: GitHub - lllyasviel/ControlNet: Let us control diffusion models


I saw that posted on the Stable Diffusion subreddit, but I have not had a chance to check it out. It does look promising, though.

I am currently cleaning up my existing Jupyter notebooks to turn some of them into blog posts. I accumulated over 70 notebooks during the course and whittled them back down to about 30. I want to finish that before starting new experiments.


After that, I would like to see if I could use perceptual loss (i.e., activations from a VGG or other model) with ControlNet. That should be less hit-or-miss than relying on a depth estimation model.

I’ve been using ControlNet for the past few days and absolutely love it. I think it’s such a clever idea for conditioning SD model generations! I’ve been applying it to some of my custom-trained SD models and got excellent results with the Canny edges, HED, and depth options.

BTW, no training is required to use it! The base trained ControlNet model is sufficient to guide an already-trained SD model.