Hi everyone! I’ve been working my way through the fastai course and went off on a bit of a side track: experimenting with whether I could improve my results in the Paddy Rice Kaggle competition using synthetic training data generated with Stable Diffusion.
I trained a LoRA on the real Tungro grass images. To get some control over the generated output, I first clustered the real images, and when feeding each image into LoRA training, I appended its cluster name to that image’s caption.
For those who don’t know, when you train a LoRA for Stable Diffusion, you supply a caption for each training image, like “A picture of grass in a field, close up, muddy water, highly detailed veins, etc.” These captions give you good control over the images you later generate with the LoRA.
So, I decided to put the cluster number of each grouped real image into its caption. The hope was that, having trained on “cluster1, cluster2, etc.,” I could then generate images that blend clusters by using prompts like “cluster1 cluster18, tungro grass disease…”.
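Roughly, the clustering and captioning step looked like this (a simplified sketch, not my exact pipeline; the folder, the backbone, and the cluster count are all illustrative):

```python
from pathlib import Path
import torch
from torchvision import models, transforms
from PIL import Image
from sklearn.cluster import KMeans

path = Path('tungro_images')   # folder of real Tungro photos (illustrative)
files = sorted(path.glob('*.jpg'))

# Pretrained ResNet as a feature extractor: drop the classification head
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()
model.eval()

tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

with torch.no_grad():
    feats = torch.stack([model(tfm(Image.open(f).convert('RGB'))[None])[0] for f in files])

# Cluster the embeddings (20 clusters is illustrative)
labels = KMeans(n_clusters=20, random_state=42).fit_predict(feats.numpy())

# Most LoRA trainers (kohya_ss, for example) read a .txt caption file next to
# each image, so write the cluster token into that caption
for f, c in zip(files, labels):
    f.with_suffix('.txt').write_text(f'cluster{c}, tungro grass disease, photo of rice leaves')
```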
I found that “cluster1,” “cluster2,” etc. became very strong embeddings even with a small number of images per cluster, far stronger than I had anticipated. This gave me a good degree of control over the synthetic images I generated.
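Generation then just means loading the LoRA on top of a base SD 1.5 checkpoint and mixing cluster tokens in the prompt. With the diffusers library it looks something like this (model ID, LoRA path, and prompt are illustrative):

```python
import torch
from diffusers import StableDiffusionPipeline

# Base SD 1.5 checkpoint plus the trained cluster LoRA (paths illustrative)
pipe = StableDiffusionPipeline.from_pretrained(
    'runwayml/stable-diffusion-v1-5', torch_dtype=torch.float16,
).to('cuda')
pipe.load_lora_weights('lora/tungro_clusters.safetensors')

# Blending two cluster tokens in one prompt interpolates between the groups
prompt = 'cluster1 cluster18, tungro grass disease, photo of rice leaves in a field'
images = pipe(prompt, num_inference_steps=30, num_images_per_prompt=4).images
for i, im in enumerate(images):
    im.save(f'synthetic_tungro_{i:04d}.png')
```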
I spent some time thinking about how to compare synthetic + real images vs. real images alone in training. What I ended up with was Jeremy’s “road to the top” notebook, but with his split replaced by a GrandparentSplitter so that both runs use exactly the same validation folder:
```python
def train(arch, size, path, train='train', valid='valid',
          item=Resize(480, method='squish'), finetune=True, epochs=10, accum=1):
    # from_folder builds a GrandparentSplitter(train_name=train, valid_name=valid)
    # internally, so passing the folder names gives a fixed train/valid split
    dls = ImageDataLoaders.from_folder(
        path, train=train, valid=valid,
        item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75),
        bs=64 // accum,
    )
    # Accumulate gradients so the effective batch size stays at 64
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    if finetune:
        learn.fine_tune(epochs, 1e-3)  # use a lower learning rate
        # tst_files (the competition test images) is defined earlier in the notebook
        tta_result = learn.tta(dl=dls.test_dl(tst_files))
        return learn, tta_result
    else:
        learn.unfreeze()
        learn.fit_one_cycle(epochs, 1e-3)  # use a lower maximum learning rate
        return learn, None
```
To make the test as fair as possible, I created two training sets that were identical except that one also contained my synthetic Tungro images, and used the same validation and test sets for both. Then I ran each model with the same architectures for 10 epochs.
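The comparison then comes down to two identical calls to train, where only the contents of the training folder differ (folder names are illustrative; the usual fastai notebook imports are assumed to be in scope):

```python
# Same architecture, epochs, and validation/test data for both runs;
# only the train folder contents differ
learn_real, tta_real = train('convnext_large_in22k', 224, Path('data/real_only'))
learn_syn,  tta_syn  = train('convnext_large_in22k', 224, Path('data/real_plus_synthetic'))
```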
My synthetic + real set beats the real-only set by about 0.3% every time I run it, across different architectures, which seems trivial at first glance. However, I only added synthetic data to 1 of the 10 possible categories, and I generated the images with a Stable Diffusion v1.5 model rather than an SDXL one, which should produce considerably better images.
For those interested, the results with convnext_large_in22k, 10 epochs, and TTA were:
With synthetic: 97.580%
Real only: 97.235%
I’m wondering if anyone has suggestions for what I could try next, wants to know more, or would like to get involved?
Cheers
John