LR finder and small batch sizes

Sometimes when I use a large complex model, I need to use a small batch size to have my data fit into memory.

When using the LR finder with small batch sizes, the loss jumps around a lot and it becomes difficult to pick a good LR.

What are some strategies around this without needing to increase the batch size?

For example, how would I pick an LR from this:


In this case I would try out 1e-4. But try out higher learning rates and if it doesn’t overfit or the loss doesn’t diverge, it’s probably good.
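One thing that can make a noisy LR finder curve easier to read is smoothing the recorded losses before picking a value. Below is a minimal sketch of a bias-corrected exponential moving average in plain Python. The function name `ema_smooth`, the `beta` value and the sample losses are all made up for illustration and are not taken from any library.

```python
def ema_smooth(losses, beta=0.9):
    """Bias-corrected exponential moving average of a loss sequence."""
    avg, out = 0.0, []
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        # Divide by (1 - beta**i) to correct for starting the average at zero.
        out.append(avg / (1 - beta ** i))
    return out

# Hypothetical noisy losses, as you might record with a small batch size.
noisy = [1.0, 1.4, 0.7, 1.2, 0.6, 1.0, 0.5, 0.9]
smooth = ema_smooth(noisy)
```

A higher `beta` smooths more but reacts more slowly, so the point where the loss starts to diverge will show up a little later on the smoothed curve.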

I agree, with a low batch size it’s harder to determine the learning rate. One possible solution might be to use gradient accumulation to simulate a higher batch size just for running the LR finder. However, as a rule of thumb the optimal learning rate scales roughly in proportion to batch size, so you would have to divide the learning rate you find by the same factor. For example: use gradient accumulation to simulate bs=32, find the learning rate, divide it by 2, and use that for bs=16.
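To make the idea of gradient accumulation concrete, here is a small sketch in plain Python (no deep learning library) for a one-parameter model `y = w * x` with squared-error loss. It assumes the accumulated gradients are normalized by the total number of samples before the single optimizer step. All names (`grad_sum`, `accumulated_grad`, `micro_bs`, `n_step`) are made up for illustration.

```python
def grad_sum(w, batch):
    """Sum over the batch of d/dw (w*x - y)^2 = 2*x*(w*x - y)."""
    return sum(2.0 * x * (w * x - y) for x, y in batch)

def accumulated_grad(w, data, micro_bs, n_step):
    """Mean gradient over micro_bs * n_step samples, one micro-batch at a time."""
    total, seen = 0.0, 0
    for step in range(n_step):
        micro = data[step * micro_bs:(step + 1) * micro_bs]
        total += grad_sum(w, micro)  # accumulate instead of updating the weight
        seen += len(micro)
    return total / seen              # normalize once, before the optimizer step

# Toy data: 32 samples of y = 3x + 1.
data = [(float(i), 3.0 * i + 1.0) for i in range(32)]
w = 0.5

full = grad_sum(w, data) / len(data)                      # one real batch of 32
accum = accumulated_grad(w, data, micro_bs=16, n_step=2)  # two micro-batches of 16
```

Here `full` and `accum` match, which is why accumulation can stand in for a larger batch. This equivalence does not cover layers like batchnorm, whose statistics are still computed per micro-batch.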


Thanks @ilovescience I’ll try that and see if that helps. I’ve just been using trial and error for the moment, but was just wondering if there was another way.

Looking around the forums, it sounds like for gradient accumulation I might need to write a custom training loop? Or maybe use callbacks?

It’s already implemented. But supposedly there are some instabilities with batchnorm. But feel free to try it out.


Thanks! Even better :slight_smile: I’ll try that one out and see if that helps

Yea it looks like if I wanted to use the accumulator with EfficientNets it might still be a little unstable if we are using BatchNorm?

This kernel has some version of gradient accumulation that seemed to work for an actual use case with batchnorm. But honestly, I have never tried this before, so I can’t guarantee any of these callbacks will work properly.

Just reporting back on some experiment runs I tried.

With the code below, using bs=16, with a step of 8 to try for an effective bs of 128 with efficientnet b5:

callback_fns = [ShowGraph, partial(AccumulateScheduler, n_step=8)]

learner = Learner(data_bunch_cleaned, arch, model_dir="models", loss_func=MSELossFlat(reduction='sum'), metrics=quadratic_kappa, callback_fns=callback_fns)

I ran the LR Finder and from this loss graph I picked 1e-4

Then I tried training at 1e-4, and then tried your suggestion of scaling it in proportion to batch size, so I divided it by the number of accumulation steps: 1e-4/8.

Both results returned poor accuracies.
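For reference, the arithmetic behind those two runs can be written out as below. This is only a sketch of the linear scaling heuristic mentioned earlier in the thread, using the numbers from this experiment. `scale_lr` is a hypothetical helper, not a fastai function.

```python
def scale_lr(lr_found, bs_found, bs_train):
    """Linear scaling heuristic: learning rate proportional to batch size."""
    return lr_found * bs_train / bs_found

bs, n_step = 16, 8
effective_bs = bs * n_step  # batch size simulated by accumulating 8 steps of 16
lr_found = 1e-4             # value picked from the LR finder with accumulation on

# LR to use when training at plain bs=16 (no accumulation), i.e. 1e-4 / 8.
lr_for_bs16 = scale_lr(lr_found, effective_bs, bs)
```

Under this heuristic the division only applies when training at plain bs=16. If the accumulator stays enabled during training, the effective batch size is still 128 and the LR found with it would be used as is.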

Thanks for sharing your results. Unfortunately, I don’t know what the problem is.

Could you please share the loss curves and the curves of the metric?

EfficientNets are also notoriously hard to train. I would try training with an EfficientNet-B0 first and scale up once things look good. I haven’t done a lot of experiments with EfficientNets, but based on the couple I have done and the discussion I have seen here, results on smaller EfficientNet models usually scale up pretty well.

Finally, what is the dataset you are using? Training difficulties may also be due to properties of the dataset. Given that you are using quadratic kappa, I am going to guess that you are using a diabetic retinopathy dataset?

Oh yea sure, I’ll try that. I actually started with an effnet-b2 network with bs=64, but I got slightly better results with an effnet-b5 with bs=16, so I began training more with that.

And yes you are right, I am using a combined 2015 and 2019 diabetic retinopathy dataset and using 224px. I preprocessed and re-saved the original images with crop + resize at 224px.

Maybe I’ll try some smaller efficientnets.

Below are some loss curves I’ve been getting without using the Accumulator (I’m training on a single RTX2070 with fp16, fit_one_cycle, frozen). I ran only one epoch with the Accumulator and got a loss of 13, so I didn’t bother training any further.

effnet-b2 bs=64:

effnet-b5 bs=16

Also, what makes Effnets difficult to train? Is there anything specific to training Effnets that you think I should watch out for?

OK you are only training for 3 epochs. For those three epochs QWK = 0.65 actually doesn’t seem terribly bad. Note in my experiments I trained on the previous dataset for like 10 epochs, then the 2019 dataset for like 30 epochs. So I would run the experiment for many more epochs.
Also, it doesn’t seem that the B5 model is showing much advantage here. I would recommend doing the experiments with B3 first and then when you get good results with B3 you can move to B5 and probably get a gain in QWK.

OK great thanks @ilovescience I will try that.

I’m just using 3 epochs at the moment as I experiment with searching for hyperparameters. And then yea, I’ll start increasing the number of epochs.

I’ll try training with b3, then scale to b5 with more epochs, as you suggested.

Thanks for all your help! It’s kind of gone off topic now :slight_smile: But appreciate all the new tips to try!

I agree it’s a good idea to use a few epochs to tune hyperparameters, but obviously that will have a lower QWK.

Yes, if you want, you can start a new thread if you have any other questions! Have fun!

Hey @ilovescience I just had one other question. When you said to try b3 first, then move to b5, did you mean to transfer learn the b3 network to b5? Or did you mean to try them separately and see which performed better?

Yeah I meant try them separately.

Ahh great, thanks! :slight_smile: