Batch Size and Gradual Unfreezing

Wouldn’t it make sense to use a dynamic batch size when doing gradual unfreezing in transfer learning? If not, I guess you just waste a lot of memory:

learner.freeze_to(-1)  # only the head is trainable
bs = PARAMS["BS"] * 16  # so the batch size could be, say, 16x larger

I am using fastai in combination with huge Transformer models, and this makes a lot of sense to me. Why should I use the same batch size when fine-tuning a single linear layer vs. training a fully unfrozen RoBERTa Large?
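A minimal sketch of the idea, framework-agnostic rather than actual fastai API. The base batch size, the number of layer groups, and the 16x upper scale factor are all illustrative assumptions, not measured values:

```python
# Sketch: pick a batch size per gradual-unfreezing stage.
# The more layer groups are frozen, the less activation/gradient memory
# is needed, so the batch size can grow. Scale factors are made up.

def stage_batch_size(base_bs, n_frozen_groups, n_total_groups):
    """Scale the base batch size by the fraction of frozen layer groups,
    rounded down to a power of two for GPU friendliness."""
    frozen_frac = n_frozen_groups / n_total_groups
    # everything frozen except the head -> up to 16x the base bs
    scale = 1 + int(15 * frozen_frac)
    bs = base_bs
    while bs * 2 <= base_bs * scale:
        bs *= 2
    return bs

print(stage_batch_size(2, 11, 12))  # head-only fine-tuning -> 16
print(stage_batch_size(2, 0, 12))   # fully unfrozen -> 2
```

In practice you would rebuild your DataLoaders with the new batch size before each unfreezing stage; whether the memory actually allows the larger value still has to be checked empirically.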

It would be nice if you could share your thoughts. I am running some experiments with this right now and will update this thread with the results.

One thing I could think of:
big batch size --> fewer updates per epoch (maybe noisy directions are corrected later?)
smaller batch size --> more updates per epoch (so direction can be corrected more often)

(A little bit like gradient descent vs. SGD.)
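To make the update-count point above concrete (the dataset size is an illustrative assumption):

```python
import math

n_samples = 10_000  # illustrative dataset size

# a bigger batch size directly means fewer optimizer updates per epoch
for bs in (2, 32, 512):
    updates = math.ceil(n_samples / bs)
    print(f"bs={bs:4d} -> {updates} updates per epoch")
# bs=   2 -> 5000 updates per epoch
# bs=  32 ->  313 updates per epoch
# bs= 512 ->   20 updates per epoch
```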

I guess this will also depend on the optimizer used?
When you increase the batch size do you adapt the learning rate too?
How big is the difference in the execution time?
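On the learning-rate question: a common heuristic is the linear scaling rule, where the learning rate grows proportionally with the batch size. This is a rule of thumb, not something verified in this thread, and the numbers below are illustrative:

```python
def scaled_lr(base_lr, base_bs, new_bs):
    """Linear scaling heuristic: lr proportional to batch size.

    A sqrt scaling, base_lr * (new_bs / base_bs) ** 0.5, is another
    popular alternative; which works better is an empirical question.
    """
    return base_lr * new_bs / base_bs

# tuned lr of 1e-5 at bs=2, then moving to bs=32
print(scaled_lr(1e-5, 2, 32))  # 1.6e-04
```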


Thanks @MicPie
Keep in mind that batch sizes become relatively small when using these giant Transformers. I have to lower it all the way to 2 when fitting RoBERTa Large on my Titan Xp. Interestingly, things like the learning rate finder start to fail with a batch size that small.

I am still running the experiments. Once I have it I will share my code and results.

Ok, that’s already a small bs.

I am not used to training huge LMs, but maybe people use gradient accumulation when the bs is that small?
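For reference, gradient accumulation sums (or averages) gradients over several micro-batches and applies a single optimizer step, so the effective batch size is accum_steps × micro_bs even when only a tiny micro-batch fits in memory. A toy sketch with made-up numbers and a stand-in gradient function (in a real setup this would be loss.backward() per micro-batch plus one optimizer.step()):

```python
def grad_of(micro_batch):
    # stand-in for computing the gradient of one micro-batch
    return sum(micro_batch) / len(micro_batch)

def accumulated_step(param, lr, micro_batches):
    """One parameter update from the gradient averaged over micro-batches."""
    grad = 0.0
    for mb in micro_batches:
        grad += grad_of(mb) / len(micro_batches)
    return param - lr * grad  # single update, as if bs were 8

# effective batch of 8 built from four micro-batches of size 2
micro_batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(accumulated_step(0.0, 0.1, micro_batches))  # -0.45
```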