Effective Batch Size

With large images, or on a small GPU, the batch size may need to be much smaller than what is best for training. Does anyone know if there is any functionality that could create an “effective batch size” to compensate for this?

You could run forward and backward passes on several sub-minibatches, accumulating gradients, and only then do optimizer.step(). I think it would be equivalent to using a single larger minibatch in the usual way.
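
In plain PyTorch that would look roughly like the sketch below (model, dataloader, loss_fn, and optimizer are placeholder names; the loss is divided by the number of sub-minibatches so the accumulated gradient matches the average over the full batch):

```python
accum_steps = 4  # number of sub-minibatches to accumulate per optimizer step

optimizer.zero_grad()
for i, (xb, yb) in enumerate(dataloader):
    loss = loss_fn(model(xb), yb)
    # Scale the loss so the summed gradients equal the mean over the large batch
    (loss / accum_steps).backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```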

Yeah, this is what I was thinking about doing if there doesn’t appear to be anything built in the library already.

If it isn’t already in the library, it probably will be at some point, such as in a future part 2 or 3. It was covered last year in part 2 of the course (I think). If you are using a smaller GPU, I think it is fine to use a smaller batch size until we get to that point. With multi-GPU training, a lot of the time you are effectively splitting batches across multiple GPUs, so no single GPU has enough space for the entire batch anyway. Getting the very best results generally requires very large batch sizes across multiple GPUs, though that is not needed in many cases.

In fastai v1 there was a batch size finder; I’ve been itching to port it over to v2. Here’s the Medium article for anyone interested:

https://medium.com/@danielhuynh_48554/implementing-a-batch-size-finder-in-fastai-how-to-get-a-4x-speedup-with-better-generalization-813d686f6bdf?source=friends_link&sk=41c81410bfbd3eb4373cd50253397385

Just to clarify, since that article is based on older information (2018): isn’t it more acceptable now to train with larger batch sizes? (https://arxiv.org/pdf/1904.00962.pdf + last year’s fastai)

Just wanted to check, as from my understanding the first tweet in that article is no longer true, and I wouldn’t want to crush the dreams and aspirations of newer practitioners.

Yes it is, good catch. However, if we’re talking about optimizers generally, Ranger is the main one ported into the library; it uses RAdam + LookAhead and has a special fit function associated with it as well (fit_flat_cos). It was also used to get SOTA on ImageNette/Woof. Also, LookAhead utilizes mini-batch training (not wanting to get too off topic).

I’ll also add that you can still use a batch size finder even here, since you’re still just looking at the losses over time, so it will find a larger batch size (it functions similarly to lr_find()). Even with your mini-batches I think it should still work; otherwise you’d simply modify it to account for the optimizer’s mini-batches.

Just to make sure we are talking about the same thing: I’m not referring to which batch size is most effective for training. I’m saying that I can only fit a couple of images at a time into my GPU before exceeding its memory, so I need the code to accumulate gradients before performing an update step. Any thoughts on where the best place is to make such a change?

PyTorch Lightning has accumulated gradients:

Accumulated gradients runs K small batches of size N, summing their gradients, before doing an optimizer step. The effect is a large effective batch size of K×N.

https://pytorch-lightning.readthedocs.io/en/latest/training_tricks.html#accumulate-gradients
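
In Lightning it’s just a flag on the Trainer, something like this (a sketch, assuming you already have a LightningModule called MyModel and a DataLoader called train_dl):

```python
import pytorch_lightning as pl

# Accumulate gradients over 4 batches before each optimizer step,
# giving an effective batch size of 4 x N.
trainer = pl.Trainer(accumulate_grad_batches=4)
trainer.fit(MyModel(), train_dl)
```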

I think fastai tried this approach but it had some side effects (source)

edit: added source

Ah! Well there’s now an accumulate gradients callback! :wink:

https://dev.fast.ai/callback.training#GradientAccumulation
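
Minimal usage sketch (dls and the value 64 are placeholders; I’m assuming n_acc counts the number of samples to accumulate before each update, per the docs linked above, and the import path may be fastai2 depending on your install):

```python
from fastai.vision.all import *

# Accumulate gradients until ~64 samples have been processed, then step the
# optimizer, so bs=8 in the DataLoaders gives an effective batch size of 64.
learn = cnn_learner(dls, resnet34, metrics=accuracy,
                    cbs=GradientAccumulation(n_acc=64))
learn.fit_one_cycle(1)
```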

Thanks! This is what I was looking for.

When I use GradientAccumulation, the training loss numbers I get are in the millions (and less than 1 when not using it), so I’m not sure it is currently working, unless there is some trick to using it properly. It seems like it should be extremely straightforward from the documentation. Any thoughts?