For large images or a small GPU, the batch size may need to be much smaller than what is best for training. Does anyone know of any functionality that could create an “effective batch size” to compensate for this?
You could run forward and backward passes on several sub-minibatches, accumulating gradients, and only then do optimizer.step(). I think it would be equivalent to using a single larger minibatch in the usual way.
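As a sanity check on that equivalence claim, here’s a minimal pure-Python sketch (toy scalar linear model, made-up data and learning rate, plain SGD only): accumulating the per-sub-minibatch mean gradients and averaging them gives the same update as one step over the full batch.

```python
# Toy demonstration that accumulating gradients over sub-minibatches and
# stepping once is equivalent to one step on the full batch (plain SGD).
# Model: scalar linear regression y = w*x, loss = mean squared error.
# All names and numbers here are illustrative, not from any library.

def grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) w.r.t. w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
lr, w0 = 0.01, 0.5

# One step on the full batch of size 4.
w_full = w0 - lr * grad(w0, xs, ys)

# Same step via two sub-minibatches of size 2: accumulate, average, step.
g_acc = 0.0
for i in range(0, 4, 2):
    g_acc += grad(w0, xs[i:i+2], ys[i:i+2])  # per-sub-batch mean gradient
w_accum = w0 - lr * (g_acc / 2)              # divide by number of sub-batches

print(abs(w_full - w_accum) < 1e-12)  # → True
```

Note the division by the number of sub-batches: without it, the accumulated step would be larger by that factor, which is one common source of surprises with this technique.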
Yeah, this is what I was thinking of doing if there doesn’t appear to be anything built into the library already.
If it isn’t already, it probably will be at some point, such as in a future part 2 or 3. It was done last year in part 2 of the course (I think). If you are using a smaller GPU, I think it is fine to use a smaller batch size until we get to that point. When you do multi-GPU training, a lot of the time you are effectively splitting batches across multiple GPUs, so no one GPU has enough space for the entire batch anyway. Getting the very best results generally requires very large batch sizes across multiple GPUs, though this is not needed in many cases.
In fastai v1 there was a batch size finder; I’ve been itching to port it over to v2. Here’s the Medium article for anyone interested:
Just to clarify, as that article is based on older information (2018): isn’t larger-batch-size training now more acceptable? (https://arxiv.org/pdf/1904.00962.pdf + last year’s fastai)
Just wanted to check, as I think the first tweet in that article is no longer true from my understanding, and I wouldn’t want to crush the dreams and aspirations of newer practitioners.
Yes it is, good catch. However, if we’re exploring optimizers generally, Ranger is the main one ported into the library; it uses RAdam + LookAhead and has a special fit function associated with it as well (fit_flat_cos). It was also used to get SOTA on ImageNette/Woof. LookAhead also utilizes mini-batch training (not wanting to get too off topic).
I’ll also add that you can still use a batch size finder here: you’re still just looking at the losses over time, so it will find a larger batch size (it functions similarly to lr_find()). Even with your mini-batches I think it should still work; otherwise you’d simply modify it to incorporate the optimizer’s mini-batches.
Just to make sure we are talking about the same thing: I’m not referring to what batch size is most effective for training. I’m saying that I can only fit a couple of images at a time into my GPU before exceeding its memory, so I need the code to accumulate gradients before performing an update step. Any thoughts on where the best place is to make such a change?
PyTorch Lightning has accumulated gradients
Accumulated gradients runs K small batches of size N before doing an optimizer step. The effect is a larger effective batch size of K×N.
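A sketch of that K-sub-batches control flow, using counters as stand-ins for the backward pass and optimizer step (K, N, and the data are made-up illustrative values, not tied to any particular framework):

```python
# Control-flow sketch of gradient accumulation: run K small batches of
# size N, accumulating gradients each time, and take one optimizer step
# per K sub-batches. Counters stand in for backward() and step().

K, N = 4, 2                      # accumulate over K sub-batches of size N
data = list(range(K * N * 3))    # enough samples for 3 effective batches

steps = 0            # would-be optimizer.step() calls
grad_calls = 0       # would-be loss.backward() calls
accumulated = 0      # sub-batches since the last step

for i in range(0, len(data), N):
    sub_batch = data[i:i + N]
    grad_calls += 1              # backward pass here: gradients add up
    accumulated += 1
    if accumulated == K:         # an effective batch of K*N samples seen
        steps += 1               # optimizer step, then zero the gradients
        accumulated = 0

print(steps, grad_calls)  # → 3 12
```

So with 24 samples, N = 2 gives 12 backward passes but only 3 parameter updates, each over an effective batch of K×N = 8 samples.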
I think fastai tried this approach, but it had some side effects (source)
edit: added source
Ah! Well there’s now an accumulate gradients callback!
Thanks! This is what I was looking for.
When I use GradientAccumulation, the training loss numbers I get are in the millions (and less than 1 when not using it), so I’m not sure it is currently working. Unless there is some trick to using it properly? It seems like it should be extremely straightforward from the documentation. Any thoughts?
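One thing worth checking (an assumption on my part, not something confirmed from the docs): some gradient-accumulation implementations sum the loss over the accumulated samples rather than averaging it, which inflates the reported number by roughly the effective batch size. A toy illustration with invented per-sample losses:

```python
# Illustration of how a sum reduction inflates a reported loss relative
# to a mean reduction. The per-sample losses below are made up.

per_sample_losses = [0.7, 0.4, 0.9, 0.6] * 256  # pretend batch of 1024

mean_loss = sum(per_sample_losses) / len(per_sample_losses)  # what you usually log
summed_loss = sum(per_sample_losses)                         # ~1024x larger

print(round(mean_loss, 2), round(summed_loss))  # → 0.65 666
```

If something like this is happening, the model may be training fine and only the logged number is scaled; comparing validation metrics with and without accumulation should tell you.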