I came across this paper almost a year ago while working on a Kaggle competition, although I didn’t have the time to implement it.
I was wondering if anyone has tried to compare its results against the 1cycle training regime. The researchers at Google claim that a training regime where we increase the batch size during training …
[…] reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate ϵ and scaling the batch size B∝ϵ.
Using gradient accumulation this should be possible also for GPUs with a moderate amount of memory.
Intuitively, I think this makes sense and looks like a logical extension of the 1cycle policy: at every mini-batch, increase not only the learning rate but also the batch size, so that you can take bigger steps with more confidence.
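To make the idea concrete, here is a minimal sketch of what such a schedule might look like. This is just my own illustration, not code from the paper: the function name, the phase `factors`, and the specific numbers are all assumptions. The point is that the learning rate ε and the batch size B scale together (B ∝ ε), and the larger batch is realized on a memory-limited GPU by accumulating gradients over fixed-size micro-batches before each optimizer step.

```python
def batch_lr_schedule(base_lr, base_batch, micro_batch, factors):
    """Sketch of a joint lr/batch-size schedule (B proportional to lr).

    For each phase, both the learning rate and the effective batch size
    are multiplied by the same factor; the effective batch is realized
    via gradient accumulation over `micro_batch`-sized chunks.
    Names and values here are illustrative assumptions, not from the paper.
    """
    phases = []
    for factor in factors:
        lr = base_lr * factor
        batch = base_batch * factor
        # number of micro-batches whose gradients are summed before
        # each parameter update (gradient accumulation steps)
        accum_steps = batch // micro_batch
        phases.append({"lr": lr, "batch": batch, "accum_steps": accum_steps})
    return phases

# Example: start at lr=0.1 with an effective batch of 256 on a GPU that
# only fits micro-batches of 64, doubling both lr and batch each phase.
for phase in batch_lr_schedule(0.1, 256, 64, factors=[1, 2, 4]):
    print(phase)
```

In a real training loop you would call `loss.backward()` (or the equivalent) on each micro-batch and only step the optimizer every `accum_steps` micro-batches, so memory usage stays constant while the effective batch size and parameter-update count follow the schedule.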
What are your thoughts?
P.S.: I need to find the time to implement this paper