It did not take long to realize that the first time I read the paper I barely looked at the results. It is not the easiest paper to understand, as the authors do not give many details about the underlying math, nor do they clearly explain their implementation.
I ran into a problem when trying to implement this: the authors seem to assume a multi-GPU setup for computing the statistics we are looking for.
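For context, those statistics boil down to estimates of the true gradient norm and the gradient noise, which the paper obtains by comparing gradient norms measured at two different batch sizes (in their case, per-GPU vs. aggregated gradients). A minimal sketch of that estimator, with the function name and wiring being my own illustration rather than anything from the thread:

```python
import numpy as np

def noise_scale_estimates(g_small, g_big, b_small, b_big):
    """Estimate |G|^2 and tr(Sigma) from two gradient estimates computed
    at batch sizes b_small and b_big (e.g. one micro-batch gradient vs.
    the gradient accumulated over several micro-batches on a single GPU).

    The simple noise scale is then roughly s / g2. Note that these are
    noisy, unbiased estimates: g2 can come out negative on a single
    measurement, which is why smoothing over many steps is needed.
    """
    sq_small = np.dot(g_small, g_small)  # |G_small|^2
    sq_big = np.dot(g_big, g_big)        # |G_big|^2
    g2 = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    s = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return g2, s
```

On a single GPU, one plausible route is gradient accumulation: treat each micro-batch gradient as the small-batch measurement and the accumulated average as the large-batch one.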
I have tried three different ways to work around this, but none of them really works.
Update: I found the solution on my own, managed to implement the paper, and did a little testing on the Rossmann stores dataset. I found a 4x speedup using a batch size of 512 instead of 64!
I wrote a Medium article to share the results; I would be glad to hear your feedback.
Thank you! It's a bit tricky to use, as the curve is quite unstable due to all the approximation involved. My advice is to increase beta, e.g. from 0.99 to 0.999, if the curve is too bumpy.
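The beta mentioned here is the coefficient of an exponential moving average over the noisy per-step estimates; a higher beta averages over more steps and so gives a smoother (but laggier) curve. A small self-contained sketch of that kind of debiased EMA, just to illustrate the knob:

```python
def ema_smooth(values, beta=0.999):
    """Debiased exponential moving average of a sequence.

    Raising beta (e.g. 0.99 -> 0.999) smooths a bumpy curve at the cost
    of reacting more slowly to real changes. The division by
    (1 - beta**i) corrects the bias toward zero in the early steps.
    """
    avg, out = 0.0, []
    for i, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        out.append(avg / (1 - beta ** i))  # bias-corrected average
    return out
```

With a constant input the corrected average reproduces the constant from the first step, which is exactly what the bias correction buys you.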
But please keep me updated
I think it works rather well on tabular data at least, but I will have to try it with images and NLP. In any case, you might run into CUDA out-of-memory errors if you use too big a batch size ^^
@DanyWin, could you encapsulate your work in a bs_finder-style callback and put it in a GitHub repository? That would make it much easier to test and reuse.