I am working on the NYC Taxi Fare challenge on Kaggle and accidentally found out that a small batch size can give a better result. Since I think this is quite new and hasn't been discussed in the forum yet, I'm creating a topic on it. (Most topics about reducing the batch size are about the model not fitting in memory.)
I found the information here. It says that with a large batch, training goes straight to a sharp minimum, which leads to poorer generalization. In contrast, with a small batch, the update direction oscillates, so the final point tends to land in a flat minimum, which is good for generalization (the explanation is similar to why we use SGDR).
My test with batch sizes 512 and 128 is below:
But small batch sizes have the inconvenience of being slow to train. So I'm thinking of a method where we train the first epoch with a large batch size to quickly get to a 'quite OK' model, then continue with a smaller batch size (I think it is similar to Jeremy's approach for computer vision, where we first train on low-resolution images). Unfortunately, I have no idea how to do that: I can't find a batch_size attribute on the learner.
I hope someone can comment on this idea and on how to implement it.
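To make the idea concrete: as far as I can tell, the learner doesn't expose the batch size directly, so the switch probably has to happen by rebuilding the data object with a smaller `bs` and handing it back to the learner (I'm not sure of the exact fastai call, so that part is an assumption). The schedule itself is framework-agnostic, though. Here is a minimal plain-Python sketch of "large batch first, small batch after" on a toy 1-D regression (y = 3x), where switching batch size just means re-batching the dataset; all names and numbers are illustrative:

```python
import random

# Toy data: y = 3*x plus a little noise (illustrative values only).
random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1024)]
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in xs]

def batches(dataset, bs):
    """Shuffle the dataset and yield mini-batches of size bs."""
    idx = list(range(len(dataset)))
    random.shuffle(idx)
    for i in range(0, len(idx), bs):
        yield [dataset[j] for j in idx[i:i + bs]]

def train_epoch(w, dataset, bs, lr=0.3):
    """One epoch of mini-batch SGD on squared error for the model y = w*x."""
    for batch in batches(dataset, bs):
        # d/dw of mean((w*x - y)^2) over the batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad
    return w

w = 0.0
w = train_epoch(w, data, bs=512)      # first epoch: large batches, few fast steps
for _ in range(3):
    w = train_epoch(w, data, bs=64)   # later epochs: smaller, noisier steps
print(round(w, 2))                    # w should end up near 3.0
```

The only thing that changes between phases is the `bs` argument to the batching function, which is exactly the knob I'd like to turn on the fastai data object between calls to fit.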