I am using Keras to train some simple MLP networks. I have seen cases where a large batch size prevents convergence even when I increase the number of epochs. If you search the web for this, you will find that most people claim the opposite, but a few sources support my finding.
The cause seems to be that training with large batch sizes can get stuck in local optima.
My question is whether I can decrease the learning rate and still use larger batch sizes for faster training. This is the trade-off I am exploring.
But I would appreciate anything on this… Also, I am not sure whether the gradient is actually being smoothed in some way that prevents convergence, or whether it is the lack of noise (the jumping around) that causes it.
While decreasing the batch size does work, training with larger batches is faster, and most sources claim one should train with as large a batch as possible. So my question is whether there is some other option to achieve convergence rather than simply decreasing the batch size. I have tried increasing the epochs and decreasing the learning rate, but decreasing the batch size for these simple MLPs seems to work better in every case.
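For reference, the kind of mini-batch SGD dynamics I am asking about can be sketched with plain NumPy (a toy one-weight regression problem, not my actual Keras model — the data and numbers here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2x + noise (hypothetical, just for illustration)
X = rng.normal(size=(1024, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=1024)

def sgd(batch_size, lr, epochs=50):
    """Plain mini-batch SGD on mean-squared error for a single weight."""
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            # Gradient of mean((w*x - y)^2) with respect to w on this batch
            grad = 2.0 * np.mean((X[b, 0] * w - y[b]) * X[b, 0])
            w -= lr * grad
    return w

# Small batches take many noisy steps per epoch; large batches take
# few, smooth steps. This is the noise-vs-step-count trade-off I mean.
w_small = sgd(batch_size=32, lr=0.1)
w_large = sgd(batch_size=512, lr=0.1)
```

On this convex toy problem both settings converge, which is exactly why it does not reproduce my issue — I include it only to pin down what I mean by "smoothing" versus "jumping around" in a non-convex MLP loss surface.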