Hi there,
I just thought the group might be interested in what I observed with different batch sizes. Batch sizes of 64 or so seem to be typical, but much larger batch sizes fit in memory just fine, and run much, much faster:
The first model.fit from the IMDB notebook: with a batch size of 64, each epoch takes 7 seconds to train.
When epochs are 7 seconds each, it’s not an enormous win (even though it’s 7x faster), but when they are minutes each, the speedup is really helpful.
Basically, with small models, it seems you can use very large batch sizes given the 12GB of RAM in the P2 GPUs and get really good performance.
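For a rough sense of why small models leave so much headroom: if activation memory grows roughly linearly with batch size, you can back out an upper bound on batch size from the GPU’s capacity. This is just a back-of-the-envelope sketch, not from the post; the 2 MB-per-sample figure and the 20% reserve for weights/optimizer state are illustrative assumptions.

```python
# Back-of-the-envelope estimate of the largest batch that fits in GPU memory.
# Assumes activation memory scales linearly with batch size, which is a
# simplification; bytes_per_sample is an assumed figure, not measured.
def max_batch_for_memory(gpu_bytes, bytes_per_sample, reserve=0.2):
    # Leave some headroom for weights, optimizer state, and framework overhead.
    usable = gpu_bytes * (1 - reserve)
    return int(usable // bytes_per_sample)

# 12 GB P2 GPU, hypothetical 2 MB of activations per sample:
print(max_batch_for_memory(12 * 1024**3, 2 * 1024**2))  # → 4915
```

With those (made-up) numbers, batches in the thousands fit comfortably, which matches the observation that small models tolerate very large batch sizes on a 12GB card.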
The downside, however, is that there’s an equivalent decrease in the number of gradient updates that can be done in that time. If you’re careful about your optimizer details you may be able to benefit from the increased gradient stability and need far fewer gradient updates, but in general most research shows batch sizes >128 are less time efficient.
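The trade-off above is easy to quantify: the number of gradient updates per epoch is just the dataset size divided by the batch size (rounded up). A quick sketch using the IMDB training set size of 25,000 reviews; the batch sizes are illustrative:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates (mini-batches) performed in one epoch."""
    return math.ceil(n_samples / batch_size)

# 25,000 samples, as in the IMDB training set.
for bs in (64, 128, 1024, 4096):
    print(f"batch_size={bs:>5}: {updates_per_epoch(25_000, bs)} updates/epoch")
```

Going from a batch size of 64 to 4096 cuts the updates per epoch from 391 to 7, so even if each epoch is much faster, the model may need many more epochs to see the same number of weight updates.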
The specific example you showed is an interesting one - that’s basically just a linear model, so you can use large batches and high learning rates. I suspect you won’t find the same good results with deeper networks, however.
Having said that, do give it a try, and report back on your findings!
When I raise the learning rate to 0.1, learning stops altogether. So, as you suspected, per-epoch time goes down, but the number of epochs needed goes up. As you predicted, 128 seems to be a pretty good sweet spot, at least for this model.
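One common heuristic for picking a starting learning rate when changing batch size (not something established in this thread, just a widely used rule of thumb) is linear scaling: multiply the learning rate by the same factor as the batch size, then tune from there. A minimal sketch, with illustrative base values:

```python
# Linear-scaling heuristic: if the batch size grows by a factor k, try
# growing the learning rate by k as a starting point. The base_lr and
# base_batch values below are assumptions for illustration.
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.01, 64, 128))   # doubling the batch doubles the lr
print(scaled_lr(0.01, 64, 1024))  # 16x the batch -> 16x the lr
```

This is only a starting point: as the failed jump to 0.1 above shows, scaling the learning rate too aggressively can stop learning entirely, so it still needs to be validated empirically.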