Hi there,
I just thought the group might be interested in what I observed with different batch sizes. Batch sizes of 64 or so seem to be typical, but much larger batch sizes fit in memory just fine, and run much, much faster:
The first model.fit from the IMDB notebook: with a batch size of 64, each epoch takes 7 seconds to train.
When epochs are 7 seconds each, it’s not an enormous win (even though it’s 7x faster), but when they are minutes each, the speedup is really helpful.
Basically, with small models, it seems you can use very large batch sizes given the 12GB of RAM in the P2 GPUs and get really good performance.
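For a rough sense of why small models leave so much headroom: if activation memory grows roughly linearly with batch size, you can back out an upper bound on batch size from the GPU’s capacity. This is just a back-of-the-envelope sketch, not from the post; the 2 MB-per-sample figure and the 20% reserve for weights/optimizer state are illustrative assumptions.

```python
# Back-of-the-envelope estimate of the largest batch that fits in GPU memory.
# Assumes activation memory scales linearly with batch size, which is a
# simplification; bytes_per_sample is an assumed figure, not measured.
def max_batch_for_memory(gpu_bytes, bytes_per_sample, reserve=0.2):
    # Leave some headroom for weights, optimizer state, and framework overhead.
    usable = gpu_bytes * (1 - reserve)
    return int(usable // bytes_per_sample)

# 12 GB P2 GPU, hypothetical 2 MB of activations per sample:
print(max_batch_for_memory(12 * 1024**3, 2 * 1024**2))  # → 4915
```

With those (made-up) numbers, batches in the thousands fit comfortably, which matches the observation that small models tolerate very large batch sizes on a 12GB card.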
The downside, however, is that there’s an equivalent decrease in the number of gradient updates that can be done in that time. If you’re careful about your optimizer details you may be able to benefit from the increased gradient stability and need far fewer gradient updates, but in general most research shows batch sizes >128 are less time efficient.
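The trade-off above is easy to quantify: the number of gradient updates per epoch is just the dataset size divided by the batch size (rounded up). A quick sketch using the IMDB training set size of 25,000 reviews; the batch sizes are illustrative:

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of gradient updates (mini-batches) performed in one epoch."""
    return math.ceil(n_samples / batch_size)

# 25,000 samples, as in the IMDB training set.
for bs in (64, 128, 1024, 4096):
    print(f"batch_size={bs:>5}: {updates_per_epoch(25_000, bs)} updates/epoch")
```

Going from a batch size of 64 to 4096 cuts the updates per epoch from 391 to 7, so even if each epoch is much faster, the model may need many more epochs to see the same number of weight updates.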
The specific example you showed is an interesting one - that’s basically just a linear model, so you can use large batches and high learning rates. I suspect you won’t find the same good results with deeper networks, however.
Having said that, do give it a try, and report back on your findings!
When I raise the learning rate to 0.1, learning stops altogether. So, as you suspected, per-epoch time goes down, but the number of epochs needed goes up. As you predicted, 128 seems to be a pretty good sweet spot, at least for this model.
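One common heuristic for picking a starting learning rate when changing batch size (not something established in this thread, just a widely used rule of thumb) is linear scaling: multiply the learning rate by the same factor as the batch size, then tune from there. A minimal sketch, with illustrative base values:

```python
# Linear-scaling heuristic: if the batch size grows by a factor k, try
# growing the learning rate by k as a starting point. The base_lr and
# base_batch values below are assumptions for illustration.
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * (new_batch / base_batch)

print(scaled_lr(0.01, 64, 128))   # doubling the batch doubles the lr
print(scaled_lr(0.01, 64, 1024))  # 16x the batch -> 16x the lr
```

This is only a starting point: as the failed jump to 0.1 above shows, scaling the learning rate too aggressively can stop learning entirely, so it still needs to be validated empirically.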