Papers: Batch size, learning rate, batch norm, generalizability

I wanted to share some interesting, related papers on batch size, learning rate, batch norm, and generalizability:

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Don’t Decay the Learning Rate, Increase the Batch Size

Rethinking ImageNet Pre-training

How Does Batch Normalization Help Optimization?
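
The title of "Don't Decay the Learning Rate, Increase the Batch Size" is itself the recipe: at each milestone where you would normally divide the learning rate, grow the batch size instead, keeping the SGD noise scale (proportional to lr / batch size) roughly constant. A minimal sketch of that idea, where the `schedule` helper, milestone steps, and base values are all made up for illustration:

```python
def schedule(step, milestones=(30, 60), base_lr=0.1, base_batch=256):
    """Return (lr, batch_size) for two equivalent schedules at a given step.

    decay_lr:   the classic recipe, divide the learning rate by 10 per milestone.
    grow_batch: the paper's alternative, multiply the batch size by 10 instead.
    Both keep the ratio lr / batch_size the same, so the gradient noise scale
    is matched between them.
    """
    k = sum(step >= m for m in milestones)  # number of milestones passed
    decay_lr = (base_lr / 10**k, base_batch)
    grow_batch = (base_lr, base_batch * 10**k)
    return decay_lr, grow_batch

# After the first milestone, one schedule has shrunk lr, the other grew the batch:
print(schedule(45))  # ((0.01, 256), (0.1, 2560)) up to float rounding
```

The practical appeal is that the larger batches in the late phase parallelize well across GPUs, so the same number of epochs takes fewer parameter updates and less wall-clock time.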

Corollary food for thought:
Slow vs. fast convergence: what is the effect on final generalizability?
Optimal batch size: single GPU (batch norm statistics) vs. multi-GPU (mean/sum of gradients across devices) vs. gradient accumulation: is there a benefit to each?
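
On the gradient-accumulation part of that question: for losses that average over examples, averaging per-micro-batch gradients reproduces the full-batch gradient exactly, so accumulation is a drop-in way to emulate a large batch on one GPU. A toy NumPy check of that identity (hypothetical linear least-squares model, sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))  # full batch of 32 examples, 4 features
y = rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # Gradient of the mean-squared error 0.5 * mean((Xb @ w - yb)**2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient in one shot
g_full = grad(X, y, w)

# Gradient accumulation: 4 micro-batches of 8, then average
micro = [grad(X[i:i + 8], y[i:i + 8], w) for i in range(0, 32, 8)]
g_accum = np.mean(micro, axis=0)

print(np.allclose(g_full, g_accum))  # True: the two gradients match
```

The caveat, and where the batch-norm question bites, is that batch norm breaks this equivalence: its normalization statistics are computed per micro-batch (or per device), so a model with batch norm behaves differently under accumulation or multi-GPU splitting than under one genuinely large batch.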