Starting this thread to discuss best practices. I came across an interesting article on problems you can run into with batch norm, in particular when your mini-batches don’t represent the distribution of your entire dataset. Leaving it here in case others find it useful.
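To make the mini-batch issue concrete, here is a rough NumPy sketch (mine, not from the article). It assumes a made-up two-class dataset with one feature, and `batch_norm` is a simplified training-mode normalization without the learned scale and shift. It compares a shuffled batch with a batch drawn from only one class:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature using only this batch's mean and variance
    (simplified training-mode batch norm, no learned scale/shift)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)

# Hypothetical dataset: two classes whose single feature is centred at -2 and +2.
class_a = rng.normal(-2.0, 1.0, size=(512, 1))
class_b = rng.normal(2.0, 1.0, size=(512, 1))
data = np.concatenate([class_a, class_b])

# Representative batch: sampled after shuffling the whole dataset.
mixed_batch = data[rng.permutation(len(data))[:64]]
# Unrepresentative batch: every example comes from class A (e.g. unshuffled data).
skewed_batch = class_a[:64]

dataset_mean = data.mean()
mixed_gap = abs(mixed_batch.mean() - dataset_mean)
skewed_gap = abs(skewed_batch.mean() - dataset_mean)
print(f"batch mean vs dataset mean: mixed={mixed_gap:.3f}, skewed={skewed_gap:.3f}")

# Normalizing the skewed batch with its own statistics re-centres it at 0,
# so the offset that marks these examples as class A is no longer visible.
skewed_normed_mean = batch_norm(skewed_batch).mean()
print(f"mean of skewed batch after batch norm: {skewed_normed_mean:.2e}")
```

In this toy setup the skewed batch's statistics sit far from the dataset's, so what the layer sees during training would not match running averages gathered over representative batches. Shuffling before batching is the usual first thing to check.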