Order of layers in model

drscotthawley · February 15, 2020, 7:25pm

I realize this is an old thread, but given that it appears near the top of Google results on such a topic, and that the above reply doesn’t even attempt to answer the question of ordering, I want to leave this here:

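An important point is that monotonically non-decreasing activation functions (ReLU, sigmoid, tanh, …) commute with max-pooling: for any such f, f(max(a, b)) = max(f(a), f(b)). This means that the order does not change the result. So you might as well save some time and do the pooling first, thereby reducing the number of elements the activation has to process. Average-pooling is different: it only commutes exactly with linear functions, so there the order does change the output (for example, the ReLU of the average of -1 and 1 is 0, while the average of their ReLUs is 0.5).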
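Here is a quick sanity check of the pooling claim, as a minimal sketch in plain Python. The helpers `relu`, `max_pool` and `avg_pool` are my own toy 1-D versions (non-overlapping windows of size 2), not any library's API. It compares both orders for max-pooling on random inputs, and also tries a single average-pooling window as a counterexample.

```python
import random

def relu(x):
    return max(0.0, x)

def max_pool(xs, k=2):
    # Toy non-overlapping 1-D max-pooling with window size k.
    return [max(xs[i:i + k]) for i in range(0, len(xs) - k + 1, k)]

def avg_pool(xs, k=2):
    # Toy non-overlapping 1-D average-pooling with window size k.
    return [sum(xs[i:i + k]) / k for i in range(0, len(xs) - k + 1, k)]

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(8)]

# Max-pooling: activation first (8 relu calls) vs pooling first (4 relu calls).
act_then_max = max_pool([relu(x) for x in xs])
max_then_act = [relu(x) for x in max_pool(xs)]
max_commutes = act_then_max == max_then_act

# Average-pooling: one window is enough to show the two orders differ.
pair = [-1.0, 1.0]
avg_then_act = [relu(x) for x in avg_pool(pair)]
act_then_avg = avg_pool([relu(x) for x in pair])

print(max_commutes, avg_then_act, act_then_avg)
```

With 2x2 pooling on images the same argument applies per window, and pooling first means the activation only runs on a quarter of the elements.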

Same thing goes for batch norm…to an extent: Whether you put it before or after your activation is a matter of some opinion, but putting it before or after MaxPooling will make very little difference on the accuracy – yet will affect the speed.

Similarly for Dropout: zeroing out units commutes with any activation f for which f(0)=0, such as ReLU and tanh. One caveat is that the usual “inverted” dropout also rescales the surviving units by 1/(1-p) during training. ReLU commutes with that rescaling exactly, whereas tanh does not, so for tanh the order does change the output. Doing Dropout before or after BN will make a small difference, but for large layers (or not-too-much dropout) the difference will be negligible. For large dropout rates and a small number of neurons, you’ll see some variability depending on the ordering. Dropout before or after pooling? As you noted, it usually appears after pooling.
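To make the Dropout point concrete, here is a small sketch in plain Python. The `dropout` helper is a toy of my own that takes a fixed mask, so both orders see the same dropped units, and the `rescale` flag is an assumption added to separate the masking from the 1/(1-p) rescaling that training-mode dropout normally applies.

```python
import math

def relu(x):
    return max(0.0, x)

def dropout(xs, mask, p, rescale=True):
    # Toy dropout with an explicit 0/1 mask; kept units are optionally
    # rescaled by 1/(1-p), as in the usual "inverted" dropout.
    scale = 1.0 / (1.0 - p) if rescale else 1.0
    return [x * m * scale for x, m in zip(xs, mask)]

def same(a, b):
    return all(math.isclose(u, v, abs_tol=1e-12) for u, v in zip(a, b))

xs = [-1.5, -0.2, 0.3, 2.0]
mask = [1, 0, 1, 1]
p = 0.5

# ReLU satisfies relu(c*x) = c*relu(x) for c > 0, so it commutes even with rescaling.
relu_commutes = same(
    [relu(x) for x in dropout(xs, mask, p)],
    dropout([relu(x) for x in xs], mask, p),
)

# tanh(0) = 0, so the masking alone commutes ...
tanh_mask_only_commutes = same(
    [math.tanh(x) for x in dropout(xs, mask, p, rescale=False)],
    dropout([math.tanh(x) for x in xs], mask, p, rescale=False),
)

# ... but tanh(2x) != 2*tanh(x), so the rescaled version does not.
tanh_rescaled_commutes = same(
    [math.tanh(x) for x in dropout(xs, mask, p)],
    dropout([math.tanh(x) for x in xs], mask, p),
)

print(relu_commutes, tanh_mask_only_commutes, tanh_rescaled_commutes)
```

So f(0)=0 is enough for the zeroing step, while the rescaling step additionally needs f(c·x) = c·f(x) for positive c, which ReLU has and tanh does not.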

This commutativity property is one reason why you’ll sometimes see layers ordered differently: because it may not affect the results. But it can affect execution time!

Note that BN and ReLU do not commute, and people’s choices seem to vary on which they do first. For more on that, see Sylvain’s reply on this related thread: Where should I place the batch normalization layer(s)?, where he notes that the fastai default is to follow ResNet and do BN before ReLU.
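A tiny illustration of why BN and ReLU do not commute, again as a sketch in plain Python. The `batch_norm` here is a simplified stand-in of my own (one feature, statistics over the batch, no learned scale or shift), not the actual fastai or PyTorch layer.

```python
import math

def relu(x):
    return max(0.0, x)

def batch_norm(xs, eps=1e-5):
    # Simplified batch norm for a single feature: subtract the batch mean
    # and divide by the batch standard deviation (no learned affine part).
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

batch = [-2.0, -1.0, 0.5, 3.0]

bn_then_relu = [relu(x) for x in batch_norm(batch)]
relu_then_bn = batch_norm([relu(x) for x in batch])

# BN -> ReLU: output is non-negative, with zeros where the input was below the mean.
# ReLU -> BN: output has zero mean, so it must contain negative values.
print(bn_then_relu)
print(relu_then_bn)
```

The two orders feed differently distributed values to the next layer (non-negative vs zero-mean), which is why it is worth trying both on your own problem.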

But other authors will do differently. For example, in this post on why the idea that BN cures internal covariate shift is a myth, it’s noted that “it has been found in practice that applying batch norm after the activation yields better results.” For them, for their problem. Try reversing the order on your problem, and use whatever works best.

What about non-monotonic activations like Mish? I haven’t tried. Mish is still close to monotonic for most inputs; it just has that little “dip” to the left of zero, so swapping it with max-pooling is no longer exact and will affect some results. My intuition suggests that you could still put it after pooling and save time, but check with those who do this.
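To see where the “dip” bites, here is a rough sketch in plain Python using the standard definition mish(x) = x · tanh(softplus(x)). The two pooling windows are values I picked by hand: one straddling the dip and one on the increasing side.

```python
import math

def mish(x):
    # Mish: x * tanh(softplus(x)), with softplus(x) = log(1 + exp(x)).
    return x * math.tanh(math.log1p(math.exp(x)))

# Window to the left of zero, straddling the dip: the two orders disagree.
neg_window = [-3.0, -1.0]
pool_then_mish_neg = mish(max(neg_window))
mish_then_pool_neg = max(mish(x) for x in neg_window)

# Window on the increasing part of Mish: the two orders agree.
pos_window = [0.5, 2.0]
pool_then_mish_pos = mish(max(pos_window))
mish_then_pool_pos = max(mish(x) for x in pos_window)

print(pool_then_mish_neg, mish_then_pool_neg)
print(pool_then_mish_pos, mish_then_pool_pos)
```

The mismatch only shows up when every value in a window is fairly negative, and its size is bounded by the depth of the dip (roughly 0.3), which fits the intuition that pooling first is usually a small perturbation rather than an exact equivalence.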

EDIT: By the way, these and other things I learned from a great post that Jeremy once shared: https://myrtle.ai/how-to-train-your-resnet-8-bag-of-tricks/