@jeremy Thank you, all of this is incredibly helpful. I wanted to clarify something: above you suggest applying BatchNorm before the dropout, so I went through the notebooks to try to find a pattern and understand where to use it.
In Lesson3.ipynb, where we first use BatchNorm, I can see that we apply it after the Dense layer and relu activation, but also after the dropout:
# You should put BatchNorm after Conv layers, and Dense layers as well
def get_bn_layers(p):
    return [
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(p),
        BatchNormalization(),
        Dense(4096, activation='relu'),
        Dropout(p),
        BatchNormalization(),
        Dense(1000, activation='softmax')
    ]
On the other hand, in mnist.ipynb, we apply it after the Dense layer with the non-linear activation (relu) but before the dropout, as you recommend in this thread:
def get_model_bn_do():
    model = Sequential([
        Lambda(norm_input, input_shape=(1,28,28)),
        Convolution2D(32,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(32,3,3, activation='relu'),
        MaxPooling2D(),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        BatchNormalization(axis=1),
        Convolution2D(64,3,3, activation='relu'),
        MaxPooling2D(),
        Flatten(),
        BatchNormalization(),
        Dense(512, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),
        Dense(10, activation='softmax')
    ])
    model.compile(Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model
What is the difference between the two? Is the Lesson3 usage (after dropout) an exception due to the pretrained VGG activations? As a rule of thumb, is it always safe to place BatchNorm before the dropout?
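To make sure I am describing the two orderings correctly, here is a minimal side-by-side sketch in the same Keras style as above (the layer sizes, input shapes, and dropout rates are placeholders I picked for illustration, not the actual notebook values):

from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

# Lesson3-style ordering: Dense + relu -> Dropout -> BatchNorm
after_dropout = Sequential([
    Dense(4096, activation='relu', input_shape=(25088,)),
    Dropout(0.5),
    BatchNormalization(),
    Dense(1000, activation='softmax')
])

# mnist-style ordering: Dense + relu -> BatchNorm -> Dropout
before_dropout = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    BatchNormalization(),
    Dropout(0.5),
    Dense(10, activation='softmax')
])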
Note: I am not sure how reliable it is, but on this link a couple of people recommended using BatchNorm between the linear output and the non-linear activation, if I understand correctly. Any thoughts?
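If I am reading that suggestion correctly, the ordering they describe would look something like this (again just a rough sketch with placeholder sizes, using an explicit Activation layer so the Dense output stays linear):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Activation, Dropout

# "BatchNorm between the linear output and the non-linearity":
# the Dense layer has no activation of its own, BatchNorm normalizes the
# pre-activation, and relu is applied explicitly afterwards.
pre_activation_bn = Sequential([
    Dense(512, input_shape=(784,)),  # linear output only
    BatchNormalization(),            # normalize the pre-activation
    Activation('relu'),
    Dropout(0.5),
    Dense(10, activation='softmax')
])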