Jeremy snuck in another assignment for us, on the video for Lecture 3 here.
The question is, why do you need to specify axis=1 as a parameter to the BatchNormalization layer when using it in conjunction with a convolutional layer?
It looks like the reason is that we need to make sure that we do batch normalization on the channels themselves.
Apparently it is possible to do normalization along any dimension of the image!
So, if you set 1 as the value for the axis argument, you are telling Keras to do batch normalization on the channels.
If you forget this, you would instead be using the default, -1. It’s not clear from the documentation what the argument -1 does, but I think it might go ahead and normalize your data by columns.
On the video Jeremy implies that this is a great opportunity to get a deep understanding of batch normalization – so maybe there’s more to the story?
I’m so glad you took up this challenge - I’ve been hoping someone would have a look at this!
The axis argument is the axis for which the mean and standard deviation are calculated. That is, it averages over every axis except the one specified. So ‘axis=-1’ means to calculate the mean and stddev for each item in the last axis.
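To make the "averages over every axis except the one specified" rule concrete, here is a minimal NumPy sketch. It is not Keras code: `batchnorm_stats` is a hypothetical helper written for illustration, and the (batch, channels, rows, cols) layout of `x` is an assumption.

```python
import numpy as np

def batchnorm_stats(x, axis):
    """Mean and std over every axis except `axis` (hypothetical helper)."""
    axis = axis % x.ndim  # turn -1 into the index of the last axis
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    return x.mean(axis=reduce_axes), x.std(axis=reduce_axes)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 5, 5))  # assumed layout: (batch, channels, rows, cols)

mean_c, std_c = batchnorm_stats(x, axis=1)   # one value per channel
mean_l, std_l = batchnorm_stats(x, axis=-1)  # one value per item in the last axis
print(mean_c.shape, mean_l.shape)
```

With axis=1 there is one mean per channel. With axis=-1 on this layout there is one mean per column index instead, which is probably not what you want after a channels-first convolution.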
Your comments all look good, but I have a followup question for you: why only normalize axis=1? Why not normalize over all of axes 1, 2, and 3?
And if you want a really interesting question: what would it take to add batchnorm to an existing pre-trained model that was trained without it (like VGG)?
I found this reply from François Chollet, Keras’ author, on GitHub.
It’s very simple:
For Dense layer, all RNN layers and most other types of layers, the default of axis=-1 is what you should use,
For Convolution2D layers with dim_ordering=“th” (the default), use axis=1,
For Convolution2D layers with dim_ordering=“tf”, use axis=-1 (i.e. the default).
This is slightly confusing: is “th” referring to Theano and “tf” to TensorFlow, so that the axis sign is reversed based on the backend used by Keras?
It’s the different default dimension ordering. For Theano, channel is the 2nd dim; for TensorFlow, it’s the last dim.
(Edited with correction from @skottapa)
I thought it was the opposite. For Theano it is (batches,channels,width,height) while for Tensorflow it is (batches,width,height,channels).
You are correct, Skottapa
Combined with what EricPB mentioned, the dense layer should use axis=-1, which means calculating the mean and stddev for each item in the last axis (height for Theano), and the convolution layer should use axis=1, which means calculating the mean and stddev for each item in the 1st axis (channel for Theano). What’s the purpose behind this? Is this just a convention?
A great explanation, guys. I would like to combine all of the points:
We want to normalize the channel in image:
For Theano image, ordering is (Batch_Size, Channels, Width, Height)
For TensorFlow image, ordering is (Batch_Size, Width, Height, Channels)
In Theano, to normalize the channels, we need to specify the channel axis, which is 1.
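As a quick sanity check of the summary above, here is a small NumPy sketch (not Keras itself; the array shapes are made up). It shows that the same data stored in the two orderings gives the same per-channel statistics, as long as the channel axis is the one left out of the average.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same made-up data in both layouts.
x_th = rng.normal(size=(4, 3, 6, 6))     # Theano-style: channels on axis 1
x_tf = np.transpose(x_th, (0, 2, 3, 1))  # TensorFlow-style: channels on axis -1

# Average over everything except the channel axis in each layout.
mean_th = x_th.mean(axis=(0, 2, 3))  # keeps axis 1
mean_tf = x_tf.mean(axis=(0, 1, 2))  # keeps axis -1

print(np.allclose(mean_th, mean_tf))
```

So axis=1 in one ordering and axis=-1 in the other are the same operation on the channels. Only the position of the channel axis differs.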
Good Summary. I think the short description on keras documentation page “axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format=“channels_first”, set axis=1 in BatchNormalization.” is not explicit enough.
After reading the paper [Arxiv] page 3 section 3, the author mentioned “for layer with d-dimensional input, we will normalize each dimension… where mu and sigma are computed over the training dataset (or batch)”. This is also what @jeremy said: “average over every axis except the one specified by this argument”.
A concrete example: if a layer input is [100 x 16 x 16 x 3], meaning 100 examples of 16x16 pixels with three color channels (or other kernel channels), we compute one mu and one sigma for each of the 3 channels, pooling all 256 pixels (16x16) across the 100 examples/pictures. In other words, the only dimension not being averaged over is the color channel. As @irshaduetian listed, in TensorFlow this is the last dimension, so we use axis=-1. In Theano it is the second dimension, so we use axis=1 (because Python indexing is zero-based).
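Here is that [100 x 16 x 16 x 3] example as a rough NumPy sketch of the normalization step only (no learned scale and shift). The random data and the value of `eps` are assumptions for illustration, not something taken from Keras.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=(100, 16, 16, 3))  # channels last
eps = 1e-3  # assumed small constant to avoid division by zero

# Average over batch, rows and cols: everything except the channel axis.
mu = x.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, 3)
var = x.var(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, 3)

x_hat = (x - mu) / np.sqrt(var + eps)       # broadcasts over the first 3 axes

print(mu.shape)
print(x_hat.mean(axis=(0, 1, 2)))  # close to 0 for every channel
print(x_hat.std(axis=(0, 1, 2)))   # close to 1 for every channel
```

Each of the 3 means is computed from 100 * 16 * 16 = 25600 values.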
Which axis should I use if I have the following architecture?
input_shape: (30, 64, 64) # This means short video files of 30 gray-scale images. I manage the 30 gray-scale images as the channels of the network. This is a must for my problem.
I’m using generators with a batch size of 1024 due to memory problems, so each epoch is trained with (1024, 30, 64, 64) mini batches.
My question is: should I use axis=1 in the first two BatchNormalization layers and axis=-1 in the last one?
Thank you in advance.
I think the batchnorm should be applied with axis=1 for the first two layers, since that is where the channels are. My guess is that once you flatten it and pass it through the dense layer, it is a 2D tensor (bs, number of nodes in the dense layer). So the batchnorm should still be applied with axis=1?
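For the dense-layer part, a small NumPy sketch (the shape (1024, 128) is made up) suggests that axis=1 and axis=-1 point at the same axis once the tensor is 2D, so either choice gives one mean per dense unit:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(1024, 128))  # assumed shape: (batch size, dense units)

axis_one = 1 % x.ndim    # 1
axis_last = -1 % x.ndim  # also 1, because the tensor has only two axes

def stats_except(x, axis):
    """Mean over every axis except `axis` (hypothetical helper)."""
    reduce_axes = tuple(i for i in range(x.ndim) if i != axis)
    return x.mean(axis=reduce_axes)

mean_one = stats_except(x, axis_one)
mean_last = stats_except(x, axis_last)
print(axis_one == axis_last, mean_one.shape)
```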
But 30 channels!?
I’m still learning, but do people put a BatchNorm layer just before the very final activation layer, ‘softplus’?
The answer to your question could be to initialize the two trainable parameters so that the output with the batchnorm layer added is the same as without it. The two parameters can be seen as de-normalizing the batch-normalized data back to its original scale.
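A minimal NumPy sketch of that idea, assuming the mean and variance are fixed statistics of the layer's activations (for example, computed once over the training data). The activations here are random stand-ins, and `batchnorm`, `gamma`, `beta` and `eps` are illustrative names rather than Keras internals.

```python
import numpy as np

def batchnorm(x, mean, var, gamma, beta, eps):
    """Per-channel batchnorm for channels-first input (illustrative only)."""
    shape = (1, -1, 1, 1)  # broadcast the per-channel vectors over the other axes
    x_hat = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + eps)
    return gamma.reshape(shape) * x_hat + beta.reshape(shape)

rng = np.random.default_rng(4)
acts = rng.normal(loc=3.0, scale=4.0, size=(32, 8, 7, 7))  # stand-in activations
eps = 1e-3  # assumed small constant

mean = acts.mean(axis=(0, 2, 3))
var = acts.var(axis=(0, 2, 3))

# Pick the scale and shift that undo the normalization.
gamma = np.sqrt(var + eps)
beta = mean.copy()

out = batchnorm(acts, mean, var, gamma, beta, eps)
print(np.allclose(out, acts))
```

With this initialization the new layer starts out as an identity, so the pre-trained weights after it still see the inputs they were trained on. If the statistics are recomputed per mini-batch during training, the match would only be approximate.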
My followup question is: let’s say you have intermediate processed data of shape [batchsize, size1, size2, channels]. Are we trying to normalize over batchsize, i.e. the mean/variance is calculated as objects of size [size1, size2, channels], or are we trying to normalize over [batchsize, size1, size2]? What would be the axis value in TensorFlow to do each of these two things?