BatchNormalization(axis = 1) when used on convolutional layers

(mattobrien415) #1

Hi everyone,

Jeremy snuck in another assignment for us, on the video for Lecture 3 here.

The question is: why do you need to specify axis = 1 as an argument to the BatchNormalization layer when using it in conjunction with a convolutional layer?

It looks like the reason is that we need to make sure that we do batch normalization on the channels themselves.

Apparently it is possible to do normalization along any dimension of the image!

So, if you set 1 as the value for the axis argument, then you are telling Keras to do batch normalization on the channels.

If you forget this, you would instead be using the default argument of -1. It’s not explicitly clear from the documentation what -1 does – but I think it might normalize your data by columns.

In the video Jeremy implies that this is a great opportunity to get a deep understanding of batch normalization – so maybe there’s more to the story?


(Jeremy Howard) #2

I’m so glad you took up this challenge - I’ve been hoping someone would have a look at this! :slight_smile:

The axis argument is the axis for which the mean and standard deviation are calculated. That is, it averages over every axis except the one specified. So ‘axis=-1’ means to calculate the mean and stddev for each item in the last axis.
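This behaviour is easy to verify with a small NumPy sketch (NumPy here just mimics which axis Keras keeps statistics for; it is not the actual Keras implementation):

```python
import numpy as np

# Toy activations: (batch, channels, height, width)
x = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)

# axis=1: keep axis 1, average over every other axis,
# giving one mean/stddev per channel.
reduce_axes = tuple(i for i in range(x.ndim) if i != 1)
mean = x.mean(axis=reduce_axes)
std = x.std(axis=reduce_axes)
print(mean.shape)  # (3,) -- one statistic per channel
```

So `axis` names the dimension whose statistics are kept separate; all other dimensions are averaged over.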

Your comments all look good, but I have a followup question for you: why only normalize axis=1? Why not normalize over all of axes 1, 2, and 3?

And if you want a really interesting question: what would it take to add batchnorm to an existing pre-trained model that was trained without it (like VGG)?

(Eric Perbos-Brinck) #3

I found this reply from François Chollet, Keras’ author, on GitHub.

It’s very simple:

For Dense layers, all RNN layers and most other types of layers, the default of axis=-1 is what you should use,
For Convolution2D layers with dim_ordering=“th” (the default), use axis=1,
For Convolution2D layers with dim_ordering=“tf”, use axis=-1 (i.e. the default).
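All three rules amount to pointing axis at the channels dimension. A quick NumPy check of the two Convolution2D cases (NumPy stands in for the reduction Keras performs):

```python
import numpy as np

n, c, h, w = 8, 3, 5, 5

# "th" ordering: (batch, channels, height, width) -> channels at axis 1
x_th = np.random.randn(n, c, h, w)
stats_th = x_th.mean(axis=(0, 2, 3))   # average over everything but axis 1

# "tf" ordering: (batch, height, width, channels) -> channels at axis -1
x_tf = np.random.randn(n, h, w, c)
stats_tf = x_tf.mean(axis=(0, 1, 2))   # average over everything but axis -1

# Both choices yield one statistic per channel:
print(stats_th.shape, stats_tf.shape)  # (3,) (3,)
```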

This is slightly confusing: is “th” referring to Theano and “tf” to TensorFlow, so that the axis value changes based on the backend used by Keras?

(Jeremy Howard) #4

It’s the different default dimension ordering. For Theano, the channel is the 2nd dim; for TensorFlow, it’s the last dim.

(Edited with correction from @skottapa)

(Satish Kottapalli) #5


I thought it was the opposite. For Theano it is (batches, channels, width, height), while for TensorFlow it is (batches, width, height, channels).


(Corbin Albert) #6

You are correct, Skottapa


Combined with what EricPB mentioned: a Dense layer should use axis=-1, which means calculating the mean and stddev for each item in the last axis (height for Theano ordering), and a convolution layer should use axis=1, which means calculating the mean and stddev for each item in the 1st axis (the channel for Theano ordering). What’s the purpose behind this? Is it some kind of convention?

(Irshad Muhammad) #8

Great explanation, guys. I’d like to combine all of the points.
We want to normalize the channels of the image:
For Theano, the image ordering is (Batch_Size, Channels, Width, Height)
For TensorFlow, the image ordering is (Batch_Size, Width, Height, Channels)
In Theano, to normalize the channels, we need to specify their axis, which is 1; in TensorFlow the channels are the last axis, so we use -1.

(XiaokunXu) #9

Good summary. I think the short description on the Keras documentation page, “axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format=“channels_first”, set axis=1 in BatchNormalization.”, is not explicit enough.

After reading the paper [Arxiv], page 3, section 3, the authors mention “for a layer with d-dimensional input, we will normalize each dimension… where mu and sigma are computed over the training dataset (or batch)”. This is also what @jeremy said: “average over every axis except the one specified by this argument”.

A concrete example: if a layer input is [100 x 16 x 16 x 3], meaning 100 examples of 16x16-pixel images with three color channels (or, deeper in the network, feature-map channels), we will compute one mu and sigma per channel, averaged across the 100 examples/pictures and all 256 pixel positions (16x16). In other words, the only dimension whose statistics are kept separate is the color channel. As @irshaduetian listed, in TensorFlow this is the last dimension, therefore we use axis=-1; in Theano it is the second dimension, so we use axis=1 (because Python indexing is zero-based).
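That concrete example can be written out in NumPy (a sketch of the normalization step only, ignoring the learned gamma/beta parameters and the epsilon term that Keras adds for numerical stability):

```python
import numpy as np

# 100 images, 16x16 pixels, 3 channels ("tf" ordering)
x = np.random.randn(100, 16, 16, 3) * 5.0 + 2.0

# axis=-1: one statistic per channel, averaged over examples and pixels
mu = x.mean(axis=(0, 1, 2))     # shape (3,)
sigma = x.std(axis=(0, 1, 2))   # shape (3,)
x_norm = (x - mu) / sigma

# After normalization, each channel has ~zero mean and ~unit stddev
print(np.allclose(x_norm.mean(axis=(0, 1, 2)), 0.0, atol=1e-7))  # True
print(np.allclose(x_norm.std(axis=(0, 1, 2)), 1.0, atol=1e-7))   # True
```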