Images "Normalization"

(Tuatini GODARD) #1

I usually divide RGB images by 255 to normalize them before feeding them to a CNN.
I have seen people subtract the mean pixel instead, as VGG does ([103.939, 116.779, 123.68]).
I wanted to ask: when should we choose normalization over mean subtraction? Is there any performance difference between the two?
And finally: why would we ever want to do either instead of keeping the image values between 0 and 255? (As far as I know they are already on the same “scale”.)
I have seen people ask about this on the forums, but I didn’t find any decent answer.
Thanks for your help.

(Pietz) #2

there are three common techniques for value normalization:

  1. (x - x.min()) / (x.max() - x.min()) # values from 0 to 1
  2. 2*(x - x.min()) / (x.max() - x.min()) - 1 # values from -1 to 1
  3. (x - x.mean()) / x.std() # unbounded range, but mean at 0 and standard deviation at 1

You’re doing pretty much the first one, except that dividing by 255 assumes the values start at 0, which they don’t necessarily do. That’s why subtracting the min is always a good idea. The second approach is very similar, except that its range is centered at 0.

If VGG really does it this way, they are essentially doing the first part of the third technique, which centers the mean at zero. Dividing by the standard deviation afterwards is always a good idea to put your values on the same scale.

As far as I know, the cleanest normalization is the third one, because it’s the only one that centers the mean at 0, which helps a lot with exploding or vanishing gradients. That being said, I’ve never found myself in a situation where using the third technique instead of the first gave me better performance.
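The three techniques listed above can be sketched in NumPy as follows. This is only a minimal illustration: the function names and the tiny 2x2 "image" are made up for the example, and none of the functions guard against a constant image, where the denominator would be zero.

```python
import numpy as np

def min_max_01(x):
    """Technique 1: rescale values to the range [0, 1]."""
    x = x.astype(np.float64)
    return (x - x.min()) / (x.max() - x.min())

def min_max_pm1(x):
    """Technique 2: rescale values to the range [-1, 1]."""
    return 2.0 * min_max_01(x) - 1.0

def standardize(x):
    """Technique 3: shift to zero mean and scale to unit standard deviation."""
    x = x.astype(np.float64)
    return (x - x.mean()) / x.std()

# Hypothetical 2x2 single-channel image with 8-bit values.
img = np.array([[0, 51], [102, 255]], dtype=np.uint8)

a = min_max_01(img)    # smallest value maps to 0, largest to 1
b = min_max_pm1(img)   # smallest value maps to -1, largest to 1
c = standardize(img)   # mean 0, standard deviation 1
```

Note that for an 8-bit image that actually contains both a 0 and a 255 pixel, the first technique gives the same result as dividing by 255.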

(Ben Johnson) #3

Do you usually do the mean/standardization per channel or for the whole image?

E.g., like

(x - x.mean()) / x.std()

or like

# per channel; for a single image this assumes a float array shaped (channels, height, width)
x -= x.mean(axis=(1, 2), keepdims=True)
x /= x.std(axis=(1, 2), keepdims=True)
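To make the difference between the two variants concrete, here is a small sketch. It assumes a single image stored channels-first as (channels, height, width); the random data and variable names are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical channels-first image: (channels, height, width).
x = rng.uniform(0, 255, size=(3, 4, 4))

# Whole-image statistics: one mean and one std shared by all channels.
whole = (x - x.mean()) / x.std()

# Per-channel statistics: one mean and one std for each channel.
mean_c = x.mean(axis=(1, 2), keepdims=True)  # shape (3, 1, 1)
std_c = x.std(axis=(1, 2), keepdims=True)    # shape (3, 1, 1)
per_channel = (x - mean_c) / std_c
```

With the whole-image version the channels keep their relative offsets to each other; with the per-channel version every channel ends up with mean 0 and standard deviation 1 on its own.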

(Pietz) #4

To be honest I use the first method most of the time, but when I use the mean centering I’d do it by image. Channelwise normalisation can mess up the visual representation of an image and that’s always a pain in the butt for sanity checks.

Since BatchNorm became popular, normalization isn’t such a big deal IMHO. I just read a Kaggle winner interview where even a regression problem that predicted values in the range of high hundreds didn’t benefit from normalization.

It may be a big deal for highly optimized and very deep CNNs though.

EDIT: Well, that was a load of horse crap. When computing the mean and standard deviation, you want to compute them over the entire training set and then use those same mean and std values for the validation and test sets. Sorry about that.
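A minimal sketch of what the EDIT describes: the statistics come from the training set only and are reused unchanged for other splits. The array shapes (channels-last) and the choice of per-channel statistics are assumptions for this example; a single scalar mean and std over the whole training set would follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data, channels-last: (num_images, height, width, channels).
train = rng.uniform(0, 255, size=(8, 4, 4, 3))
test = rng.uniform(0, 255, size=(2, 4, 4, 3))

# Per-channel statistics computed from the training set only.
mean = train.mean(axis=(0, 1, 2))  # shape (3,)
std = train.std(axis=(0, 1, 2))    # shape (3,)

# The same training-set statistics are applied to every split.
train_n = (train - mean) / std
test_n = (test - mean) / std
```

The normalized test set will generally not have exactly zero mean or unit standard deviation, and that is expected: it is transformed with the training statistics, not its own.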


So, you didn’t answer this question by @Ekami: “And finally: Why would we ever want to do that instead of keeping the values of the images between 0 and 255? (as far as I know they are on the same “scale”)”. I am also confused about this question.

(Pietz) #6

I think the last paragraph in my initial statement answers this very precisely.

(Pietz) #7

But sure, I’ll elaborate. Normalization means two things:

  • Putting the data on the same scale (Scaling)
  • Balancing the data around a point (Centering)

Now, you could do one without the other, but each brings its own specific benefits.

  • Scaling improves convergence speed and accuracy
  • Centering fights vanishing and exploding gradients, while probably also increasing convergence speed and accuracy

You have to understand that a neural network does a crapload of computations during training. It was a huge problem in the past to keep this stable and not have weights going to zero or towards infinity. Actually, it’s still a problem.
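One way to see why the input scale matters is to look at a toy case. The sketch below is a single linear neuron with a squared-error loss, not a CNN, and all names and values are made up for illustration. The gradient with respect to the weights is proportional to the input, so raw 0-255 pixels produce much larger weight updates than the same pixels divided by 255.

```python
import numpy as np

def grad_wrt_weights(x, w, y):
    """Gradient of 0.5 * (w @ x - y)**2 with respect to w."""
    return (w @ x - y) * x

# Hypothetical pixel values in [0, 255] and the same values scaled to [0, 1].
x_raw = np.array([0.0, 128.0, 255.0])
x_scaled = x_raw / 255.0

w = np.zeros(3)  # weights initialised to zero for simplicity
y = 1.0          # arbitrary target

g_raw = grad_wrt_weights(x_raw, w, y)
g_scaled = grad_wrt_weights(x_scaled, w, y)

# At w = 0 the raw-input gradient is exactly 255 times the scaled one.
ratio = np.linalg.norm(g_raw) / np.linalg.norm(g_scaled)
```

The exact factor of 255 only holds at this zero initialisation. In a real network the effect also depends on the initialisation, the learning rate, and any normalization layers.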


Thanks. Your explanation is great.

(Anand Saha) #9

As mentioned in cs231n regarding data pre-processing, as far as CNNs are concerned, mean image subtraction is enough (which centres the values at 0). Normalization is not required, since all values have the same scale to begin with (0-255).

Mean subtraction (with implicit normalisation in images) helps in faster convergence. Batch Normalisation is an extension of it, where all layers receive normalised data instead of only the first one. I recommend reading the cs231n notes mentioned above.
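For reference, mean subtraction with the VGG values quoted in the first post could look like the sketch below. It subtracts a per-channel mean pixel, which is a simpler variant than the full mean image. The random image is hypothetical, and the sketch relies on the fact that the original Caffe VGG models list these means in BGR order, so an RGB image is flipped to BGR first.

```python
import numpy as np

# Mean pixel quoted in the first post, in BGR order as used by the Caffe VGG models.
VGG_MEAN_BGR = np.array([103.939, 116.779, 123.68])

rng = np.random.default_rng(0)
# Hypothetical channels-last RGB image with 8-bit values: (height, width, 3).
rgb = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Reverse the channel axis (RGB -> BGR) and convert to float before subtracting.
bgr = rgb[..., ::-1].astype(np.float64)

# Centre each channel; the values keep their 0-255 spread and are not rescaled.
centered = bgr - VGG_MEAN_BGR
```

The result is roughly centred around 0 but still spans a range about 255 wide, which matches the point that only centring, not rescaling, is applied here.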