When retraining pretrained ImageNet classifiers for semantic segmentation, I often see people normalize the input image with mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225]. However, in my own experiments, skipping normalization entirely makes little difference to either training time or final accuracy. Has anybody run controlled experiments showing that normalizing is better than not normalizing?
Not really an answer to your question, but making sure that you don’t get vanishing gradients is a lot more important. As long as the mean and std of the activations stay fairly consistent throughout your layers (meaning you initialized them well), it matters a whole lot less what the std and mean of the input are, assuming they aren’t unreasonably small.
Have you looked at the std and mean of your layers?
I’m doing transfer learning. The only new layer is the final classifier. Is it still worth looking at the std and mean of the input to the final classifier in this case?
Well, the problem that looking at the std and the mean solves is this:
When you have a model with lots of layers, you want to make sure that you get good (i.e. not extremely small) gradients for ALL layers of the network. If I leave the weights and biases unchecked at the beginning of training, this can happen: say the weights are small enough that each layer scales its input by about 0.2. That means that the inputs, as they ‘travel’ through the network, shrink by a factor of 0.2 at every layer. For networks with many layers, the inputs can keep shrinking until practically nothing is left. Consequently the gradients are going to be small for a good bit of the training, until the network gets out of that tough spot. This is a bad scenario, and your neural network (assuming you’re using an nn here) will be sad. You can fix this with a good initialization. Basically, the mean and standard deviation are symptoms that can tell you something is wrong. They aren’t solutions. The real task is initializing things so that those values hit reasonable targets.
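To make the shrinking concrete, here is a minimal sketch in plain Python. It assumes purely linear layers with no activation functions and Gaussian weights, which is a simplification of a real network; `layer_stds` and its `gain` parameter are names made up for this illustration.

```python
import math
import random
import statistics

def layer_stds(depth, width, gain, seed=0):
    """Return the std of the activations after each of `depth` linear layers.

    Weights are drawn from N(0, (gain / sqrt(width))^2), so each layer
    scales the activation std by roughly `gain` (an assumption of this toy
    setup, not a property of any particular framework).
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    sigma = gain / math.sqrt(width)
    stds = []
    for _ in range(depth):
        # One dense layer: every output is a random weighted sum of all inputs.
        x = [sum(rng.gauss(0.0, sigma) * xi for xi in x) for _ in range(width)]
        stds.append(statistics.pstdev(x))
    return stds

shrinking = layer_stds(depth=10, width=64, gain=0.2)  # each layer scales by ~0.2
preserved = layer_stds(depth=10, width=64, gain=1.0)  # each layer keeps the scale

print(f"gain 0.2, std after 10 layers: {shrinking[-1]:.2e}")
print(f"gain 1.0, std after 10 layers: {preserved[-1]:.2e}")
```

With a per-layer factor of 0.2 the signal is down to roughly 0.2^10 (about 1e-7) of its original scale after ten layers, while the scaled initialization keeps it near 1.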
Anyways, I’m guessing you knew all of that. As you can see, good initialization is critical at the beginning of training, but once the network is on the right track, it does its ‘own thing’. So there is no need to regulate your pre-trained layers. However, you want to make sure your new layers don’t have this problem, so initializing them well and keeping an eye on those stats (std and mean) is important for them. Just make sure your standard deviation isn’t shrinking relative to the last pre-trained layer, and you should be fine.
Example: the outputs of my last pre-trained layer give a std of 0.4. I want to make sure that number isn’t shrinking unreasonably as inputs pass through the new layers. If you only have, like, one new layer, you’ll be fine.
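A rough way to run that check, sketched in plain Python with made-up data standing in for real activations. The helper `std_ratio` and the `MIN_RATIO` threshold are hypothetical; in practice you would feed in activations captured from your own model.

```python
import random
import statistics

def std_ratio(reference, candidate):
    """Std of the new layer's outputs divided by the std of the pre-trained outputs."""
    return statistics.pstdev(candidate) / statistics.pstdev(reference)

rng = random.Random(0)
# Stand-in for activations from the last pre-trained layer (std of about 0.4).
pretrained_out = [rng.gauss(0.0, 0.4) for _ in range(10_000)]
# Stand-in for a badly initialized new layer that scales its input down by 0.05.
new_layer_out = [0.05 * v for v in pretrained_out]

ratio = std_ratio(pretrained_out, new_layer_out)
MIN_RATIO = 0.1  # hypothetical alert threshold; tune it for your own model
if ratio < MIN_RATIO:
    print(f"std ratio {ratio:.2f} is below {MIN_RATIO}; check the new layer's init")
```

A ratio near 1 means the new layer is passing the signal along at about the same scale; a ratio far below 1 is the symptom described above.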
Side-note: you mentioned this is transfer learning. You should make it a point to find out how the pre-trained model was trained. If input normalization was used in its training, I would recommend doing the same. But it really shouldn’t matter that much.
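For reference, this is what the normalization from the question does to a single pixel. It is a sketch that assumes the usual convention of first scaling 8-bit values to [0, 1] and then standardizing each RGB channel; `normalize_pixel` is a name invented for this example, so check what your particular pre-trained model actually expects.

```python
# Channel statistics quoted in the question (RGB order, for inputs in [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    scaled = [c / 255.0 for c in rgb_uint8]  # 0..255 -> 0..1
    return [(c - m) / s for c, m, s in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]

white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print("white:", [round(v, 3) for v in white])
print("black:", [round(v, 3) for v in black])
```

White comes out at about [2.249, 2.429, 2.640] and black at about [-2.118, -2.036, -1.804], so normalized inputs live roughly in [-2.1, 2.6] instead of [0, 1]. That is a shift and a modest rescale, not a change of orders of magnitude.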
Hope this helps!
Thanks. Your explanation helps a lot!
That’s really perceptive. Really articulately written, thank you for doing that.