Why do we normalize with imagenet stats when fine-tuning?

In fastbook Chapter 7 (pg 242) they discuss normalization and imagenet_stats. They say:

This means that when you distribute a model, you need to also distribute the statistics used for normalization, since anyone using it for inference, or transfer learning, will need to use the same statistics. By the same token, if you’re using a model that someone else has trained, make sure you find out what normalization statistics they used, and match them.

I don’t understand why we should use imagenet_stats as opposed to “The mean and standard deviation of the new dataset we’re fine-tuning on”.

My impression is that we want to fine-tune against zero-mean, unit-STD images, since that’s what the original model was trained against. But using imagenet_stats does not necessarily give us zero-mean, unit-STD images; using the mean and STD of the new dataset would.
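To make the point concrete, here is a small numpy sketch (the "dataset" is synthetic, and the single-channel stats are hypothetical stand-ins): normalizing a dataset whose pixel statistics differ from ImageNet’s with ImageNet’s stats does not produce zero-mean, unit-STD inputs, while normalizing with the dataset’s own stats does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "new dataset": one channel, pixels brighter and less varied
# than ImageNet (whose per-channel means are ~0.45 and stds ~0.23).
pixels = rng.normal(loc=0.7, scale=0.1, size=100_000)

imagenet_mean, imagenet_std = 0.485, 0.229

# Normalizing with ImageNet stats: the result is NOT zero-mean / unit-std.
z_imagenet = (pixels - imagenet_mean) / imagenet_std
print(z_imagenet.mean(), z_imagenet.std())  # mean ~0.94, std ~0.44

# Normalizing with the dataset's own stats: the result IS zero-mean / unit-std.
z_own = (pixels - pixels.mean()) / pixels.std()
print(z_own.mean(), z_own.std())  # ~0.0, ~1.0
```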

That said, my experience has been that it either doesn’t matter, or that imagenet_stats gives me better results! What I don’t understand is why that is.


It’s because we use a pretrained model. When doing so, the stats you should use are the ones the model’s pretrained weights were trained against, so your inputs match what it saw. When training from scratch, yes, we should compute new statistics; but when transfer learning we shouldn’t, as those pretrained weights were based on a particular dataset.

Also, the mean and std of a dataset aren’t usually 0 and 1. They’re data-specific, so each dataset’s mean and std will always be slightly off. (You may be getting that confused with how we want the weights in the model to behave during training to keep it stable.)


I think there may be a communication gap here. When using or fine-tuning a pretrained model, it makes sense to adjust the new dataset’s statistics to match the statistics of the data used to train the original model. Josh seems to be saying that the same adjustments made to the original source data will not apply to the new data. That seems to be true; for example, the new images may come from a different camera. You would need to make different adjustments to your new data to have it match the training data of the pretrained model.

IOW, the new data should be adjusted to match the statistics of the transformed data used to train the original model. From this perspective, the statistics of unaltered imagenet itself would be irrelevant.

At least, that’s how I understand the issue. Feel free to make corrections.
🙂


Agreed (and what I tried to say up there, at least 🙂)


So I think I’ve managed to convince myself to use imagenet_stats during fine-tuning for a couple of reasons. Some feel kind of hand-wavey but here’s what I’m thinking:

  1. We typically fine-tune on “natural images” that are somewhat similar to ImageNet (at least compared to medical images or artificially generated images). Since we are training against natural images, we would like stats that best capture the mean and standard deviation of pixels found in natural images. ImageNet has millions of images, so its statistics are probably a better estimate of that mean and standard deviation than my little dataset of 10,000 natural images.

  2. When we pretrain on ImageNet (millions of images) and then fine-tune on our dataset (thousands of images), the bulk of training examples seen by our network were from ImageNet. If I had to choose one set of statistics to use, it seems like using the statistics that represent most of the images we’re training against would make sense.
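For comparison, computing your own dataset’s per-channel statistics (the alternative discussed in point 1) is straightforward. A minimal sketch, assuming images are stacked into an (N, H, W, C) array with values in [0, 1]; the batch here is random placeholder data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical batch of 32 RGB images, shape (N, H, W, C), values in [0, 1].
batch = rng.random((32, 64, 64, 3))

# Per-channel mean and std over all images and pixel positions --
# the same shape of statistic as fastai's imagenet_stats.
channel_mean = batch.mean(axis=(0, 1, 2))
channel_std = batch.std(axis=(0, 1, 2))
print(channel_mean, channel_std)  # two arrays of 3 values each
```

For a real dataset too large to hold in memory, you would accumulate these statistics batch by batch instead.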

Also the mean and std of the dataset aren’t usually 0 and 1. It’s data specific, so each mean and std of the dataset will always be slightly off (You may be getting that confused with how we want the weights to be in the model during training to keep it stable)

Yeah, sorry, I didn’t mean that the mean and STD of the dataset are 0 and 1; I meant that after subtracting the mean and dividing by the STD, the resulting mean and STD will be 0 and 1.


I am revisiting this topic because I recently fine-tuned a pretrained resnet with new data, and had to get this right.

Looking at the fastai code, it seems that
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

are the per-channel (RGB) means and standard deviations of ImageNet itself. These are applied as adjustments intended to make the training input have mean=0 and sd=1. (It is annoying that there is no documentation in the code explaining the intent and purpose of these numbers. If I have misinterpreted their meaning, please reply.)
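Assuming that interpretation is right, the transform these numbers feed is just (x − mean) / std applied per channel (in fastai it is set up via `Normalize.from_stats(*imagenet_stats)`). A plain-numpy sketch of that arithmetic, using the constants above; the gray test image is only for illustration:

```python
import numpy as np

# fastai's constants: per-channel (RGB) mean and std of ImageNet.
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

def normalize(img, stats=imagenet_stats):
    """Apply (x - mean) / std per channel; img has shape (H, W, 3), values in [0, 1]."""
    mean, std = (np.asarray(s, dtype=np.float32) for s in stats)
    return (img - mean) / std  # broadcasts over the channel axis

# A mid-gray image lands near, but not exactly at, zero after normalization,
# because gray 0.5 is not ImageNet's per-channel mean.
gray = np.full((4, 4, 3), 0.5, dtype=np.float32)
print(normalize(gray)[0, 0])  # ~[0.066, 0.196, 0.418]
```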

In this light, Josh’s original point is correct, and the paragraph from Chapter 7 is confusing as written. I don’t see any reason to apply the ImageNet corrections to new data just because the resnet was pretrained on ImageNet adjusted by them. The relevant fact is that the resnet was pretrained with mean=0, sd=1 inputs. You could make an argument for applying imagenet_stats for single-image inference, when you have no clue about the population stats, or even for transfer learning when your dataset is small (waving hands wildly). But in most cases, I think you would want to adjust your training image dataset to mean=0, sd=1, ignoring imagenet_stats altogether.

Sorry if I am being repetitious, but I feel it’s an important point that needs to be clarified. Feel free to correct any of my own misunderstandings.

🙂