# Calculating our own image stats (imagenet_stats, cifar_stats etc.)

I’m interested in calculating my own image stats for normalization. I think this is worth trying because my images (medical) come from a different distribution than the CIFAR and ImageNet datasets. The values of these stats in fastai are:

• `imagenet_stats`: `([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`
• `cifar_stats`: `([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])`
• `mnist_stats`: `([0.15, 0.15, 0.15], [0.15, 0.15, 0.15])`

If I’m reading this correctly, these are the means and standard deviations of each RGB channel.

I have tried to calculate my own channel means and standard deviations which result in:

`([177.51475147222848, 138.6587881454333, 178.55802581840624], [54.9887577728556, 60.98731138460821, 71.97104127673265])`

Obviously mine are not scaled properly. Do I just divide all the values by `255`? That would give:

`([0.6961362802832489, 0.5437599535115032, 0.7002275522290441], [0.21564218734453175, 0.23916592699846356, 0.2822393775558143])`

Am I calculating these correctly?


Yes. Image pixels are divided by 255 when they are first loaded from disk, then they are normalized. (The 255 part is correct; I did not look at your notebook.)
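To make that pipeline concrete, here is a minimal sketch of the two steps (scale by 255, then normalize per channel), using the `imagenet_stats` values quoted above and a synthetic image:

```python
import numpy as np

# Synthetic 8-bit RGB image standing in for a loaded file
img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

mean = np.array([0.485, 0.456, 0.406])  # imagenet_stats means
std = np.array([0.229, 0.224, 0.225])   # imagenet_stats standard deviations

x = img / 255.0        # step 1: pixels are scaled to [0, 1] on load
x = (x - mean) / std   # step 2: normalize each RGB channel
```

So stats you compute yourself should indeed be on the [0, 1] scale, i.e. divided by 255 if you accumulated raw 8-bit pixel values.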

However, see Jeremy’s argument that imagenet_stats, not the dataset stats, should be applied whenever using a model that was pretrained with imagenet.

I’m not convinced this argument applies when training on something as “unnatural” as medical images the likes of which the model has never seen. To me, it still makes sense to normalize those images to the same stats as the model saw after imagenet was normalized. However… in my actual experiments (with Kaggle Histopathological training data), normalizing by imagenet_stats vs. the full dataset stats made no appreciable difference to the model’s accuracy and ROCAUC. I would be very interested to hear what you discover.

BTW, you can just add `.normalize()` (without parameters), and fastai automatically normalizes images per batch using the mean and SD of the batch.


For the same competition I tried both and imagenet_stats worked better for me. The difference was not that much but imagenet_stats gave me a better rank.


Is there a faster way to do so?

I have about 1700 images and it’s already taking a LOT of time.

I've seen the following implementation in another post:

```python
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

data_dir = "xxxxx"  # your image directory
TRAIN = Path(data_dir)
images = (plt.imread(str(i)) for i in TRAIN.iterdir())  # generator expression
images = np.stack(images)  # this takes time
means = np.mean(images, axis=(0, 1, 2))
stds = np.std(images, axis=(0, 1, 2))
```

I don't know if it is faster, but the problem is that my train directory contains subfolders for the classes, so `(plt.imread(str(i)) for i in TRAIN.iterdir())` doesn't work.

Late reply but in case anyone else is looking for a solution to quickly collect statistics I’ve created some code to do this at https://gist.github.com/thomasbrandon/ad5b1218fc573c10ea4e1f0c63658469.
You can run it off any iterator that returns tensors. Something like:

```python
>>> DATA = untar_data(URLs.MNIST_SAMPLE)
... src = (ImageList.from_folder(DATA)
...                 .split_by_folder(valid='valid'))
... stats = collect_stats(src.train)
... stats
RunningStatistics(n=9718464, mean=[0.128,0.128,0.128], std=[0.305,0.305,0.305])
```

That example isn't split yet, hence `src.train`; after splitting you'd use `src.train.x`, or you can run it from a databunch with `data.train_ds.x`. It defaults to collapsing the last two dimensions, as appropriate for image data. You can pass `n_dims` to change this and generate stats over arbitrary channels of arbitrary shapes.
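For anyone curious how a running-statistics collector can avoid holding all pixels in memory, here is a minimal sketch that merges per-batch moments with the parallel-variance formula (Chan et al.); the class name and structure are hypothetical and simpler than the linked gist:

```python
import numpy as np

class RunningStats:
    """Accumulate per-channel mean/std over batches without storing the data."""
    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None  # sum of squared deviations from the mean

    def update(self, batch):
        # batch: array of shape (channels, n_pixels); flatten spatial dims first
        b_n = batch.shape[1]
        b_mean = batch.mean(axis=1)
        b_m2 = ((batch - b_mean[:, None]) ** 2).sum(axis=1)
        if self.n == 0:
            self.n, self.mean, self.m2 = b_n, b_mean, b_m2
        else:
            delta = b_mean - self.mean
            tot = self.n + b_n
            # Chan et al. merge: combine two (n, mean, M2) summaries exactly
            self.mean = self.mean + delta * (b_n / tot)
            self.m2 = self.m2 + b_m2 + delta ** 2 * self.n * b_n / tot
            self.n = tot

    def std(self):
        return np.sqrt(self.m2 / self.n)
```

Each `update` costs one pass over the batch, and the merge step keeps the result numerically close to a single-pass computation over all the data at once.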

Performance on the code above was `[12396/12396 00:03<00:00]`, so about 3 seconds for ~12,000 28x28 images (off a reasonably fast NVMe drive, and having run before, so probably largely cached; disk IO is obviously a limit). Numerical stability seems good: it will be off from the true value by a little but is stable (`assert_allclose(rtol=0.001, atol=0.01)` across 4000 batches of `randn`).
