I’m interested in calculating my own image stats for normalization. I think this is worth trying because my images (medical) come from a different distribution than the CIFAR and ImageNet datasets. The values of these stats in fastai are:
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
cifar_stats = ([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])
mnist_stats = ([0.15, 0.15, 0.15], [0.15, 0.15, 0.15])
If I’m reading this correctly, these are the means and standard deviations of each RGB channel.
I calculated my own channel means and standard deviations, which came out as:
([177.51475147222848, 138.6587881454333, 178.55802581840624], [54.9887577728556, 60.98731138460821, 71.97104127673265])
Obviously mine are not scaled properly. Do I just divide all the values by 255? That would give:
([0.6961362802832489, 0.5437599535115032, 0.7002275522290441], [0.21564218734453175, 0.23916592699846356, 0.2822393775558143])
Am I calculating these correctly?
Yes. Image pixels are divided by 255 when they are first loaded from disk, then they are normalized. (The 255 part is correct; I did not look at your notebook.)
However, see Jeremy’s argument that imagenet_stats, not the dataset stats, should be applied whenever using a model that was pretrained with imagenet.
I’m not convinced this argument applies when training on something as “unnatural” as medical images, the likes of which the model has never seen. To me, it still makes sense to normalize those images with their own stats, so that the model sees inputs with the same statistics it saw after imagenet was normalized. However… in my actual experiments (with Kaggle Histopathological training data), normalizing by imagenet_stats vs. the full dataset stats made no appreciable difference to the model’s accuracy and ROC AUC. I would be very interested to hear what you discover.
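For anyone who wants to double-check the divide-by-255 step discussed above, here is a minimal sketch. It assumes 8-bit RGB images stacked as (N, H, W, C) and uses random data as a stand-in for real images. The point is only that mean and standard deviation both scale linearly, so dividing the raw stats by 255 gives the same result as computing stats on the scaled pixels.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a stack of 8-bit RGB images: (N, H, W, C), values in 0..255
images = rng.integers(0, 256, size=(16, 32, 32, 3), dtype=np.uint8)

# Stats on the raw 0..255 scale, one value per channel
raw_mean = images.mean(axis=(0, 1, 2))
raw_std = images.std(axis=(0, 1, 2))

# Stats after scaling the pixels to 0..1
scaled = images / 255.0
scaled_mean = scaled.mean(axis=(0, 1, 2))
scaled_std = scaled.std(axis=(0, 1, 2))

# Mean and std are both linear in a positive scale factor
assert np.allclose(scaled_mean, raw_mean / 255.0)
assert np.allclose(scaled_std, raw_std / 255.0)
```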
BTW, you can just add .normalize() without parameters, and fastai automatically normalizes the images using the mean and sd computed from a batch of your data.
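To make the idea of normalizing with batch statistics concrete, here is a rough NumPy sketch. This is not fastai's code: the function names batch_channel_stats and normalize are made up for illustration, and a channel-first (N, C, H, W) batch already scaled to 0..1 is assumed.

```python
import numpy as np


def batch_channel_stats(batch):
    """Per-channel mean and std of a batch shaped (N, C, H, W)."""
    mean = batch.mean(axis=(0, 2, 3))
    std = batch.std(axis=(0, 2, 3))
    return mean, std


def normalize(batch, mean, std):
    """Subtract the channel mean and divide by the channel std."""
    return (batch - mean[None, :, None, None]) / std[None, :, None, None]


rng = np.random.default_rng(0)
batch = rng.random((8, 3, 16, 16))  # stand-in for a batch scaled to 0..1
mean, std = batch_channel_stats(batch)
normed = normalize(batch, mean, std)

# After normalizing with its own stats, each channel of the batch
# has mean close to 0 and std close to 1
new_mean, new_std = batch_channel_stats(normed)
```

Passing imagenet_stats (or your own dataset stats) instead of the batch stats would use the same normalize step, just with fixed numbers.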
For the same competition I tried both and imagenet_stats worked better for me. The difference was not that much but imagenet_stats gave me a better rank.
Is there a faster way to do so?
I have about 1700 images and it’s already taking a LOT of time.
I’ve seen the following implementation in another post:
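from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt

data_dir = "xxxxx"  # your image directory
TRAIN = Path(data_dir)
# np.stack needs a sequence (not a generator); all images must share one shape
images = np.stack([plt.imread(str(i)) for i in TRAIN.iterdir()])  # this takes time
means = np.mean(images, axis=(0, 1, 2))  # per-channel mean
stds = np.std(images, axis=(0, 1, 2))  # per-channel standard deviation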
I don’t know if it is faster, but the problem is that my train directory contains subfolders of the classes, so (plt.imread(str(i)) for i in TRAIN.iterdir()) doesn’t work.
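One possible way around the subfolder problem is to walk the directory recursively and keep running sums instead of stacking every image in memory. The sketch below is only an illustration: find_images, channel_stats and the IMAGE_SUFFIXES set are names made up here, and it assumes every image loads as an (H, W, C) array with the same number of channels.

```python
from pathlib import Path

import numpy as np

# Assumed set of image extensions; adjust to match your data
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def find_images(root):
    """Yield image files under root, descending into class subfolders."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            yield path


def channel_stats(arrays):
    """Per-channel mean and std over an iterable of (H, W, C) arrays.

    Only running sums are kept, so images are never stacked in memory
    and may differ in height and width.
    """
    n_pixels = 0
    total = 0.0
    total_sq = 0.0
    for arr in arrays:
        pixels = np.asarray(arr, dtype=np.float64)
        pixels = pixels.reshape(-1, pixels.shape[-1])  # (H*W, C)
        total = total + pixels.sum(axis=0)
        total_sq = total_sq + (pixels ** 2).sum(axis=0)
        n_pixels += pixels.shape[0]
    mean = total / n_pixels
    # Var = E[x^2] - E[x]^2; clip tiny negatives caused by rounding
    std = np.sqrt(np.maximum(total_sq / n_pixels - mean ** 2, 0.0))
    return mean, std


# Example usage (needs matplotlib; "xxxxx" is the placeholder directory):
# import matplotlib.pyplot as plt
# mean, std = channel_stats(plt.imread(str(p)) for p in find_images("xxxxx"))
```

Note that plt.imread returns floats in 0..1 for PNG files and integer arrays for most other formats, so whether you still need to divide by 255 depends on the file type. Most of the time will probably still go to reading files from disk, so this mainly helps with memory and with the folder layout.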
Late reply but in case anyone else is looking for a solution to quickly collect statistics I’ve created some code to do this at https://gist.github.com/thomasbrandon/ad5b1218fc573c10ea4e1f0c63658469.
You can run it off any iterator that returns tensors. Something like:
>>> DATA = untar_data(URLs.MNIST_SAMPLE)
>>> src = ImageList.from_folder(DATA)
>>> stats = collect_stats(src.train)
>>> stats
RunningStatistics(n=9718464, mean=[0.128,0.128,0.128], std=[0.305,0.305,0.305])
That one is not split, so it uses src.train; you’d use src.train.x after splitting, or you can do it from a databunch with data.train_ds.x. It defaults to collapsing the last 2 dimensions, as appropriate for image data. You can pass n_dims to change this and generate stats of arbitrary channels of arbitrary shapes.
Performance on the code above was [12396/12396 00:03<00:00], so 3 secs for ~12000 28x28 images (off a reasonably fast NVMe and having run before so probably largely cached, disk IO obviously a limit). Numerical stability seems good: it will be off from the true value by a bit but is stable (assert_allclose(rtol=0.001, atol=0.01) across 4000 batches of randn).
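For readers curious how running statistics over batches can work in general, here is a small NumPy sketch that merges per-batch mean and variance using the standard pairwise-update formula. It is not the code from the gist: RunningChannelStats is a made-up name, and a channel-first (N, C, H, W) batch layout is assumed.

```python
import numpy as np


class RunningChannelStats:
    """Accumulate per-channel mean and std over batches shaped (N, C, H, W)."""

    def __init__(self, n_channels):
        self.n = 0                       # values seen per channel so far
        self.mean = np.zeros(n_channels)
        self.m2 = np.zeros(n_channels)   # sum of squared deviations from the mean

    def update(self, batch):
        x = np.asarray(batch, dtype=np.float64)
        # Collapse everything except the channel axis: (C, N*H*W)
        x = np.moveaxis(x, 1, 0).reshape(x.shape[1], -1)
        b_n = x.shape[1]
        b_mean = x.mean(axis=1)
        b_m2 = ((x - b_mean[:, None]) ** 2).sum(axis=1)
        # Merge the batch summary into the running summary
        delta = b_mean - self.mean
        total = self.n + b_n
        self.mean = self.mean + delta * b_n / total
        self.m2 = self.m2 + b_m2 + delta ** 2 * self.n * b_n / total
        self.n = total

    @property
    def std(self):
        # Population std, matching the default of np.std
        return np.sqrt(self.m2 / self.n)


rng = np.random.default_rng(0)
batches = [rng.standard_normal((5, 3, 4, 4)) + 2.0 for _ in range(20)]
stats = RunningChannelStats(n_channels=3)
for b in batches:
    stats.update(b)
```

Merging batch summaries this way avoids holding all the data at once and tends to be better behaved numerically than accumulating raw sums of squares.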