# Calculating our own image stats (imagenet_stats, cifar_stats etc.)

I’m interested in calculating my own image stats for normalization. I think this is worth trying because my images (medical) come from a different distribution than the CIFAR and ImageNet datasets. The values of these stats in fastai are:

• `imagenet_stats`: `([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`
• `cifar_stats`: `([0.491, 0.482, 0.447], [0.247, 0.243, 0.261])`
• `mnist_stats`: `([0.15, 0.15, 0.15], [0.15, 0.15, 0.15])`

If I’m reading this correctly, these are the means and standard deviations of each RGB channel.

I have tried to calculate my own channel means and standard deviations which result in:

`([177.51475147222848, 138.6587881454333, 178.55802581840624], [54.9887577728556, 60.98731138460821, 71.97104127673265])`

Obviously mine are not scaled properly. Do I just divide all the values by `255`? That would give:

`([0.6961362802832489, 0.5437599535115032, 0.7002275522290441], [0.21564218734453175, 0.23916592699846356, 0.2822393775558143])`

Am I calculating these correctly?


Yes. Image pixels are divided by 255 when they are first loaded from disk, then they are normalized. (The 255 part is correct; I did not look at your notebook.)
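To make that pipeline concrete, here is a minimal sketch of the two steps (scale by 255, then normalize per channel), using the `imagenet_stats` values quoted above and a synthetic image:

```python
import numpy as np

# Synthetic 8-bit RGB image standing in for a loaded file
img = np.random.default_rng(0).integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

mean = np.array([0.485, 0.456, 0.406])  # imagenet_stats means
std = np.array([0.229, 0.224, 0.225])   # imagenet_stats standard deviations

x = img / 255.0        # step 1: pixels are scaled to [0, 1] on load
x = (x - mean) / std   # step 2: normalize each RGB channel
```

So stats you compute yourself should indeed be on the [0, 1] scale, i.e. divided by 255 if you accumulated raw 8-bit pixel values.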

However, see Jeremy’s argument that imagenet_stats, not the dataset stats, should be applied whenever using a model that was pretrained with imagenet.

I’m not convinced this argument applies when training on something as “unnatural” as medical images the likes of which the model has never seen. To me, it still makes sense to normalize those images to the same stats as the model saw after imagenet was normalized. However… in my actual experiments (with Kaggle Histopathological training data), normalizing by imagenet_stats vs. the full dataset stats made no appreciable difference to the model’s accuracy and ROCAUC. I would be very interested to hear what you discover.

BTW, you can just add `.normalize()` (without parameters), and fastai automatically normalizes images per batch using the mean and SD of the batch.


For the same competition I tried both and imagenet_stats worked better for me. The difference was not that much but imagenet_stats gave me a better rank.


Is there a faster way to do so?

I have about 1700 images and it’s already taking a LOT of time.

I've seen the following implementation in another post:

```python
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

data_dir = "xxxxx"  # your image directory
TRAIN = Path(data_dir)
images = (plt.imread(str(i)) for i in TRAIN.iterdir())  # generator expression
images = np.stack(images)  # this takes time
means = np.mean(images, axis=(0, 1, 2))
stds = np.std(images, axis=(0, 1, 2))
```

I don't know if it is faster, but the problem is that my train directory contains subfolders for the classes, so `(plt.imread(str(i)) for i in TRAIN.iterdir())` doesn't work.

Late reply but in case anyone else is looking for a solution to quickly collect statistics I’ve created some code to do this at https://gist.github.com/thomasbrandon/ad5b1218fc573c10ea4e1f0c63658469.
You can run it off any iterator that returns tensors. Something like:

```python
>>> DATA = untar_data(URLs.MNIST_SAMPLE)
... src = (ImageList.from_folder(DATA)
...                 .split_by_folder(valid='valid'))
... stats = collect_stats(src.train)
... stats
RunningStatistics(n=9718464, mean=[0.128,0.128,0.128], std=[0.305,0.305,0.305])
```

That example isn't split yet, hence `src.train`; after splitting you'd use `src.train.x`, or you can run it from a databunch with `data.train_ds.x`. It defaults to collapsing the last two dimensions, as appropriate for image data. You can pass `n_dims` to change this and generate stats over arbitrary channels of arbitrary shapes.
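For anyone curious how a running-statistics collector can avoid holding all pixels in memory, here is a minimal sketch that merges per-batch moments with the parallel-variance formula (Chan et al.); the class name and structure are hypothetical and simpler than the linked gist:

```python
import numpy as np

class RunningStats:
    """Accumulate per-channel mean/std over batches without storing the data."""
    def __init__(self):
        self.n = 0
        self.mean = None
        self.m2 = None  # sum of squared deviations from the mean

    def update(self, batch):
        # batch: array of shape (channels, n_pixels); flatten spatial dims first
        b_n = batch.shape[1]
        b_mean = batch.mean(axis=1)
        b_m2 = ((batch - b_mean[:, None]) ** 2).sum(axis=1)
        if self.n == 0:
            self.n, self.mean, self.m2 = b_n, b_mean, b_m2
        else:
            delta = b_mean - self.mean
            tot = self.n + b_n
            # Chan et al. merge: combine two (n, mean, M2) summaries exactly
            self.mean = self.mean + delta * (b_n / tot)
            self.m2 = self.m2 + b_m2 + delta ** 2 * self.n * b_n / tot
            self.n = tot

    def std(self):
        return np.sqrt(self.m2 / self.n)
```

Each `update` costs one pass over the batch, and the merge step keeps the result numerically close to a single-pass computation over all the data at once.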

Performance on the code above was `[12396/12396 00:03<00:00]`, so about 3 seconds for ~12,000 28x28 images (off a reasonably fast NVMe drive, and having run before, so probably largely cached; disk IO is obviously a limit). Numerical stability seems good: it will be off from the true value by a little but is stable (`assert_allclose(rtol=0.001, atol=0.01)` across 4000 batches of `randn`).
