MNIST stats look wrong

TomB · September 26, 2019, 7:14am

Is it known that the mnist stats are wrong?

>>> mnist_stats
([0.15, 0.15, 0.15], [0.15, 0.15, 0.15])

The mean is repeated for the std. I calculated the stats across the URLs.MNIST_SAMPLE training set and got:

mean=[0.128,0.128,0.128], std=[0.305,0.305,0.305]

This is close to the values in the PyTorch MNIST example, which uses (0.1307, 0.3081).
I'm not sure about fixing them, since that could throw off existing models, but maybe at least for v2.

The code I used to collect the stats is in the thread "Calculating our own image stats (imagenet_stats, cifar_stats etc.)", since people were asking about collecting stats.
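For readers without the linked thread to hand, here is a minimal sketch of one way to collect per-channel stats. It is not the code from that thread: the function name `running_channel_stats` and the random stand-in batches are assumptions for illustration, and it expects batches shaped (N, C, H, W) with pixel values already scaled to [0, 1].

```python
import numpy as np

def running_channel_stats(batches):
    """Accumulate per-channel mean and std over an iterable of image batches.

    Each batch is assumed to be a float array of shape (N, C, H, W).
    Sums and sums of squares are accumulated so the whole dataset never
    has to sit in memory at once.
    """
    count = 0          # number of pixels seen per channel
    total = None       # per-channel sum of pixel values
    total_sq = None    # per-channel sum of squared pixel values
    for batch in batches:
        batch = np.asarray(batch, dtype=np.float64)
        n, c, h, w = batch.shape
        if total is None:
            total = np.zeros(c)
            total_sq = np.zeros(c)
        total += batch.sum(axis=(0, 2, 3))
        total_sq += (batch ** 2).sum(axis=(0, 2, 3))
        count += n * h * w
    mean = total / count
    # Population variance: E[x^2] - E[x]^2, clipped to avoid tiny negatives.
    var = np.maximum(total_sq / count - mean ** 2, 0.0)
    return mean, np.sqrt(var)

# Hypothetical stand-in data; replace with batches from a real data loader.
rng = np.random.default_rng(0)
fake_batches = [rng.random((16, 3, 28, 28)) for _ in range(4)]
mean, std = running_channel_stats(fake_batches)
print("mean:", mean, "std:", std)
```

Accumulating sums rather than averaging per-batch stats keeps the result exact even when the last batch is smaller than the others.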

sgugger · September 26, 2019, 1:23pm

I don’t remember when they were computed and put there, so it’s possible they are wrong. Note that MNIST SAMPLE only has two classes, so it’s not the right dataset for computing the mean and std. They should be computed on the whole training set of the real MNIST dataset.

TomB · September 26, 2019, 1:34pm

Looking at the history, they come from a commit by Jeremy that added the dataset and the stats, and they haven’t been edited since.
Yeah, quite true about the sample, it was just what I was working on. Should I calculate the correct values on the full dataset and submit a PR? I wasn’t sure, given the possible issues for people using the new stats with a model trained on the old ones.
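To make that compatibility concern concrete, here is a small sketch of how the same pixel values come out under the two sets of stats quoted earlier in the thread: the current mnist_stats (0.15, 0.15) and the PyTorch example's (0.1307, 0.3081). The `normalize` helper is a hypothetical stand-in for the usual (x - mean) / std step, not a library function.

```python
import numpy as np

def normalize(x, mean, std):
    # Standard per-pixel normalization: subtract the mean, divide by the std.
    return (x - mean) / std

# A black pixel (0.0) and a white pixel (1.0), with values scaled to [0, 1].
pixels = np.array([0.0, 1.0])

old = normalize(pixels, 0.15, 0.15)        # current mnist_stats
new = normalize(pixels, 0.1307, 0.3081)    # PyTorch MNIST example values

print("old stats:", old)
print("new stats:", new)
```

Because the std roughly doubles, inputs normalized with the new values span about half the range, which is why a model trained with the old stats would see shifted inputs if the stats were swapped without retraining.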

sgugger · September 26, 2019, 6:26pm

Yes please. Models on MNIST train quickly anyway.