Calculating the mean and standard deviation for normalize

I see there are some posts here: Example of data.normalize for fastAI v1
Calculating New stats
Error in calculating mean and standard deviation of channels of image
But I still don’t understand the formula. Does it average over all the images but not over the channel dimension? I’m not sure.
I used the code from those posts, but my training pictures aren’t all the same size, so it fails. Maybe I should resize them in the dataloader first. Either way, I still don’t know how to write the code to calculate the mean and standard deviation myself.

Can someone help me?

Normalizing means getting the mean and standard deviation of each channel.
Thus you get something like this:
`imagenet_stats` : `([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`
Note that you also have to divide by 255, since these stats are for pixel values in the 0–1 range.
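To make that concrete, here is a minimal sketch (not fastai’s actual code) of computing per-channel stats; the batch shape and image size are just assumptions for illustration:

```python
import numpy as np

# Hypothetical batch of 8 RGB images, 64x64, as uint8 pixels in 0-255 (N, H, W, C)
imgs = np.random.randint(0, 256, (8, 64, 64, 3), dtype=np.uint8)

# Divide by 255 first, so the stats match the 0-1 range fastai uses
x = imgs.astype(np.float32) / 255.

# Per-channel stats: reduce over the batch, height, and width axes,
# keeping only the channel axis -- one mean and one std per channel
mean = x.mean(axis=(0, 1, 2))   # shape (3,), like [0.485, 0.456, 0.406]
std = x.std(axis=(0, 1, 2))     # shape (3,), like [0.229, 0.224, 0.225]
```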

Hi, I tried many code snippets from the forum, but they all throw an error that the sizes are not the same.

1. My dataset is very different from ImageNet and the MNIST dataset. I heard Lesson 7 (2018) talks about this, but I can’t find it.

2. Why do we divide by 255? Is it because we want to resize the images to 255 for training?

3. I still don’t understand how to calculate it. I haven’t even seen the formula.

4. I’ve heard that we need to count the total number of images in the training set. Is this true?
Thank you.

You should just use `.normalize()` on your DataBunch for it to normalize according to your own images.

1. Because if you calculate the pixel stats manually, they will be in the 0–255 range, but fastai later uses tensors in the 0–1 range, so you need to divide by 255.

2. You can check how it’s done in fastai, but it’s as simple as calculating the mean and standard deviation over each channel of your images.

3. fastai uses a single batch to calculate the stats, but iterating over your whole DataBunch may give more accurate results.
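Iterating the whole set can be sketched like this (not fastai’s internals; the batch shapes are assumptions, and it assumes equal-size batches, which `drop_last=True` gives you):

```python
import numpy as np

# Hypothetical stand-in for a dataloader: 5 equal-size batches of
# shape (batch_size, channels, height, width), already scaled to 0-1
batches = [np.random.rand(4, 3, 32, 32).astype(np.float32) for _ in range(5)]

mean_sum = np.zeros(3)
std_sum = np.zeros(3)
for b in batches:
    # Each term is a per-channel average over this one batch
    mean_sum += b.mean(axis=(0, 2, 3))
    std_sum += b.std(axis=(0, 2, 3))

# Average the per-batch stats over the number of batches
mean = mean_sum / len(batches)
std = std_sum / len(batches)
```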


Hi, if I’m going to resize the images to 224 in the dataloader, should I calculate the mean and std after resizing, or before?

That’s a good question; I’m not sure. You should try both and see what works best ^^


Can’t see why it would make any significant difference.


Hi, while creating the data for the critic, why are we using crappified and actual images?
The critic is supposed to classify between generated images and actual images, and the generator will improve the generated images accordingly.
So in that case, shouldn’t the critic data contain generated images and actual images?

Thanks,

Should I divide the sums of the means and stds by the number of batches, or by the total number of examples?

I’m calculating stats for medical images. When I divide the sums by the batch count, the results are:

```
[0.6284, 0.5640, 0.6074]
[0.1892, 0.1842, 0.1885]
```

But when I divide the sums by the total number of examples, the results are:

```
[0.0196, 0.0176, 0.0190]
[0.0059, 0.0058, 0.0059]
```

I divided all the image tensors by 255 before applying the normalization. Compared with the ImageNet, CIFAR, and MNIST stats, the second set of results here seems negligibly small.

Here’s my calculation code in `fastai2`:

```
fnames = get_image_files('data/TR-combined')
ds = Datasets(fnames, tfms=Pipeline([PILImage.create, Resize(320), ToTensor]))
dl = TfmdDL(ds, bs=32, after_batch=[IntToFloatTensor], drop_last=True)

mean, std = 0., 0.
for b in progress_bar(dl):
    mean += b[0].mean((0, 2, 3))  # per-channel mean of this batch
    std += b[0].std((0, 2, 3))

# nsamples: should this be the batch count or the example count?
print(mean / nsamples)
print(std / nsamples)
```

Thank you, I will bookmark your solution. Thanks again for the help.

I guess the answer to this question is that you should divide by the number of batches. In this procedure we’re already calculating the `mean` over each batch, which means the sum has already been divided by the batch size during that step.
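A tiny worked example of that reasoning, using made-up numbers (assuming equal-size batches): averaging the per-batch means over the number of batches recovers the true global mean, while dividing by the total number of examples divides by the batch size twice and gives a value that is too small by exactly that factor.

```python
import numpy as np

data = np.arange(12, dtype=np.float32)   # 12 "pixel" values: 0..11, true mean 5.5
batches = data.reshape(3, 4)             # 3 batches of size 4

batch_means = batches.mean(axis=1)       # each already divided by batch size 4
by_batches = batch_means.sum() / 3       # divide by number of batches -> 5.5
by_examples = batch_means.sum() / 12     # divide by total examples -> 1.375 (4x too small)
```

This matches the stats above: the "divide by examples" numbers were roughly batch-size times smaller than the "divide by batches" ones.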