Same distribution for train and test sets

(Mahdi ) #1

Hi
when we say we want our train and test sets to have the same distribution what do we exactly mean ?

having about the same std and mean for both of them

or having about the same median and (25% , 75% …)

or something else ?

0 Likes

(Dien Hoa TRUONG) #2

Imagine you want to classify dog breeds, your can’t put your training set with color image of dog and black and white image in the test set. Or you want to predict house price, so your data in training set and test set should be in the same city.

Same std and mean is not enough, it also need to be in the same distribution (example Gaussian). Uniform distribution and Gaussian distribution even there have the same mean and std, they are not the same right ?

In brief, it just mean that your test set should be similar to your training set as much as possible.

0 Likes

(Mahdi ) #3

sorry , i should have mentioned it earlier
i am doing mnist lesson (lesson 8 of ML for coders)
in this case they have the same distribution

but, How can i change the distribution of my validation set if they would not have the same distribution
is that even a right work to do ?

thanks for your reply @dhoa

0 Likes

(Mahdi ) #4

should not we grab the pixel (for ex. row 2 column 2) for all the items of our data (50000 is here for training set photo below )
and sub mean and divide by std of them
and the same things for (ex. row 1 column 1) and to the end ?

0 Likes

(Mahdi ) #5

what if our validation set had bigger numbers of RGB on average in comparision with training set

and for validation set by subtracting mean and dividing std (mean and std of training)
it would not turn into mean = zero and std = one (not here , in this case)
what should we do ?

0 Likes

(Dien Hoa TRUONG) #6

I think we can not adjust the validation set mathematically like this. You need to choose carefully your validation set. It is normal that after normalization (based on the staticstic of the traininng set), your mean and std in your validation set is not zero and one (but should not be too different).

Validation set is the dataset that you use to double check that your model is doing a good job.Because sometime you fit too much your training set (even the noise) and you will not have a good prediction on the real data. So treat the validation set by the same process as the training set so you can track how your model works in the real data.

0 Likes