Same distribution for train and test sets

MahdiRezaei · July 26, 2019, 8:21am

Hi
when we say we want our train and test sets to have the same distribution what do we exactly mean ?

having about the same std and mean for both of them

or having about the same median and (25% , 75% …)

or something else ?

dhoa · July 26, 2019, 10:17am

Imagine you want to classify dog breeds, your can’t put your training set with color image of dog and black and white image in the test set. Or you want to predict house price, so your data in training set and test set should be in the same city.

Same std and mean is not enough, it also need to be in the same distribution (example Gaussian). Uniform distribution and Gaussian distribution even there have the same mean and std, they are not the same right ?

In brief, it just mean that your test set should be similar to your training set as much as possible.

MahdiRezaei · July 26, 2019, 1:11pm

sorry , i should have mentioned it earlier
i am doing mnist lesson (lesson 8 of ML for coders)
in this case they have the same distribution

but, How can i change the distribution of my validation set if they would not have the same distribution
is that even a right work to do ?

thanks for your reply @dhoa

MahdiRezaei · July 26, 2019, 1:32pm

should not we grab the pixel (for ex. row 2 column 2) for all the items of our data (50000 is here for training set photo below )
and sub mean and divide by std of them
and the same things for (ex. row 1 column 1) and to the end ?

MahdiRezaei · July 26, 2019, 1:58pm

what if our validation set had bigger numbers of RGB on average in comparision with training set

and for validation set by subtracting mean and dividing std (mean and std of training)
it would not turn into mean = zero and std = one (not here , in this case)
what should we do ?

dhoa · July 26, 2019, 2:51pm

I think we can not adjust the validation set mathematically like this. You need to choose carefully your validation set. It is normal that after normalization (based on the staticstic of the traininng set), your mean and std in your validation set is not zero and one (but should not be too different).

Validation set is the dataset that you use to double check that your model is doing a good job.Because sometime you fit too much your training set (even the noise) and you will not have a good prediction on the real data. So treat the validation set by the same process as the training set so you can track how your model works in the real data.