I came across a Kaggle discussion on how to check the similarity between train and test datasets. I have tried to replicate the idea in the kernel below, using the Porto Seguro competition data.
Two possible applications that I can think of:
- When we sample from our training data to create a validation set, we can compare that validation set to the test data to see how similar or dissimilar they are. This is a useful check when we are evaluating different models on a single validation set.
- The calculated weights can be passed as sample weights to a classifier, so that training observations closer to the test data count more. Classifiers like Random Forest, xgboost etc. accept them through the sample_weight argument of fit.
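To make both applications concrete, here is a minimal sketch of one common way to do this, often called adversarial validation. It is not necessarily what the kernel does step by step. It assumes scikit-learn and NumPy, runs on made-up synthetic data rather than the Porto Seguro files, and the function name adversarial_weights and the p / (1 - p) weighting are my own illustrative choices. The idea: label training rows 0 and test rows 1, fit a classifier to tell them apart, and read the AUC as a similarity score (close to 0.5 means the two sets look alike).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_weights(X_train, X_test, seed=0):
    """Hypothetical helper: return (auc, weights), one weight per training row."""
    # Stack both sets and label each row by origin: 0 = train, 1 = test.
    X_all = np.vstack([X_train, X_test])
    is_test = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])

    clf = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5, random_state=seed
    )
    # Out-of-fold probabilities, so no row is scored by a model that saw it.
    p_test = cross_val_predict(
        clf, X_all, is_test, cv=5, method="predict_proba"
    )[:, 1]

    # AUC near 0.5: train and test are hard to tell apart (similar).
    # AUC near 1.0: the classifier separates them easily (dissimilar).
    auc = roc_auc_score(is_test, p_test)

    # Assumed weighting: odds of "looks like test" for each training row,
    # clipped so that no single row gets an extreme weight.
    p_train = np.clip(p_test[: len(X_train)], 0.01, 0.99)
    weights = p_train / (1.0 - p_train)
    return auc, weights


# Synthetic demo data (made up for illustration only).
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(500, 3))
X_test_same = rng.normal(0.0, 1.0, size=(500, 3))     # same distribution
X_test_shifted = rng.normal(1.0, 1.0, size=(500, 3))  # shifted distribution

auc_same, _ = adversarial_weights(X_train, X_test_same)
auc_shifted, weights = adversarial_weights(X_train, X_test_shifted)
print("AUC, same distribution:   ", round(auc_same, 3))
print("AUC, shifted distribution:", round(auc_shifted, 3))
```

The same function can be called with a validation sample in place of X_train to check how closely that sample resembles the test data. For the second application, the returned weights can be passed when fitting the actual model, for example model.fit(X_train, y_train, sample_weight=weights) with a scikit-learn classifier.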
I encourage you guys to try this method on whichever competition you’re working on and let me know if it works.
I believe there are more ways that we can leverage this. Looking forward to some new ideas.