Test Train Similarity

I came across a Kaggle discussion on how to check the similarity between train and test datasets, and I have tried to replicate the idea in the kernel below, using data from the Porto Seguro competition.

Kernel: https://www.kaggle.com/shikhar1/train-test-similarity/notebook
Discussion: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/43453

Two possible applications that I can think of:

  • When we sample from our training data to create a validation set, we can compare that validation set with the test data to see how similar or dissimilar they are. This can be a good check if we are evaluating different models on a single validation set.
  • The weights calculated can be used as sample weights for a classifier, so that training observations closer to the test data count for more. In scikit-learn style APIs this is the sample_weight argument to fit(), which classifiers like Random Forest and xgboost accept.
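
For anyone who wants to try both ideas without opening the kernel, here is a minimal sketch of one common way to do it. It is my own reconstruction, not necessarily what the linked kernel does: stack the two datasets, label each row by where it came from, and train a classifier to tell them apart. A cross-validated AUC close to 0.5 suggests the two sets look alike, and p / (1 - p) for each training row gives a weight that grows as the row looks more like test data. The function name train_test_similarity, the Random Forest settings, and the clipping bounds are all assumptions chosen for illustration, and the inputs are assumed to be numeric arrays with no missing values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def train_test_similarity(X_train, X_test, seed=0):
    """Hypothetical helper: return (auc, weights) from a train-vs-test classifier."""
    X_train = np.asarray(X_train)
    X_test = np.asarray(X_test)
    X = np.vstack([X_train, X_test])

    # Origin label: 0 = row came from train, 1 = row came from test.
    is_test = np.concatenate([
        np.zeros(len(X_train), dtype=int),
        np.ones(len(X_test), dtype=int),
    ])

    clf = RandomForestClassifier(
        n_estimators=100, min_samples_leaf=5, random_state=seed
    )
    # Out-of-fold probabilities, so no row is scored by a model that saw it.
    p_test = cross_val_predict(
        clf, X, is_test, cv=5, method="predict_proba"
    )[:, 1]

    # AUC near 0.5: train and test are hard to tell apart.
    # AUC near 1.0: the two sets are clearly different.
    auc = roc_auc_score(is_test, p_test)

    # Weight for each training row: p / (1 - p), clipped to avoid extreme
    # values, then rescaled so the average weight is 1.
    p_train = np.clip(p_test[: len(X_train)], 0.01, 0.99)
    weights = p_train / (1.0 - p_train)
    weights = weights / weights.mean()
    return auc, weights
```

The same function can be called with a validation sample in place of X_train to compare it against the test data. The returned weights could then be passed along the lines of model.fit(X_train, y_train, sample_weight=weights).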

I encourage you guys to try this method on whichever competition you’re working on and let me know if it works.
I believe there are more ways that we can leverage this. Looking forward to some new ideas.