Why not drop SalesID from the training dataset?

While working through the ML course at fastai, I noticed that Jeremy never dropped the SalesID column in the bulldozer dataset, whereas I had always thought this kind of variable holds little value for training the algorithm. Then, when we reached feature importance, I saw it in the top 15. Is there any value in keeping this type of variable in the training dataset?

Well, think of it this way - sequential IDs may not be informative in themselves, but what if people who joined more recently act differently from those who have been long-time members? What if the IDs are not random but deterministic (always starting with 000 for one location, 001 for another, etc.)? It’s often OK to drop such a column, but I usually leave it in until I’m sure it’s not going to be predictive.
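Here is a rough sketch of that idea on made-up data (not the bulldozer dataset). It assumes IDs are handed out in sale order and that prices drift upward over time; the names `sequential_id`, `random_id`, and `price` are all hypothetical. Plain correlation stands in for feature importance here, just to show that an ordered ID can leak time information while a randomly assigned one does not.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_rows = 2000

# Assumption: IDs are assigned in sale order, and prices trend upward
# over time with some noise on top.
sequential_id = list(range(n_rows))
price = [10_000 + 5 * i + rng.gauss(0, 2_000) for i in sequential_id]

# The same IDs assigned at random, so they carry no time information.
random_id = sequential_id[:]
rng.shuffle(random_id)

corr_sequential = pearson(sequential_id, price)
corr_random = pearson(random_id, price)
print(f"sequential ID vs price: {corr_sequential:.2f}")
print(f"random ID vs price:     {corr_random:.2f}")
```

If the sequential ID shows a clear correlation and the shuffled one sits near zero, the ID is acting as a proxy for time, which is one reason a tree-based model might rank it highly.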

EDIT 8/12/2018 Here’s @jeremy explaining it better than I can. https://youtu.be/3jl2h9hSRvc?t=52m39s
