And I guess what you suggest isn’t actually done…
(Jeremy explained this later)
Just consider that what your model learned about a particular year (let’s say the split point) in the training set
will be completely different from what it will be validated on…
That will do nothing beneficial, and might even make the model collapse (especially if the sizes of the splits before and after that year are in, say, a 9:1 ratio, and our model gives wrong predictions).
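To make the scenario concrete, here’s a minimal sketch of the kind of year-based split being discussed (the column names, values, and cutoff are made up for illustration, not taken from the course data):

```python
import pandas as pd

# Toy data: the year column and cutoff below are assumptions for illustration.
df = pd.DataFrame({
    "saleYear": [2007, 2008, 2009, 2010, 2011, 2012],
    "price":    [10_000, 12_000, 9_500, 14_000, 13_500, 15_000],
})

cutoff = 2011
train = df[df.saleYear < cutoff]   # model trains only on earlier years...
valid = df[df.saleYear >= cutoff]  # ...and is validated on later ones

print(len(train), len(valid))  # 4 2 -- the two sides can be very unbalanced
```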
Your concern is quite right, in a strict mathematical sense. For most real-world datasets (including this one), this won’t be an issue. If you do it at a more granular level, however, it can become an issue, and my friends Nina Zumel and John Mount have written an excellent paper and library about how to handle that situation, if you’re interested: https://arxiv.org/abs/1611.09477
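As a rough illustration of where the granular case bites (a simple diagnostic only, not the method from the Zumel/Mount paper), you can check whether the validation set contains category levels the training set never saw:

```python
import pandas as pd

# Assumed toy series; any categorical column from a real split would do.
train_levels = pd.Series(["A", "A", "B", "C"])
valid_levels = pd.Series(["B", "C", "D"])

# Levels that appear in validation but never in training -- the situation
# where naive per-level statistics break down.
unseen = set(valid_levels) - set(train_levels)
print(unseen)  # {'D'}
```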
It’s always possible to do smarter feature engineering, but the trick is to know when it’s helpful and worth the investment of time. In this case, as you’ll see later in the course, creating a time difference variable doesn’t generally improve the predictive accuracy of a random forest, but can help with interpretation.
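For instance, a time-difference variable of the sort mentioned above might look like the sketch below (the column names are assumed for illustration):

```python
import pandas as pd

# Hypothetical columns: year of sale and year of manufacture.
df = pd.DataFrame({
    "saleYear": [2008, 2010, 2011],
    "YearMade": [1999, 2004, 2010],
})

# Age at sale: a derived feature that is easy to reason about, even if it
# doesn't change a random forest's accuracy much.
df["age_at_sale"] = df.saleYear - df.YearMade
print(df)
```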
Jeremy, in Lecture 7, approximately at minute 17:20, you talk about what to do when you have an unbalanced dataset, and you refer to a paper that found that oversampling the less common class was the best approach. Do you remember which paper it was?
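(For anyone reading along, one common recipe for oversampling the minority class looks like the sketch below; this is a generic approach, not necessarily the exact method from the paper being asked about.)

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 8 negatives, 2 positives.
df = pd.DataFrame({
    "x": range(10),
    "y": [0] * 8 + [1] * 2,
})
majority = df[df.y == 0]
minority = df[df.y == 1]

# Resample the minority class with replacement up to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced.y.value_counts())  # 8 of each class
```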
This is designed to be a standalone intro to machine learning - we shouldn’t be asking people to read other books to understand it! It sounds like we may need to add more information to the notebooks to help people interpret them.
Is the low accuracy when you OHE all the variables due to each tree seeing only a random subset of the features? Each tree would then have less information to learn from.
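A quick sketch of that intuition (toy data; note that in scikit-learn the subset is drawn per split via `max_features`, rather than once per tree):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One-hot encoding turns a single categorical column into several columns.
df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                   "y": [0, 1, 1, 0]})
X = pd.get_dummies(df[["color"]])  # 1 column -> 3 one-hot columns
print(X.shape)                     # (4, 3)

# With max_features=0.5, each split samples half of the (now many) columns,
# so any individual one-hot column is less likely to be considered.
rf = RandomForestClassifier(n_estimators=10, max_features=0.5,
                            random_state=0).fit(X, df.y)
```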