Jeremy’s final words during today’s course: Next week NLP & Computer Vision
Thanks Jeremy and Rachel
Thanks for another great lecture! Interesting to learn more about some “traditional” ML techniques like random forests!
Hahaha - yes I have
Excited it’s coming next week
Thanks Jeremy, Rachel & Sylvain.
Can someone please point to what entity embedding mean from the last bit of the lesson? Is it the one hot encoding that we discussed or something else?
Thanks, see you all next week!
The enhancements that make Random Forest such a powerful Decision Tree model are:
- Bootstrap sampling, which is selecting a random subset of the data (i.e. rows in the data table) to construct each decision tree,
- Selecting a random subset of the features (i.e. columns in the data table) to make a ‘split’ at each ‘node’ in a decision tree, and
- Ensembling, which is constructing a group of models (in this case, a ‘forest’ of trees) and averaging their votes to make each final classification.

The first two of these enhancements are analogous in their application and their effect to the Dropout technique in Neural Networks.
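In scikit-learn terms (just a rough sketch, not the lesson notebook), those three enhancements map onto RandomForestClassifier arguments roughly like this:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,    # ensembling: average the votes of 100 trees
    bootstrap=True,      # bootstrap sampling: each tree is built on a random sample of rows
    max_features='sqrt', # a random subset of features is considered at each split
)
# rf.fit(X_train, y_train)  # X_train / y_train stand in for whatever tabular data you have
```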
For those interested in exploring said paper, I’ve opened up a thread here:
Thanks for the lesson Rachel, Jeremy, and team! As Kaggle doesn't seem to have many open competitions using tabular data at present, I'm sharing the SIOP ML Competition as another place to find opportunities for putting chapter 9 through its paces!
Thanks FraPochetti, I really liked your post!
Thanks @FraPochetti
Entity embedding is replacing categorical columns with learnable embeddings.
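Concretely (a tiny PyTorch sketch of the idea, not the lesson code): a categorical column with, say, 4 levels stops being a one-hot vector and becomes a row in a small learnable lookup table:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=4, embedding_dim=2)  # 4 levels -> 2-d vectors
codes = torch.tensor([0, 3, 1])  # three rows of the column, already as integer codes
print(emb(codes))                # the looked-up vectors; their values are learned during training
```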
In the notebook, the categorical columns were already replaced by numerical values when doing the random forest, and we used the same for the NN. How do we get these values?
Also, for more on that and where it came from, @rachel wrote a wonderful article: Redirect
TL;DR? Took inspiration from Word2Vec
They’re tied to the cardinality of that particular variable. An easy example: say we have Jack, Queen, King, and Ace. We can’t simply pass those in as they’re not numbers! Instead, we represent each separate option as an integer, so we could say Jack = 0, Queen = 1, etc. This scales up the same way for any number of levels, and it’s the main method for utilizing categorical data.
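If it helps, this is roughly what pandas (and fastai's Categorify proc) does under the hood; the exact codes depend on how the categories get ordered, so treat this as an illustration only:

```python
import pandas as pd

ranks = pd.Series(['Jack', 'Queen', 'King', 'Ace'], dtype='category')
print(dict(enumerate(ranks.cat.categories)))  # {0: 'Ace', 1: 'Jack', 2: 'King', 3: 'Queen'}
print(ranks.cat.codes.tolist())               # [1, 3, 2, 0] -- the integers the model actually sees
```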
The tabular_learner internally inserts an embedding layer for each of the columns you specified as categorical.
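For example (a minimal sketch with made-up column names, not the exact notebook code):

```python
from fastai.tabular.all import *

# df is assumed to be a pandas DataFrame with a categorical 'card_rank' column,
# a continuous 'bet_size' column, and a 'target' column to predict.
dls = TabularDataLoaders.from_df(
    df,
    cat_names=['card_rank'],               # these columns get embedding layers
    cont_names=['bet_size'],
    y_names='target',
    y_block=CategoryBlock,
    procs=[Categorify, FillMissing, Normalize],
)
learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy)
learn.model  # prints the model; you can see an Embedding layer per categorical column
```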
Thanks @muellerzr and @vijayabhaskar
Great lesson! Thanks and kudos to all the fast.ai team
Thanks for the lecture! This is probably one of the most practically useful lessons for myself, and I’m guessing a lot of others who work regularly with tabular data.