Hahaha - yes I have
Excited it’s coming next week
Thanks Jeremy, Rachel & Sylvain.
Can someone please point to what entity embedding mean from the last bit of the lesson? Is it the one hot encoding that we discussed or something else?
Thanks, see you all next week!
The enhancements that make Random Forest such a powerful Decision Tree model are:

- Bootstrap sampling, which selects a random subset of the data (i.e. rows in the data table) to construct each decision tree, and
- Ensembling, which constructs a group of models (in this case, a 'forest' of trees) and averages their votes to make each final classification.

These enhancements are analogous in their application and their effect to the Dropout technique in Neural Networks.
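The two enhancements above can be sketched in a few lines of NumPy. This is a toy illustration (the "model" is just the sample mean, standing in for a fitted tree), not the actual Random Forest implementation:

```python
import numpy as np

# Toy sketch of bootstrap sampling + ensembling.
# Each "tree" is replaced by a trivial model: the mean of its bootstrap sample.
rng = np.random.default_rng(42)
data = np.arange(100, dtype=float)

preds = []
for _ in range(50):
    # Bootstrap sampling: draw rows with replacement
    idx = rng.integers(0, len(data), size=len(data))
    preds.append(data[idx].mean())  # stand-in for one tree's prediction

# Ensembling: average the "forest" of predictions
ensemble = np.mean(preds)
print(ensemble)
```

Each individual bootstrap estimate is noisy, but the ensemble average lands close to the full-data answer, which is the point of averaging many de-correlated models.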
For those interested in exploring said paper, I’ve opened up a thread here:
Thanks for the lesson Rachel, Jeremy, and team! Since Kaggle doesn't seem to have many open competitions using tabular data at present, I'm sharing the SIOP ML Competition as another place to find opportunities for putting chapter 9 through its paces!
Thanks FraPochetti, I really liked your post!
Entity embedding means replacing categorical columns with learnable embeddings.
In the notebook, the categorical columns were already replaced by numerical values when doing random forest and we used the same for NN. How do we get these values?
Also for more on that and where it came from @rachel wrote a wonderful article: https://www.fast.ai/2018/04/29/categorical-embeddings/
TL;DR? Took inspiration from Word2Vec
They're tied to the cardinality of that particular variable. An easy example: say we have Jack, Queen, King, and Ace. We can't simply pass those in, as they're not numbers! Instead we represent each separate option as an integer, so we could say Jack = 0, Queen = 1, etc. This scales up the same way and is the main method for utilizing categorical data.
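In pandas this integer coding is one line via a categorical dtype. A small sketch of the card example (the category order here is an explicit, arbitrary choice so the codes are predictable):

```python
import pandas as pd

# Turn card ranks into integer codes, as described above.
ranks = pd.Series(["Jack", "Queen", "King", "Ace"], dtype="category")
# Fix an explicit order so the codes come out Jack=0, Queen=1, ...
ranks = ranks.cat.set_categories(["Jack", "Queen", "King", "Ace"], ordered=True)
print(dict(zip(ranks, ranks.cat.codes)))  # {'Jack': 0, 'Queen': 1, 'King': 2, 'Ace': 3}
```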
The tabular_learner internally inserts embedding layers for the columns you specified as categorical.
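Under the hood an embedding layer is just an index into a learnable weight matrix, one row per category. A minimal NumPy sketch of that lookup (the cardinality and embedding size here are made up; in fastai the weights would be trained by backprop rather than left random):

```python
import numpy as np

rng = np.random.default_rng(0)

cardinality = 4   # e.g. Jack, Queen, King, Ace
emb_dim = 3       # size of the learned vector per category
emb_weights = rng.normal(size=(cardinality, emb_dim))  # learned in practice

codes = np.array([0, 2, 2, 3])   # integer codes for one categorical column
embedded = emb_weights[codes]    # lookup: shape (4, 3)
print(embedded.shape)
```

Rows with the same category code get the identical embedding vector, which is what lets the model learn a shared representation for each category.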
Great lesson! Thanks and kudos to all the fast.ai team
Thanks for the lecture! This is probably one of the most practically useful lessons for myself, and I’m guessing a lot of others who work regularly with tabular data.
Is there a way to get the tabular learner to take a sliding window over rows of the dataset?
You could use that for predicting a time series like machine failure? https://www.kaggle.com/c/machine-failure-prediction/data
You can simply use what I call a "time step" and pass in those previous rows as inputs. I did some work with this on movement identification with very good results. E.g. with a window of 3 and 8 variables, one row becomes 24 variables. You'd probably need to rearrange the table, but it does work.
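That table rearrangement can be done with pandas shifts. A hedged sketch with two hypothetical sensor columns and a window of 3 (each surviving row carries the current and two previous time steps as extra features):

```python
import pandas as pd

# Widen the table so each row also carries the previous rows' values.
df = pd.DataFrame({"s1": [1, 2, 3, 4, 5], "s2": [10, 20, 30, 40, 50]})

window = 3
lagged = pd.concat(
    {f"t-{k}": df.shift(k) for k in range(window)}, axis=1
).dropna()  # drop the leading rows that lack a full window

print(lagged.shape)  # 3 rows remain, 2 columns * window = 6 features
```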
Another option may be to use a 1d convolutional neural network to learn the most relevant filters (i.e., sliding windows). Since CNNs often reduce to dense layers at the end, you could even concatenate activations from the time-series CNN model with activations from the tabular model of the other metadata, or fashion it as a Siamese network.
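A hedged PyTorch sketch of that two-branch idea (all layer sizes, channel counts, and names here are made up for illustration): a 1-D CNN pools over the time axis, and its activations are concatenated with a dense encoding of the tabular metadata before the final head.

```python
import torch
import torch.nn as nn

class SeriesPlusTabular(nn.Module):
    def __init__(self, n_channels=1, n_tabular=8, n_out=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=5, padding=2),  # learned sliding windows
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis
        )
        self.tab = nn.Sequential(nn.Linear(n_tabular, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, n_out)  # concatenated branches

    def forward(self, series, tabular):
        a = self.conv(series).squeeze(-1)  # (batch, 16)
        b = self.tab(tabular)              # (batch, 16)
        return self.head(torch.cat([a, b], dim=1))

model = SeriesPlusTabular()
out = model(torch.randn(4, 1, 50), torch.randn(4, 8))
print(out.shape)
```

The Conv1d filters play the role of the "sliding windows" mentioned above, learned from data rather than hand-specified.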