Lesson 7 - Official topic

Hahaha - yes I have :slight_smile:

Excited it’s coming next week :slight_smile:


Thanks Jeremy, Rachel & Sylvain.


Can someone please point to what entity embedding means, from the last bit of the lesson? Is it the one-hot encoding that we discussed, or something else?

Thanks, see you all next week!


The enhancements that make Random Forest such a powerful Decision Tree model are:

  1. Bootstrap sampling, which is selecting a random subset of the data (i.e. rows in the data table) to construct each decision tree,
  2. Selecting a random subset of the features (i.e. columns in the data table) to make a ‘split’ at each ‘node’ in a decision tree, and
  3. Ensembling, which is constructing a group of models (in this case, a ‘forest’ of trees) and averaging their votes to make each final classification.

The first two of these enhancements are analogous in their application and their effect to the Dropout technique in Neural Networks.
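
For concreteness, here’s a minimal scikit-learn sketch of those three ideas (the dataset is synthetic, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for a real table
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,     # 3. ensembling: average the votes of 100 trees
    bootstrap=True,       # 1. each tree is trained on a bootstrap sample of the rows
    max_features="sqrt",  # 2. each split considers only a random subset of the columns
    random_state=42,
)
rf.fit(X, y)
```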


For those interested in exploring said paper, I’ve opened up a thread here:

https://forums.fast.ai/t/deep-learning-for-tabular-data-an-exploratory-study-by-jan-andre-marais/69938


Thanks for the lesson Rachel, Jeremy, and team! As Kaggle seems not to have many open tabular-data competitions at present, I’m sharing the SIOP ML Competition as another place to find opportunities to put chapter 9 through its paces!


Thanks FraPochetti, I really liked your post!


Thanks @FraPochetti


Entity embedding means replacing categorical columns with learnable embeddings.
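
In PyTorch terms, that’s roughly this (a hedged sketch; the column and the sizes are made up):

```python
import torch
import torch.nn as nn

n_categories = 7   # e.g. 7 distinct values in the categorical column
emb_dim = 4        # size of the learned vector per category

emb = nn.Embedding(n_categories, emb_dim)

# A batch of raw integer category codes (0..6) ...
codes = torch.tensor([0, 3, 6])
# ... becomes a batch of dense, trainable 4-d vectors.
vectors = emb(codes)   # shape: (3, 4)
```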


In the notebook, the categorical columns were already replaced by numerical values for the random forest, and we used the same values for the NN. How do we get these values?

Also, for more on that and where it came from, @rachel wrote a wonderful article: https://www.fast.ai/2018/04/29/categorical-embeddings/

TL;DR: it took inspiration from Word2Vec.


They’re integer codes based on the cardinality of that variable. An easy example: say we have Jack, Queen, King, and Ace. We can’t simply pass those in, as they’re not numbers! Instead we represent each separate option as an integer, so Jack = 0, Queen = 1, etc. This scales up the same way for higher cardinalities and is the main method for feeding categorical data to a model.
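
A minimal pandas sketch of that integer coding (the column is hypothetical; fastai’s Categorify proc does essentially this under the hood):

```python
import pandas as pd

df = pd.DataFrame({"card": ["Jack", "Queen", "King", "Ace", "Queen"]})
df["card"] = pd.Categorical(df["card"], categories=["Jack", "Queen", "King", "Ace"])
df["card_code"] = df["card"].cat.codes   # Jack=0, Queen=1, King=2, Ace=3
print(df)
```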

The tabular_learner internally inserts an embedding layer for each column you specified as a categorical column.
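
For example, something along the lines of the course notebook (a sketch assuming the ADULT_SAMPLE dataset; the column choices here are illustrative):

```python
from fastai.tabular.all import *

path = untar_data(URLs.ADULT_SAMPLE)
dls = TabularDataLoaders.from_csv(
    path/'adult.csv', path=path, y_names='salary',
    cat_names=['workclass', 'education', 'occupation'],  # each gets an embedding layer
    cont_names=['age', 'fnlwgt'],                        # passed through as floats
    procs=[Categorify, FillMissing, Normalize])

learn = tabular_learner(dls, metrics=accuracy)
learn.model.embeds   # inspect: one nn.Embedding per categorical column
```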

Thanks @muellerzr and @vijayabhaskar


Great lesson! Thanks and kudos to the whole fast.ai team

Thanks for the lecture! This is probably one of the most practically useful lessons for me, and I’m guessing for a lot of others who work regularly with tabular data.


Is there a way to get the tabular learner to take a sliding window over the rows of the dataset?
You could use that for predicting a time series, like machine failure: https://www.kaggle.com/c/machine-failure-prediction/data

You can simply use what I call a “time step” and pass in those previous rows as input. I did some work with this on movement identification, with very good results. I.e., if we have a window of 3 and 8 variables, one row becomes 24 variables. You’d probably need to rearrange the table, but it does work :slight_smile:
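
A hypothetical pandas sketch of that rearrangement, with a window of 3 over two made-up sensor columns:

```python
import pandas as pd

df = pd.DataFrame({"sensor_a": [1, 2, 3, 4, 5],
                   "sensor_b": [10, 20, 30, 40, 50]})

window = 3
# Each shifted copy contributes the values from i rows earlier
parts = [df.shift(i).add_suffix(f"_t-{i}") for i in range(window)]
wide = pd.concat(parts, axis=1).dropna()   # drop rows without a full window
print(wide)  # 2 columns * window of 3 = 6 feature columns per row
```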


Another option may be to use a 1d convolutional neural network to learn the most relevant filters (i.e., sliding windows). Since CNNs often reduce to dense layers at the end, you could even concatenate activations from the time-series CNN model with activations from the tabular model of the other metadata, or fashion it as a Siamese network.
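
A hedged PyTorch sketch of that idea; every name and size here is hypothetical:

```python
import torch
import torch.nn as nn

class SeriesPlusTabular(nn.Module):
    def __init__(self, n_tabular, seq_channels=1, n_out=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(seq_channels, 16, kernel_size=5, padding=2),  # learned "sliding windows"
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
            nn.Flatten())
        self.tab = nn.Sequential(nn.Linear(n_tabular, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, n_out)

    def forward(self, series, tabular):
        # Concatenate CNN activations with the tabular activations
        x = torch.cat([self.cnn(series), self.tab(tabular)], dim=1)
        return self.head(x)

model = SeriesPlusTabular(n_tabular=8)
out = model(torch.randn(4, 1, 50), torch.randn(4, 8))  # batch of 4 -> shape (4, 2)
```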
