Jeremy’s final words during today’s course: Next week NLP & Computer Vision
Thanks Jeremy and Rachel
Thanks for another great lecture! Interesting to learn more about some “traditional” ML techniques like random forests!
Hahaha - yes I have
Excited it’s coming next week
Thanks Jeremy, Rachel & Sylvain.
Can someone please point to what entity embedding mean from the last bit of the lesson? Is it the one hot encoding that we discussed or something else?
Thanks, see you all next week!
The enhancements that make Random Forest such a powerful Decision Tree model are:
- Bootstrap sampling, which is selecting a random subset of the data (i.e. rows in the data table) to construct each decision tree,
- Selecting a random subset of the features (i.e. columns in the data table) to make a ‘split’ at each ‘node’ in a decision tree, and
- Ensembling, which is constructing a group of models (in this case, a ‘forest’ of trees) and averaging their votes to make each final classification.

The first two of these enhancements are analogous in their application and their effect to the Dropout technique in Neural Networks.
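In scikit-learn terms (just a rough sketch, not the lesson notebook), those three enhancements map onto RandomForestClassifier arguments roughly like this:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,    # ensembling: average the votes of 100 trees
    bootstrap=True,      # bootstrap sampling: each tree is built on a random sample of rows
    max_features='sqrt', # a random subset of features is considered at each split
)
# rf.fit(X_train, y_train)  # X_train / y_train stand in for whatever tabular data you have
```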
For those interested in exploring said paper, I’ve opened up a thread here:
Thanks for the lesson Rachel, Jeremy, and team! As Kaggle doesn't seem to have many open competitions using tabular data at present, I'm sharing the SIOP ML Competition as another place to find opportunities for putting chapter 9 through its paces!
Thanks FraPochetti, I really liked your post!
Thanks @FraPochetti
Entity embedding is replacing categorical columns with learnable embeddings.
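Concretely (a tiny PyTorch sketch of the idea, not the lesson code): a categorical column with, say, 4 levels stops being a one-hot vector and becomes a row in a small learnable lookup table:

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=4, embedding_dim=2)  # 4 levels -> 2-d vectors
codes = torch.tensor([0, 3, 1])  # three rows of the column, already as integer codes
print(emb(codes))                # the looked-up vectors; their values are learned during training
```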
In the notebook, the categorical columns were already replaced by numerical values when doing the random forest, and we used the same for the NN. How do we get these values?
Also, for more on that and where it came from, @rachel wrote a wonderful article: Redirect
TL;DR? Took inspiration from Word2Vec
They’re tied to the cardinality of that particular variable. An easy example: say we have Jack, Queen, King, and Ace. We can’t simply pass those in as they’re not numbers! Instead, we represent each separate option as an integer, so we could say Jack = 0, Queen = 1, etc. This scales up the same way for any number of levels, and it’s the main method for utilizing categorical data.
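If it helps, this is roughly what pandas (and fastai's Categorify proc) does under the hood; the exact codes depend on how the categories get ordered, so treat this as an illustration only:

```python
import pandas as pd

ranks = pd.Series(['Jack', 'Queen', 'King', 'Ace'], dtype='category')
print(dict(enumerate(ranks.cat.categories)))  # {0: 'Ace', 1: 'Jack', 2: 'King', 3: 'Queen'}
print(ranks.cat.codes.tolist())               # [1, 3, 2, 0] -- the integers the model actually sees
```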
The tabular_learner internally inserts an embedding layer for each of the columns you specified as categorical.
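For example (a minimal sketch with made-up column names, not the exact notebook code):

```python
from fastai.tabular.all import *

# df is assumed to be a pandas DataFrame with a categorical 'card_rank' column,
# a continuous 'bet_size' column, and a 'target' column to predict.
dls = TabularDataLoaders.from_df(
    df,
    cat_names=['card_rank'],               # these columns get embedding layers
    cont_names=['bet_size'],
    y_names='target',
    y_block=CategoryBlock,
    procs=[Categorify, FillMissing, Normalize],
)
learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy)
learn.model  # prints the model; you can see an Embedding layer per categorical column
```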
Thanks @muellerzr and @vijayabhaskar
Great lesson! Thanks and kudos to all the fast.ai team
Thanks for the lecture! This is probably one of the most practically useful lessons for myself, and I’m guessing a lot of others who work regularly with tabular data.