Lesson 7 - Official topic

I think the number of leaves is determined by the number of trees and the maximum depth of each tree, both of which are hyper-parameters to be tuned. Deciding where nodes are split is done by minimizing Gini impurity (or, equivalently, maximizing information gain), and a leaf is also an instance of a node (just a terminal node in that tree, I think).
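To make that concrete, here is a minimal sklearn sketch (toy iris data, purely illustrative): `max_depth` caps how many leaves each tree can have, `n_estimators` sets the number of trees, and Gini impurity is just 1 minus the sum of squared class proportions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_depth bounds the leaves per tree (at most 2**max_depth here),
# and n_estimators sets the number of trees; both are hyper-parameters.
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(X, y)

# Each fitted tree exposes its actual leaf count.
print([t.get_n_leaves() for t in rf.estimators_])

def gini(labels):
    """Gini impurity of a set of labels: 1 - sum(p_k ** 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

print(gini(y))  # impurity of the full, unsplit dataset
```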

What is the best way to identify bias in tabular data, and are there any suggested de-biasing techniques?

1 Like

They do.
XGBoost has this functionality.
CatBoost is my favourite hands down though. It features very advanced tools around the issue you mention.
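For example, a minimal CatBoost sketch (the toy DataFrame and column names are made up, and I am assuming the issue in question is native handling of categorical features):

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Toy data (hypothetical): CatBoost consumes string categoricals
# directly, with no manual one-hot or label encoding needed.
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],
    'size':  [1.0, 2.5, 0.7, 1.8],
    'label': [0, 1, 0, 1],
})

train_pool = Pool(df[['color', 'size']], df['label'], cat_features=['color'])

model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(train_pool)
print(model.predict(train_pool))
```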

10 Likes

Check out AIF360 from IBM.
Here is the online tool demo.
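For instance, a minimal AIF360 sketch (the toy table, column names, and group definitions below are all hypothetical):

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy table: 'sex' is the protected attribute, 'hired' the binary label.
df = pd.DataFrame({
    'sex':   [0, 0, 1, 1, 0, 1],
    'score': [0.2, 0.8, 0.5, 0.9, 0.4, 0.7],
    'hired': [0, 1, 0, 1, 0, 1],
})

ds = BinaryLabelDataset(df=df, label_names=['hired'],
                        protected_attribute_names=['sex'])

priv, unpriv = [{'sex': 1}], [{'sex': 0}]

# Identify bias: disparate impact well below 1 suggests the
# unprivileged group receives favorable outcomes less often.
metric = BinaryLabelDatasetMetric(ds, privileged_groups=priv,
                                  unprivileged_groups=unpriv)
print(metric.disparate_impact())

# One de-biasing option: reweigh examples before training a model.
ds_transf = Reweighing(unprivileged_groups=unpriv,
                       privileged_groups=priv).fit_transform(ds)
```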

2 Likes

I did, and it didn't work… I am also looking at the link suggested by @hiromi. Thank you both.

1 Like

Has anyone applied bagging/boosting to neural nets with good results?

When we do K-fold cross-validation and use the K-fold ensemble for prediction, is that a form of bagging?
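To make the comparison concrete, here is a rough sklearn sketch of the K-fold-ensemble idea (toy data; `MLPClassifier` stands in for a real DNN). Like bagging, each model sees a different subset of the data, though the subsets come from deterministic fold splits rather than bootstrap resampling with replacement:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_new = X[:5]  # pretend these are unseen test points

# Train one small net per fold (each sees the data minus one held-out
# fold), then average their predicted probabilities at test time.
probs = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
    net.fit(X[train_idx], y[train_idx])
    probs.append(net.predict_proba(X_new))

print(np.mean(probs, axis=0).argmax(axis=1))  # averaged ensemble prediction
```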

1 Like

Is the concept of creating minibatches in a DNN and training an epoch on them analogous to creating a bagging-based model, where random samples are created and trained on?

1 Like

Wouldn't selecting random data eventually end up selecting all of the data for training, and therefore cause a Random Forest to overfit?

Again regarding CatBoost, this talk by Anna Veronika Dorogush (lead engineer at Yandex) is quite outstanding and introduces all the goodies of the package (avoiding overfitting included :smiley:)

8 Likes

Weird. It did work for me. Keep us posted.

2 Likes

See AutoGluon-Tabular by Amazon: https://arxiv.org/abs/2003.06505
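A minimal usage sketch, assuming the current `autogluon.tabular` API and hypothetical `train.csv`/`test.csv` files with a `class` target column:

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Point at your own CSVs; AutoGluon infers feature types automatically.
train = TabularDataset('train.csv')
predictor = TabularPredictor(label='class').fit(train)

test = TabularDataset('test.csv')
print(predictor.leaderboard(test))  # per-model scores on the test set
```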

Any similar resources for LightGBM?

Not really, as you would not select all of it at the same time: each tree only ever sees a random sample of the rows.
Also, you generally select random features in addition to random rows.
This helps keep the individual trees you are building uncorrelated, which is what drives the ensemble error down.
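A quick numpy sketch of those two sources of randomness (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_feats = 1000, 20

# Bootstrap sample: n rows drawn WITH replacement. On average only
# ~63% of the original rows appear, so no tree sees all the data.
row_idx = rng.integers(0, n_rows, size=n_rows)
print(len(np.unique(row_idx)) / n_rows)  # roughly 0.63

# Random feature subset per split (e.g. sqrt(n_features) of them),
# which further decorrelates the trees in the forest.
feat_idx = rng.choice(n_feats, size=int(np.sqrt(n_feats)), replace=False)
print(feat_idx)
```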

1 Like

That makes sense. Clever!

Again I see a similarity with K-fold CV: OOB error seems similar to out-of-fold error.

1 Like

It is indeed!
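For instance, with sklearn you can read off both estimates side by side (toy data, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# OOB score: each tree is scored on the rows its bootstrap sample
# left out, so no separate validation split is needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)

# Out-of-fold estimate from 5-fold CV, for comparison.
print(cross_val_score(rf, X, y, cv=5).mean())
```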

Mario,

Look here:

Pay specific attention to the parts under API credentials. You need to open a terminal session in Jupyter and run those export commands to create the environment variables the Kaggle package requires.
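If you would rather stay inside the notebook, setting the same environment variables from Python should work too (the values below are placeholders for your own credentials):

```python
import os

# Same effect as the terminal `export` commands; replace the values
# with the credentials from your kaggle.json API token.
os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_api_key'

import kaggle  # the package authenticates using the variables above
```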

Happy hunting!

2 Likes

What is the best method to understand and measure uncertainty in predictions?
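One simple option for tree ensembles, sketched below with sklearn (toy data, illustrative only), is to use the spread of the individual trees' predictions as a rough uncertainty signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rows where the trees disagree get a larger standard deviation,
# flagging predictions the forest is less sure about.
per_tree = np.stack([t.predict(X[:5]) for t in rf.estimators_])
print(per_tree.mean(axis=0))  # ensemble prediction
print(per_tree.std(axis=0))   # disagreement across trees
```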

1 Like

Thank you @JPKab… and indeed happy hunting :slight_smile: