Lesson 7 - Official topic

I think the number of leaves is determined by the number of trees and the maximum depth of each tree, both of which are hyper-parameters to be tuned. Deciding where nodes are split is done by minimizing Gini impurity (or, equivalently, maximizing information gain), and a leaf is also an instance of a node (just a terminal node in that tree, I think).
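To make that concrete, here is a minimal sklearn sketch (toy iris data, purely illustrative): `max_depth` caps how many leaves each tree can have, `n_estimators` sets the number of trees, and Gini impurity is just 1 minus the sum of squared class proportions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_depth bounds the leaves per tree (at most 2**max_depth here),
# and n_estimators sets the number of trees; both are hyper-parameters.
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(X, y)

# Each fitted tree exposes its actual leaf count.
print([t.get_n_leaves() for t in rf.estimators_])

def gini(labels):
    """Gini impurity of a set of labels: 1 - sum(p_k ** 2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

print(gini(y))  # impurity of the full, unsplit dataset
```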

What is the best way to identify bias in tabular data, and are there any suggested de-biasing techniques?

1 Like

They do.
XGBoost has this functionality.
CatBoost is my favourite hands down though. It features very advanced tools around the issue you mention.
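For example, a minimal CatBoost sketch (the toy DataFrame and column names are made up, and I am assuming the issue in question is native handling of categorical features):

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Toy data (hypothetical): CatBoost consumes string categoricals
# directly, with no manual one-hot or label encoding needed.
df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green'],
    'size':  [1.0, 2.5, 0.7, 1.8],
    'label': [0, 1, 0, 1],
})

train_pool = Pool(df[['color', 'size']], df['label'], cat_features=['color'])

model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(train_pool)
print(model.predict(train_pool))
```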

10 Likes

Check out AIF360 from IBM.
Here is the online tool demo.
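For instance, a minimal AIF360 sketch (the toy table, column names, and group definitions below are all hypothetical):

```python
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing

# Toy table: 'sex' is the protected attribute, 'hired' the binary label.
df = pd.DataFrame({
    'sex':   [0, 0, 1, 1, 0, 1],
    'score': [0.2, 0.8, 0.5, 0.9, 0.4, 0.7],
    'hired': [0, 1, 0, 1, 0, 1],
})

ds = BinaryLabelDataset(df=df, label_names=['hired'],
                        protected_attribute_names=['sex'])

priv, unpriv = [{'sex': 1}], [{'sex': 0}]

# Identify bias: disparate impact well below 1 suggests the
# unprivileged group receives favorable outcomes less often.
metric = BinaryLabelDatasetMetric(ds, privileged_groups=priv,
                                  unprivileged_groups=unpriv)
print(metric.disparate_impact())

# One de-biasing option: reweigh examples before training a model.
ds_transf = Reweighing(unprivileged_groups=unpriv,
                       privileged_groups=priv).fit_transform(ds)
```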

2 Likes

I did, and it didn't work… I am also looking at the link suggested by @hiromi. Thank you both.

1 Like

Has anyone applied bagging/boosting to neural nets with good results?

When we do K-fold cross-validation and use the K-fold ensemble for prediction, is that a form of bagging?
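To make the comparison concrete, here is a rough sklearn sketch of the K-fold-ensemble idea (toy data; `MLPClassifier` stands in for a real DNN). Like bagging, each model sees a different subset of the data, though the subsets come from deterministic fold splits rather than bootstrap resampling with replacement:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_new = X[:5]  # pretend these are unseen test points

# Train one small net per fold (each sees the data minus one held-out
# fold), then average their predicted probabilities at test time.
probs = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
    net.fit(X[train_idx], y[train_idx])
    probs.append(net.predict_proba(X_new))

print(np.mean(probs, axis=0).argmax(axis=1))  # averaged ensemble prediction
```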

1 Like

Is the concept of creating minibatches in a DNN and training an epoch on them analogous to creating a bagging-based model, where random samples are created and trained on?

1 Like

Wouldn't selecting random data eventually end up selecting all of the data for training, and therefore cause a Random Forest to overfit?

Again regarding CatBoost, this talk by Anna Veronika Dorogush (lead engineer at Yandex) is quite outstanding and introduces all the goodies of the package (avoiding overfitting included :smiley:)

8 Likes

Weird. It did work for me. Keep us posted.

2 Likes

See AutoGluon-Tabular by Amazon: https://arxiv.org/abs/2003.06505
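A minimal usage sketch, assuming the current `autogluon.tabular` API and hypothetical `train.csv`/`test.csv` files with a `class` target column:

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Point at your own CSVs; AutoGluon infers feature types automatically.
train = TabularDataset('train.csv')
predictor = TabularPredictor(label='class').fit(train)

test = TabularDataset('test.csv')
print(predictor.leaderboard(test))  # per-model scores on the test set
```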

Any similar resources for LightGBM?

Not really, as you would not select all of it at the same time: each tree only ever sees a random sample of the rows.
Also, you generally select random features in addition to random rows.
This helps keep the individual trees you are building uncorrelated, which is what drives the ensemble error down.
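A quick numpy sketch of those two sources of randomness (the numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_feats = 1000, 20

# Bootstrap sample: n rows drawn WITH replacement. On average only
# ~63% of the original rows appear, so no tree sees all the data.
row_idx = rng.integers(0, n_rows, size=n_rows)
print(len(np.unique(row_idx)) / n_rows)  # roughly 0.63

# Random feature subset per split (e.g. sqrt(n_features) of them),
# which further decorrelates the trees in the forest.
feat_idx = rng.choice(n_feats, size=int(np.sqrt(n_feats)), replace=False)
print(feat_idx)
```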

1 Like

That makes sense. Clever!

Again I see a similarity with K-fold CV: OOB error seems similar to out-of-fold error.

1 Like

It is indeed!
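For instance, with sklearn you can read off both estimates side by side (toy data, purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# OOB score: each tree is scored on the rows its bootstrap sample
# left out, so no separate validation split is needed.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)

# Out-of-fold estimate from 5-fold CV, for comparison.
print(cross_val_score(rf, X, y, cv=5).mean())
```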

Mario,

Look here:

Pay specific attention to the parts under API credentials. You need to open a terminal session in Jupyter and run those export commands to create the environment variables the Kaggle package requires.
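If you would rather stay inside the notebook, setting the same environment variables from Python should work too (the values below are placeholders for your own credentials):

```python
import os

# Same effect as the terminal `export` commands; replace the values
# with the credentials from your kaggle.json API token.
os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_api_key'

import kaggle  # the package authenticates using the variables above
```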

Happy hunting!

2 Likes

What is the best method to understand and measure uncertainty in predictions?
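One simple option for tree ensembles, sketched below with sklearn (toy data, illustrative only), is to use the spread of the individual trees' predictions as a rough uncertainty signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rows where the trees disagree get a larger standard deviation,
# flagging predictions the forest is less sure about.
per_tree = np.stack([t.predict(X[:5]) for t in rf.estimators_])
print(per_tree.mean(axis=0))  # ensemble prediction
print(per_tree.std(axis=0))   # disagreement across trees
```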

1 Like

Thank you @JPKab… and indeed happy hunting :slight_smile: