Papers Mentioned During Class

I am creating this topic to collect papers (links) that were mentioned during the ML1 course. For example, if anyone finds the paper about bagging, please add it in a reply. Thanks!


Just leaving this here as an external source of info :slight_smile:

https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/

It’s not a link suggested during class, but I guess we could use this thread for sharing general useful links.

-> So it says that sklearn's random forest does not give good results with categorical variables, and that they have to be one-hot encoded, which I think is not 100% right. Like we did in class, can't we just treat them as label-encoded integers (and order them if it makes sense)?

The article shows a huge difference in scores when the dependent variable depends on some categorical independent variable: H2O's tree implementation (which handles categoricals natively) outperforms sklearn's.

Is this true? Any comments?
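Not from the article, but here is a minimal sketch (on made-up data, with invented column names) of the two encodings being debated, so you can check the difference with sklearn's random forest yourself:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data: target depends strongly on a high-cardinality categorical column.
rng = np.random.RandomState(0)
n = 2000
cat = rng.randint(0, 50, n)                    # 50 categories
y = (cat % 7) + rng.normal(0, 0.1, n)          # signal tied to the category
X = pd.DataFrame({"cat": cat, "noise": rng.normal(size=n)})

rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Option 1: treat the categorical codes as ordered integers (what we did in class).
score_label = cross_val_score(rf, X, y, cv=5).mean()

# Option 2: one-hot encode the categorical column (what the article recommends).
X_onehot = pd.get_dummies(X, columns=["cat"])
score_onehot = cross_val_score(rf, X_onehot, y, cv=5).mean()

print(f"label-encoded R^2:   {score_label:.3f}")
print(f"one-hot encoded R^2: {score_onehot:.3f}")
```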

PDP and ICE plots
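In case anyone wants to play with these: here is a minimal sketch of PDP and ICE plots using scikit-learn's PartialDependenceDisplay. It assumes a fairly recent sklearn version (roughly 1.0+), and the dataset and feature choice are only for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# Fit a random forest on the California housing data (any fitted model works).
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# kind="both" overlays ICE curves (one per sampled row) on the averaged PDP line.
PartialDependenceDisplay.from_estimator(
    rf, X, features=["MedInc"], kind="both", subsample=100, random_state=0
)
plt.show()
```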

Nice article on ensembling and why it works
https://mlwave.com/kaggle-ensembling-guide/#comment-306675
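The simplest version of what the guide describes is just averaging predictions from a few different models. A rough sketch (the models and toy data are arbitrary, only to show the idea):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

# Average the predicted probabilities of the (hopefully not too correlated) models.
probas = [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models]
for m, p in zip(models, probas):
    print(f"{m.__class__.__name__:<28} AUC = {roc_auc_score(y_te, p):.4f}")
print(f"{'Averaged ensemble':<28} AUC = {roc_auc_score(y_te, np.mean(probas, axis=0)):.4f}")
```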

Hi guys,
This paper was not mentioned in class, but I think it is very helpful when it comes to understanding how a single decision tree, i.e., a building block of a random forest, works. Since the paper is about DNA and proteins, here is some biology background you may need while reading it:

DNA: double-stranded 'ATCG' code that encodes your proteins
Protein-coding region: region of DNA that leads to proteins
Non-coding region: region of DNA that does not lead to proteins

Hope you all find this helpful!

decision_tree_gene.pdf (1.6 MB)
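On the same theme, here is a tiny, self-contained sketch of fitting a single decision tree in sklearn and printing the split rules it learns (the iris data is just a stand-in, nothing to do with the paper):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit one shallow tree and print the learned split rules as text.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
```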