Papers Mentioned During Class

I am creating this topic to collect papers (links) that were mentioned during the ML1 course. For example, if anyone finds the paper about bagging, please add it in a reply. Thanks!


Just leaving this here as an external source of info :slight_smile:

https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/

It’s not a link suggested during class, but I guess we could use this thread for sharing general useful links.

-> So it says that sklearn's random forest does not give good results with categorical variables, and that they have to be one-hot encoded, which I think is not 100% right. Like we did in class, can't we just treat them as label-encoded integers (and order them if it makes sense)?

The article shows a huge difference in scores when the dependent variable depends on some categorical independent variable: H2O's tree implementation (which handles categoricals natively) outperforms sklearn's.

Is this true? Any comments?
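Not from the article, but here is a minimal sketch (on made-up data, with invented column names) of the two encodings being debated, so you can check the difference with sklearn's random forest yourself:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data: target depends strongly on a high-cardinality categorical column.
rng = np.random.RandomState(0)
n = 2000
cat = rng.randint(0, 50, n)                    # 50 categories
y = (cat % 7) + rng.normal(0, 0.1, n)          # signal tied to the category
X = pd.DataFrame({"cat": cat, "noise": rng.normal(size=n)})

rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Option 1: treat the categorical codes as ordered integers (what we did in class).
score_label = cross_val_score(rf, X, y, cv=5).mean()

# Option 2: one-hot encode the categorical column (what the article recommends).
X_onehot = pd.get_dummies(X, columns=["cat"])
score_onehot = cross_val_score(rf, X_onehot, y, cv=5).mean()

print(f"label-encoded R^2:   {score_label:.3f}")
print(f"one-hot encoded R^2: {score_onehot:.3f}")
```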

PDP and ICE plots
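In case anyone wants to play with these: here is a minimal sketch of PDP and ICE plots using scikit-learn's PartialDependenceDisplay. It assumes a fairly recent sklearn version (roughly 1.0+), and the dataset and feature choice are only for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# Fit a random forest on the California housing data (any fitted model works).
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# kind="both" overlays ICE curves (one per sampled row) on the averaged PDP line.
PartialDependenceDisplay.from_estimator(
    rf, X, features=["MedInc"], kind="both", subsample=100, random_state=0
)
plt.show()
```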

Nice article on ensembling and why it works
https://mlwave.com/kaggle-ensembling-guide/#comment-306675
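The simplest version of what the guide describes is just averaging predictions from a few different models. A rough sketch (the models and toy data are arbitrary, only to show the idea):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    GradientBoostingClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]

# Average the predicted probabilities of the (hopefully not too correlated) models.
probas = [m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models]
for m, p in zip(models, probas):
    print(f"{m.__class__.__name__:<28} AUC = {roc_auc_score(y_te, p):.4f}")
print(f"{'Averaged ensemble':<28} AUC = {roc_auc_score(y_te, np.mean(probas, axis=0)):.4f}")
```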

Hi guys,
This paper was not mentioned in class, but I think it is very helpful when it comes to understanding how a single decision tree, i.e., a building block of a random forest, works. Since the paper is about DNA and proteins, here is some biology background you may need while reading it:

DNA: double-stranded 'ATCG' code that encodes your proteins
Protein-coding region: region of DNA that leads to proteins
Non-coding region: region of DNA that does not lead to proteins

Hope you all find this helpful!

decision_tree_gene.pdf (1.6 MB)
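On the same theme, here is a tiny, self-contained sketch of fitting a single decision tree in sklearn and printing the split rules it learns (the iris data is just a stand-in, nothing to do with the paper):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit one shallow tree and print the learned split rules as text.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
```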