I came across the GitHub issue below. Apparently the scikit-learn docs did not properly describe the min_samples_leaf param in the DecisionTree... models and, by extension, the
RandomForest... models. Andreas points out that it does not do what everyone thinks it does. The param does not stop tree growth once a leaf node has n or fewer samples (Lesson 2, around 1:22:27); rather, it moves the split point so that every leaf node contains at least n points. As such, it is a kind of smoothing of the fit, not pruning! The behaviour described in the lesson, and the intuition it builds, are not consistent with the implementation in the scikit-learn code. This documentation change commit message is pretty strong: “document that
min_weight_fraction_leaf are useless”.
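The minimum-leaf-size behaviour is easy to check empirically. Below is a small sketch (assuming scikit-learn and NumPy are installed) that fits a tree with min_samples_leaf=10 and inspects the fitted tree_ structure: every leaf ends up with at least 10 samples, because candidate splits that would leave fewer than 10 samples on either side are simply never considered.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)

t = tree.tree_
is_leaf = t.children_left == -1      # leaves have no children
leaf_sizes = t.n_node_samples[is_leaf]

# Every leaf holds at least min_samples_leaf points: the split points
# were shifted to respect the constraint, not growth stopped early.
print(leaf_sizes.min())
```

Note that the smallest leaf can be larger than 10; the constraint only sets a floor, which is exactly the “smoothing, not pruning” effect described above.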
In the spirit of the fastai philosophy of using what works and not getting too hung up on internals and statistical assumptions, a fair question is “Does this really matter since the models described by @jeremy in this class work well on train/validation/test?” Regardless of the answer to that question, it doesn’t seem like good form to rely on a parameter that is deprecated and will be removed in the future. The
min_samples_split param remains, and the issue thread implies that its implementation is “correct” and faithful to the literature.
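For contrast, here is a hedged sketch of min_samples_split, which behaves like the stopping rule people expect: a node with fewer than n samples is never split, so every internal (split) node in the fitted tree holds at least n samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same toy data as one might use above: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(min_samples_split=40, random_state=0).fit(X, y)

t = tree.tree_
is_internal = t.children_left != -1  # nodes that were actually split

# No node with fewer than 40 samples was ever split.
print(t.n_node_samples[is_internal].min())
```

The two params make different guarantees: min_samples_split bounds the size of nodes that get split, while min_samples_leaf bounds the size of the leaves those splits produce.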