I came across the GitHub issue below. Apparently the scikit-learn docs did not properly describe the min_samples_leaf param in the DecisionTree... models and, by extension, the
RandomForest... models. Andreas points out that it does not do what everyone thinks it does. The param does not stop tree growth once a leaf node has n or fewer samples (Lesson 2, around 1:22:27); rather, it moves the split point so that every leaf node contains at least n points. As such, it is a kind of smoothing of the fit, not pruning! The behaviour described in the lesson, and the intuition it builds, are not consistent with the implementation in the scikit-learn code. This documentation change commit message is pretty strong: “document that
min_weight_fraction_leaf are useless”.
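The minimum-leaf-size behaviour is easy to check empirically. Below is a small sketch (assuming scikit-learn and NumPy are installed) that fits a tree with min_samples_leaf=10 and inspects the fitted tree_ structure: every leaf ends up with at least 10 samples, because candidate splits that would leave fewer than 10 samples on either side are simply never considered.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)

t = tree.tree_
is_leaf = t.children_left == -1      # leaves have no children
leaf_sizes = t.n_node_samples[is_leaf]

# Every leaf holds at least min_samples_leaf points: the split points
# were shifted to respect the constraint, not growth stopped early.
print(leaf_sizes.min())
```

Note that the smallest leaf can be larger than 10; the constraint only sets a floor, which is exactly the “smoothing, not pruning” effect described above.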
In the spirit of the fastai philosophy of using what works and not getting too hung up on internals and statistical assumptions, a fair question is “Does this really matter since the models described by @jeremy in this class work well on train/validation/test?” Regardless of the answer to that question, it doesn’t seem like good form to rely on a parameter that is deprecated and will be removed in the future. The
min_samples_split param remains, and the issue thread implies that its implementation is “correct” and faithful to the literature.
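For contrast, here is a hedged sketch of min_samples_split, which behaves like the stopping rule people expect: a node with fewer than n samples is never split, so every internal (split) node in the fitted tree holds at least n samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same toy data as one might use above: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(min_samples_split=40, random_state=0).fit(X, y)

t = tree.tree_
is_internal = t.children_left != -1  # nodes that were actually split

# No node with fewer than 40 samples was ever split.
print(t.n_node_samples[is_internal].min())
```

The two params make different guarantees: min_samples_split bounds the size of nodes that get split, while min_samples_leaf bounds the size of the leaves those splits produce.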