I came across the GitHub issue below. Apparently the scikit-learn docs did not properly describe the min_samples_leaf param in the DecisionTree... models and, by extension, the
RandomForest... models. Andreas points out that it does not do what everyone thinks it does. The param does not stop tree growth once a leaf node has n or fewer samples (Lesson 2, around 1:22:27); rather, it moves the split point so that every leaf node contains at least n points. As such, it is a kind of smoothing of the fit, not pruning! The behaviour described in the lesson, and the intuition it builds, are not consistent with the implementation in the scikit-learn code. This documentation change commit message is pretty strong: “document that
min_weight_fraction_leaf are useless”.
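The minimum-leaf-size behaviour is easy to check empirically. Below is a small sketch (assuming scikit-learn and NumPy are installed) that fits a tree with min_samples_leaf=10 and inspects the fitted tree_ structure: every leaf ends up with at least 10 samples, because candidate splits that would leave fewer than 10 samples on either side are simply never considered.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)

t = tree.tree_
is_leaf = t.children_left == -1      # leaves have no children
leaf_sizes = t.n_node_samples[is_leaf]

# Every leaf holds at least min_samples_leaf points: the split points
# were shifted to respect the constraint, not growth stopped early.
print(leaf_sizes.min())
```

Note that the smallest leaf can be larger than 10; the constraint only sets a floor, which is exactly the “smoothing, not pruning” effect described above.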
In the spirit of the fastai philosophy of using what works and not getting too hung up on internals and statistical assumptions, a fair question is “Does this really matter since the models described by @jeremy in this class work well on train/validation/test?” Regardless of the answer to that question, it doesn’t seem like good form to rely on a parameter that is deprecated and will be removed in the future. The
min_samples_split param remains, and the issue thread implies that its implementation is “correct” and faithful to the literature.
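For contrast, here is a hedged sketch of min_samples_split, which behaves like the stopping rule people expect: a node with fewer than n samples is never split, so every internal (split) node in the fitted tree holds at least n samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same toy data as one might use above: noisy sine curve.
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(min_samples_split=40, random_state=0).fit(X, y)

t = tree.tree_
is_internal = t.children_left != -1  # nodes that were actually split

# No node with fewer than 40 samples was ever split.
print(t.n_node_samples[is_internal].min())
```

The two params make different guarantees: min_samples_split bounds the size of nodes that get split, while min_samples_leaf bounds the size of the leaves those splits produce.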