That works for me - great article! I submitted a pull request to fix some small typos - I think it would be good to play about with the styling a little bit so the text is centered in the page and the tables are formatted a bit more clearly. I’m not familiar with fastpages but I’m sure there’ll be a way to configure that somewhere in the setup!
Thanks! I merged your pull request.
I agree with you on the formatting issues. Internally, fastpages uses Jekyll, which isn’t rendering well here on Git. In the markdown it looks fine: the tables have borders and the other formatting is intact, but once it is moved to Git it shows up in raw format.
Hello! I have a question about the _side_score function below:
def _side_score(side, y):
    tot = side.sum()
    if tot<=1: return 0
    return y[side].std()*tot
I am struggling to understand why multiplying the standard deviation by “tot” matters and why we are penalising the side_score if a lot of elements end up in one particular split.
I see that we of course normalise by dividing the final score by the length of the dataset, but why did we choose to multiply in the first place?
At 19:30 in regards to bagging and random forests, Jeremy says creating subsets by grabbing a random 50% will create uncorrelated models, but there must be a limit to this, right? At some point I’ll have grabbed all of the possible subsets and any additional subset I grab will be a copy of a previous subset, and thus have a correlation of 1 with another subset.
This is probably unlikely for any reasonably sized dataset, but I assume the intuition still holds if I by chance grab a subset where only one element is different. Slightly less for 2, 3, or 4 elements, and even less for 50, 100, etc. What does reasonably uncorrelated look like in practice? How big does the dataset have to be for the models to meet that bar?
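To put rough numbers on this, here is a small sketch. It counts how many distinct 50% subsets exist for a given dataset size, and it simulates how many rows two independent random 50% subsets share. The function names and the sizes used are hypothetical, and row overlap is only a rough proxy for how correlated the resulting trees' errors are.

```python
import math
import numpy as np

def n_half_subsets(n):
    # Number of distinct subsets containing exactly half of n rows
    return math.comb(n, n // 2)

def overlap_fraction(n, seed=0):
    # Draw two independent 50% subsets without replacement and
    # return the fraction of the first subset that is also in the second
    rng = np.random.default_rng(seed)
    a = rng.choice(n, n // 2, replace=False)
    b = rng.choice(n, n // 2, replace=False)
    return len(np.intersect1d(a, b)) / (n // 2)

print(n_half_subsets(10))     # small enough to exhaust
print(n_half_subsets(100))    # already astronomically large
print(overlap_fraction(1000))
```

Even with only 100 rows, an exact repeat of a previous subset is vanishingly unlikely. On the other hand, any two 50% subsets are expected to share about half of their rows, so the trees are never fully independent; bagging only needs their errors to be imperfectly correlated for averaging to help.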
Also, I now see that the get_tree function in the notebook uses random.choice(n, int(prop*n)) to get a prop-sized bucket, but it does so with replacement. Would it be better or worse to pass random.choice(n, int(prop*n), False) to get samples without replacement? In my tests on the Titanic set, I think I’m getting better results, but does making the selection with replacement make the buckets more or less correlated?
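One way to see what the replace flag changes, as a sketch: count how many distinct rows end up in a bucket under each setting. This uses NumPy's `Generator.choice` rather than the notebook's `random.choice`, and `n=1000` and `prop=0.75` are hypothetical values picked for illustration.

```python
import numpy as np

def unique_fraction(n, prop, replace, seed=0):
    # Fraction of the n rows that appear at least once in one bucket
    rng = np.random.default_rng(seed)
    idxs = rng.choice(n, int(prop * n), replace=replace)
    return len(np.unique(idxs)) / n

n, prop = 1000, 0.75
with_repl = unique_fraction(n, prop, replace=True)
without_repl = unique_fraction(n, prop, replace=False)

# With replacement the expected fraction is roughly 1 - exp(-prop)
print(with_repl, 1 - np.exp(-prop))
print(without_repl)
```

With replacement, a bucket of the same size holds fewer distinct rows, with the duplicates acting like weights. Without replacement, each tree sees more distinct rows, and any two buckets tend to share more of them. That may make the trees somewhat more correlated while also giving each tree more data, so which effect wins on a given dataset is probably an empirical question.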
If it takes you 1000 subsets to encounter that situation, then the impact of that correlation=1 subset on the result will be 0.1%. YMMV, IANAM.