Creating this thread for discussion of possible blog topics and for peer review.
I’m thinking of getting started with an idea that I had shared as a kernel. It didn’t get much traction on Kaggle, but I believe it’s worth sharing. Below is the link to the kernel. Any feedback would be appreciated.
You should definitely check out this post by Tyler. It’ll inspire you.
You can also go through some of the blogs that were discussed in the deep learning forum:
Thanks so much for sharing! It may be just me, but I found the math at the start made it hard to get into, and then I didn’t really follow how you went from there to the rest of the post. Having said that, I’m not a strong mathematician, so I may not be the best person to comment on this kernel.
I have found, however, that outside of a university environment you see a lot of code and nearly no math, so possibly there are a lot of people like me. If that’s the case, you may find a bigger audience if you spend more time explaining the context of what you’re doing, and where you use math to explain things, take us through it more gradually, explaining the relevance and meaning of each step.
Thanks for the feedback. I’m also feeling that the math isn’t intuitive and makes the kernel difficult to comprehend. I’ll keep your suggestions in mind when converting the kernel into a post.
My first blog post (although it’s not public yet). I tried to write about random forest interpretation, which we all studied in class.
It might be trivial stuff for those of us who have now learnt all this in the ML class, but I thought it might be interesting for people out there.
I have attempted to explain the logic behind everything using spreadsheet examples (not sure whether it helped or not).
Where you say that in PDP we select a few “random” rows to change gives the wrong idea, I think. Maybe it’s better to either remove the mention of random sampling from this section (it’s optional anyway), or else make it clear that you randomly sample once at the start of the algorithm, and the averages are taken over just that fixed sample (see the sketch below).
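To make that concrete, here’s a minimal sketch of how partial dependence is typically computed. All the names (`partial_dependence`, `rf`, `feature`) are illustrative, not from your post; the point is that the rows are sampled exactly once, up front:

```python
import numpy as np

def partial_dependence(model, X, feature, values, n_sample=500, seed=42):
    """Partial dependence of `model` on `feature`.

    Rows are sampled ONCE at the start; every average below is
    taken over this same fixed sample.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_sample, len(X)), replace=False)
    sample = X.iloc[idx].copy()

    averages = []
    for v in values:
        sample[feature] = v  # set the feature to v for ALL sampled rows
        averages.append(model.predict(sample).mean())
    return np.array(averages)
```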
Perhaps show a waterfall chart of the tree interpreter contributions, and link to the most excellent waterfall package GitHub repo?
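Something like this rough sketch, using the treeinterpreter and waterfallcharts packages (assuming `rf` is a fitted RandomForestRegressor and `row` is a single-row DataFrame; neither is in your post):

```python
from treeinterpreter import treeinterpreter as ti
import waterfall_chart

# Decompose one prediction into bias + per-feature contributions
prediction, bias, contributions = ti.predict(rf, row.values)

# Each bar shows how much a feature pushed the prediction up or
# down from the dataset mean (the bias term)
waterfall_chart.plot(row.columns, contributions[0])
```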
Here is my attempt to convert my kernel into a blog post. It talks about how similar or dissimilar our test and train data are, and how we can detect that. Please provide your feedback.
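For anyone following the thread, the core idea (sometimes called adversarial validation) can be sketched in a few lines: label each row by whether it came from train or test, and see how well a classifier can tell them apart. The column and variable names below are illustrative, not from the post:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

# Label each row by origin: 0 = train, 1 = test
# (assumes the target column has already been dropped from `train`
# and that the features are numeric)
train['is_test'] = 0
test['is_test'] = 1
combined = pd.concat([train, test], ignore_index=True)

X = combined.drop(columns=['is_test'])
y = combined['is_test']

# How well can a classifier tell the two sets apart?
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
probs = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:, 1]

# AUC near 0.5: train and test look alike; near 1.0: easy to separate
print('adversarial AUC:', roc_auc_score(y, probs))
```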
@shik1470 it’s looking good! The thing that most stands out to me (as you can imagine!) is a lack of credit for your sources. I rather hope that at least some inspiration came from when we covered this in class - but you haven’t cited that at all. It is in everyone’s interest to cite and link as widely as you reasonably can, since every person you credit then has an incentive to share and promote your work.
Your descriptions are very clear, but I think it would be even more powerful if you took an example all the way through - show how this technique actually works in practice to improve some outcome. E.g. what does it actually show on the claims dataset you refer to?
Finally (and this is a matter of opinion, so feel free to ignore of course) I think memes detract from the credibility of a piece of writing, for at least some audiences (in particular, older audiences).
@jeremy Thanks for the suggestions. I admit it was a mistake on my part not to give due credit, and I should have. I’ll surely do that in my next draft. I’ll also check whether the sample_weight method works on the insurance dataset; it’s something I’m not sure of at this point. The methodology for checking similarity is what I wanted to highlight in the post, and the sample_weight idea was something additional. But yeah, it still feels incomplete without a use case for sample_weight.
Great - I really look forward to seeing how it goes. I guess the insurance dataset will only be a good example if the test set there isn’t randomly chosen, and has some behavior that’s outside what’s in the training set. Do you know if that’s the case? If not, you’ll need to pick a different dataset to show off your method.
I checked for similarity between test and train, and they appear to be very similar. But the method of using a sample weight for each row is still applicable, as there are rows that are more similar to the test data. I’ll try that and see if it improves the score; if not, I’ll try it on another dataset. Bulldozers could be an option too, as it had some covariate shift.
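A rough sketch of the sample_weight idea, continuing from the adversarial-validation snippet above (all names illustrative): weight each training row by how test-like the adversarial classifier thinks it is, then pass those weights to fit.

```python
from sklearn.ensemble import RandomForestRegressor

# probs[:len(train)] are the adversarial classifier's P(is_test) for the
# training rows (train came first in the concat): higher = more test-like
weights = probs[:len(train)]

model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
# Upweight training rows that resemble the test set
model.fit(X_train, y_train, sample_weight=weights)
```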