Another treat! Early access to Intro To Machine Learning videos

@gerardo I’m interested in this info but I can’t remember it. Do you still have the video “timestamp” where Jeremy speaks about this? Thanks

To answer myself

1 Like

This course is beyond amazing. I have come across random forests so many times in the past, but I would never get past thinking “well, this is weird” and “I wonder how this can be useful at all”. After the first couple of lectures I went to “wow, this is really cool!”, and I now have a deep understanding of what they are about and how each piece fits together. Time to put this understanding to good use :slight_smile:

Given the impact that the fastai courses have had on me, there is one question I would really like to ask. If I understand correctly, the second part of this course will be delivered by @yinterian? Is there a chance it will also be recorded, and any hope of it being shared early like ml1? Obviously that is a lot to hope for, but I thought I’d ask :slight_smile:

Another related thought - we now seem to have a lot covered by the fastai courses:

  • deep learning
  • machine learning
  • linear algebra, which I am planning to get to as soon as I get a bit of a breather from the amazing firehose of information I seem to be jacked up to atm :slight_smile:

One logical addition would be Statistics for Coders or The Bare Minimum of Stats You Need to Become Dangerous or something like that :wink: There is plenty of good material out there that I will suffer through at some point for lack of a fastai alternative if need be, but I sort of know how things go, and there is something unique to the fastai courses where you get a deep sense for what you are dealing with (and you even get to use it, which suggests there might be a connection between the two :wink: ). I have had multiple run-ins with statistics at this point as well, but I never seemed to get past where I got with random forests, which is not very far (despite all the time I sunk into it). And what I learned from @jeremy’s and @rachel’s advice on writing posts is that if I am in a boat, chances are it is a big boat with many other folks in a similar situation :blush: Not that I am suggesting anything, but I just wanted to share this random thought that crossed my mind this morning in between changing diapers!

5 Likes

One little question here regarding feature engineering: do we consider ids (like person_id or item_id) to be categorical or continuous? I can’t get my head around this kind of feature, as they may be continuous (a new, incremented id is given for each new “row”) or they can be totally random. Thanks :slight_smile:

Waiting for the next lesson desperately…

1 Like

Posted lesson 8 to top thread (https://youtu.be/DzE0eSdy5Hk - currently uploading)

7 Likes

Thank you very much @jeremy, great to be able to follow this course alongside DL.

Oh, @rachel is in the lecture. Glad you are ok.

2 Likes

Radek, I will make my jupyter notebooks and other material public. I do a lot on the whiteboard at the moment, which is not a good setup for recording. Let me see what I can do.

1 Like

@yinterian FYI Terence Parr looked into using an iPad Pro for drawing on and projecting to the laptop, and I believe he said it worked pretty well. Might be worth asking him about.

One logical addition would be Statistics for Coders or The Bare Minimum of Stats You Need to Become Dangerous or something like that

Here’s a helpful blog post by Julia Evans: Some good “Statistics for programmers” resources

3 Likes

The answer is “it depends”. There’s a little clarification of this in the latest video. Let me know if you’d like more info.
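
If it helps, here’s a minimal sketch of trying an id column both ways with pandas (the column name and values are made up, and this is not the course’s proc_df pipeline):

```python
import pandas as pd

# toy frame with a hypothetical person_id column
df = pd.DataFrame({'person_id': [101, 205, 101, 333],
                   'spend':     [10.0, 3.5, 7.2, 1.1]})

# option 1: treat the id as categorical - usually the safer default,
# since the numeric value of an id rarely carries meaning
df_cat = df.copy()
df_cat['person_id'] = df_cat['person_id'].astype('category').cat.codes

# option 2: leave it as a continuous number - can work when ids are
# assigned sequentially, so a bigger id roughly means "newer"
df_cont = df.copy()

print(df_cat.dtypes)
print(df_cont.dtypes)
```

Which one works better is an empirical question - try both against your validation set.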

1 Like

I just came across this tool on KDnuggets, which allows easy data wrangling/feature engineering on big data (it can be very useful for the groceries competition). This looks awesome but is fairly new. (Sorry, I didn’t know where to post this news.)

1 Like

@jeremy I’m a bit confused… In the last part of lesson 6 it was shown that random forests cannot extrapolate time series (see the toy sketch below for what I mean). So it means time-dependent variables are poor features when using random forests for time-series analysis (and maybe that’s why removing them helped improve accuracy to 0.92 in lesson 5).
But if that’s the case, then why did YearMade show up as the most important feature (in lesson 3) when the initial random forest was built?
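
For context, here’s a toy sketch (not from the lesson notebook) of the behaviour I mean by “cannot extrapolate”: a forest trained on an upward trend can never predict values above what it saw in training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# simple upward trend: the target grows linearly with "time"
x_train = np.arange(0, 100).reshape(-1, 1)
y_train = x_train.ravel() * 2.0

m = RandomForestRegressor(n_estimators=40, random_state=0)
m.fit(x_train, y_train)

# ask the forest about times it has never seen
x_future = np.array([[150], [200]])
print(m.predict(x_future))  # stuck near ~198, the maximum target seen in training

# each leaf can only average training targets that fell into it,
# so predictions can never exceed the training maximum
```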

What about Tableau?

https://www.tableau.com

It shows that it was a strong feature in the training set. That doesn’t mean it’ll extrapolate well into the test set.

1 Like

I didn’t know it could handle big data. But it’s not free for such datasets, right?

It shows that it was a strong feature in the training set. That doesn’t mean it’ll extrapolate well into the test set.

OK, please check whether I’ve got this right.
YearMade is a temporal feature, and the random forest reports it as the top feature because YearMade makes the best splits in the training set (which sounds intuitive as well). But being a temporal feature, it doesn’t perform well on the test set, because the forest can’t extrapolate. So by removing it, the accuracy indeed improved.
The feature importance helped us find the features the RF thought were important on the training set. Then, for the top features, we checked whether each one was temporal and removed it if so. By doing that, we were left with only the non-temporal features, and the RF could build robust trees from them (rough sketch below).
Is that correct?
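
To put that workflow in code, here’s my own rough sketch with sklearn and made-up data (not the lesson notebook, which used proc_df and the bulldozers frame):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# toy stand-in for the processed training frame
rng = np.random.default_rng(0)
df_trn = pd.DataFrame({
    'YearMade':     rng.integers(1990, 2012, 500),
    'MachineHours': rng.integers(0, 10_000, 500),
    'ProductSize':  rng.integers(0, 5, 500),
})
y_trn = df_trn['YearMade'] * 0.1 + rng.normal(0, 1, 500)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=0)
m.fit(df_trn, y_trn)

# importances as the forest sees them on the *training* set
fi = pd.Series(m.feature_importances_, index=df_trn.columns).sort_values(ascending=False)
print(fi)

# if a top feature is temporal (here YearMade) and hurts the validation
# score, drop it and refit on the remaining, non-temporal features
m2 = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=0)
m2.fit(df_trn.drop(columns=['YearMade']), y_trn)
```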

3 Likes

Yup that’s it :slight_smile:

2 Likes

@jeremy I’m trying to replicate what you did here on the groceries competition, where basically you are trying to get a validation set as similar as possible to the test set.

So one question here: you mention that you retrain on the whole train set before submitting to Kaggle. So what time window do you use to do that? Let’s say I want to use the last two weeks as my validation set, so I train my model from 2017-01-01 to 2017-07-31 and get a validation score. Before submitting to Kaggle to get the public LB score, I have to retrain on my whole train set - so do I train from 2017-01-01 to 2017-08-15, or from 2017-01-15 to 2017-08-15 (moving the time window forward while keeping the same amount of data)?
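
To make the question concrete, here’s a toy sketch of the two options (daily dummy data standing in for the groceries train set, which obviously has many rows per date):

```python
import pandas as pd

# stand-in for the groceries train set: one row per day in the range
df = pd.DataFrame({'date': pd.date_range('2017-01-01', '2017-08-15', freq='D'),
                   'sales': 1.0})

# last two weeks held out as the validation set
val_start = pd.Timestamp('2017-08-01')
trn = df[df['date'] < val_start]    # 2017-01-01 .. 2017-07-31, used for fitting
val = df[df['date'] >= val_start]   # 2017-08-01 .. 2017-08-15, used for scoring

# before submitting, retrain - but on which window?
full_a = df                                             # option A: 2017-01-01 .. 2017-08-15
full_b = df[df['date'] >= pd.Timestamp('2017-01-15')]   # option B: slide the window forward,
                                                        # keeping roughly the original length
```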