Another treat! Early access to Intro To Machine Learning videos

(Tuatini GODARD) #249

@gerardo I’m interested in this info but I can’t remember it. Do you still have the video “timestamp” where Jeremy speaks about this? Thanks

(Tuatini GODARD) #250

To answer myself


This course is beyond amazing. I have come across random forests so many times in the past but I never would go past thinking “well, this is weird” and “I wonder how this can be useful at all”. After the first couple of lectures I went to - “wow, this is really cool!” and I now have a deep sense of understanding what they are about and how each piece fits together. Time to put this understanding to good use :slight_smile:

Given the impact that the fastai courses have on me, there is one question I really would like to ask. If I understand correctly, the second part of this course will be delivered by @yinterian? Is there a chance for it to also be recorded and any hopes for it being shared early like ml1? Obviously that is a lot to hope for but thought I’d ask :slight_smile:

Another related thought - we now seem to have a lot covered in the fastai library:

  • deep learning
  • machine learning being covered
  • linear algebra, which I am planning to get to as soon as I get a bit of a breather from the amazing firehose of information I seem to be jacked up to atm :slight_smile:

One logical addition would be Statistics for Coders or The Bare Minimum of Stats You Need to Become Dangerous or something like that :wink: There is plenty of good material out there that I will suffer through at some point for lack of a fastai alternative if need be, but I sort of know how things go, and there is something unique to the fastai courses where you get a deep sense for what you are dealing with (and you even get to use it, which seems like there might be a connection between the two :wink: ). I have had multiple run-ins with statistics at this point as well, but I never seemed to get past where I got with random forests, which is not very far (despite all the time I sunk into it). And what I learned from @jeremy’s and @rachel’s advice on writing posts is that if I am in a boat, chances are that it is a big boat with many other folks in a similar situation :blush: Not that I am suggesting anything, but I just wanted to share this random thought that crossed my mind this morning in between changing diapers!

(Tuatini GODARD) #252

One little question here regarding feature engineering: do we consider ids (like person_id or item_id) to be categorical or continuous? I can’t get my head around these kinds of features, as they may be continuous (a new, incremented id is given for each new “row”) or they can be totally random. Thanks :slight_smile:
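
For anyone wanting to poke at this on their own data, here is a minimal sketch of the two readings of an id column described above. It is only an illustration: `id_order_agreement` and `encode_as_category` are hypothetical helper names, the example ids are made up, and the "fraction of increasing neighbours" check is an assumed heuristic, not something taken from the course.

```python
def id_order_agreement(ids):
    """Fraction of consecutive rows whose id increases.

    Close to 1.0 suggests the id was handed out incrementally, so it may
    carry "when was this row created" information. This is an assumed
    heuristic for illustration only.
    """
    pairs = list(zip(ids, ids[1:]))
    if not pairs:
        return 0.0
    return sum(b > a for a, b in pairs) / len(pairs)


def encode_as_category(ids):
    """Map each distinct id to an integer code in order of first appearance.

    The codes are just labels: their numeric order means nothing.
    """
    codes = {}
    return [codes.setdefault(i, len(codes)) for i in ids]


# Made-up examples: one incrementing id column, one arbitrary one.
incremental_ids = [101, 102, 103, 104, 105]
arbitrary_ids = [9041, 17, 5530, 17, 9041]

incremental_score = id_order_agreement(incremental_ids)
arbitrary_score = id_order_agreement(arbitrary_ids)
arbitrary_codes = encode_as_category(arbitrary_ids)
```

An id that tracks row order behaves more like a continuous (time-like) variable, while an arbitrary id is only usable as a label, which is one way to read "it may be continuous or totally random".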

(Aditya) #253

Waiting for the next lesson desperately…

(Jeremy Howard) #254

Posted lesson 8 to top thread (currently uploading)

(Ahmet Ekin) #255

Thank you very much @jeremy, great to be able to follow this course alongside DL.

(sergii makarevych) #256

Oh, @rachel is on the lecture. Glad you are ok.

(yinterian) #258

Radek, I will make my jupyter notebooks and other material public. I do a lot on the whiteboard at the moment, which is not a good setup for recording. Let me see what I can do.

(Jeremy Howard) #259

@yinterian FYI Terence Parr looked into using an iPad Pro for drawing on and projecting to the laptop, and I believe he said it worked pretty well. Might be worth asking him about.

(Ryan Herr) #260

One logical addition would be Statistics for Coders or The Bare Minimum of Stats You Need to Become Dangerous or something like that

Here’s a helpful blog post by Julia Evans: Some good “Statistics for programmers” resources

(Jeremy Howard) #261

The answer is “it depends”. There’s a little clarification of this in the latest video. Let me know if you’d like more info.

(Tuatini GODARD) #262

I just came across this tool on KDnuggets, which allows easy data wrangling/feature engineering on big data (can be very useful for the groceries competition). It looks awesome but is fairly new. (Sorry, I didn’t know where to post this news)

(Dipjyoti Bisharad) #263

@jeremy I am confused about something… In the last part of lesson 6 it was shown that a random forest cannot extrapolate a time series. So it means time-dependent variables are poor features when using random forests for time-series analysis (and maybe that’s why removing them helped improve accuracy to 0.92 in lesson 5).
But if that’s the case, then why did YearMade show up as the most important feature (in lesson 3) when the initial random forest was built?

(Aditya) #264

What about Tableau?

(Jeremy Howard) #265

It shows that it was a strong feature in the training set. That doesn’t mean it’ll extrapolate well into the test set.
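
A minimal sketch of the extrapolation point, not taken from the lesson notebooks: a tiny one-feature regression tree written from scratch, trained on made-up data with a perfectly linear trend. `fit_tree`, `predict` and the year/price numbers are all illustrative assumptions. Each leaf predicts the mean of the training targets that fell into it, so any input beyond the training range lands in the outermost leaf and gets that leaf's value. A random forest averages trees like this, so it is bounded in the same way.

```python
def _sse(values):
    """Sum of squared errors around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(xs, ys, min_leaf=2):
    """Fit a 1-D regression tree.

    Returns either a float (leaf: mean of its training targets) or a
    tuple (threshold, left_subtree, right_subtree).
    """
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]

    best = None  # (sse, split_index)
    for k in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[k] == xs[k - 1]:
            continue  # cannot split between identical feature values
        sse = _sse(ys[:k]) + _sse(ys[k:])
        if best is None or sse < best[0]:
            best = (sse, k)

    if best is None:
        return sum(ys) / len(ys)  # too few points to split: make a leaf

    k = best[1]
    threshold = (xs[k - 1] + xs[k]) / 2
    return (threshold,
            fit_tree(xs[:k], ys[:k], min_leaf),
            fit_tree(xs[k:], ys[k:], min_leaf))


def predict(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


# Made-up training data: the target rises by 1000 per year, 2000..2010.
years = list(range(2000, 2011))
prices = [5000 + 1000 * (y - 2000) for y in years]
tree = fit_tree(years, prices)

inside = predict(tree, 2010)    # last year seen in training
outside = predict(tree, 2020)   # ten years past the training range
```

Here `outside` equals `inside` even though the underlying trend keeps rising, and neither can exceed `max(prices)`. The split on year is very useful inside the training range, which is consistent with it ranking highly in feature importance there.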

(Tuatini GODARD) #266

I didn’t know it could handle big data. But it’s not free for such datasets right?

(Dipjyoti Bisharad) #267

It shows that it was a strong feature in the training set. That doesn’t mean it’ll extrapolate well into the test set.

OK, please check whether I got this right.
YearMade is a temporal feature, and the random forest reports it as the top feature because YearMade makes the best split in the training set data (which sounds intuitive as well). But being a temporal feature, it doesn’t perform well on the test set because it can’t extrapolate. So by removing it, the accuracy indeed improved.
The feature importance helped us find the features which the RF thought were important according to the training set. Then, for each of the top features, we checked whether it was temporal, and removed it if so. By doing so, we were left with only those features which aren’t temporal, and the RF could then create robust trees from them.
Is that correct?

(Jeremy Howard) #268

Yup that’s it :slight_smile:

(Tuatini GODARD) #269

@jeremy I’m trying to replicate what you did here on the groceries competition where basically you are trying to get a validation set as identical as possible to the test set.

So one question here: you mention that you retrain on the whole train set before submitting to Kaggle. What time window do you use to do that? Let’s say I want to try the last 2 weeks as my validation set, so I train my model from 2017-01-01 to 2017-07-31 and get a val set score. Before submitting to Kaggle to get the public LB score I have to retrain on my whole train set, so do I train from 2017-01-01 to 2017-08-15, or from 2017-01-15 to 2017-08-15 (by moving the time window and keeping the same range of data)?
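
To make the two options concrete, here is a small sketch that just computes both candidate retraining windows from the dates above. `retrain_windows` is a hypothetical helper, and it assumes all dates are inclusive and that the validation set runs from 2017-08-01 to 2017-08-15. It does not say which option is the better one.

```python
from datetime import date, timedelta


def retrain_windows(train_start, val_start, val_end):
    """Return (expanding, sliding) training windows as (start, end) pairs.

    expanding: keep the original start and absorb the validation period.
    sliding:   shift the start forward by the length of the validation
               period so the training range keeps the same number of days.
    All endpoints are treated as inclusive (an assumption).
    """
    val_days = (val_end - val_start).days + 1  # inclusive day count
    expanding = (train_start, val_end)
    sliding = (train_start + timedelta(days=val_days), val_end)
    return expanding, sliding


expanding, sliding = retrain_windows(
    train_start=date(2017, 1, 1),
    val_start=date(2017, 8, 1),
    val_end=date(2017, 8, 15),
)
```

With inclusive endpoints the validation period is 15 days, so the sliding window starts on 2017-01-16 rather than 2017-01-15. The one-day difference only comes from how the endpoints are counted.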