Another treat! Early access to Intro To Machine Learning videos

That makes sense, thanks a lot!
So basically, let's take a Kaggle competition example (the groceries competition). If the test set (public + private leaderboard) is taken from the two weeks after the last day of the training set (e.g. the train set runs from 2013-01-01 to 2017-08-15 and the test set from 2017-08-16 to 2017-08-31), we can reasonably assume that, since the order matters, the public leaderboard would be calculated on, say, 2017-08-16 to 2017-08-22 and the private leaderboard on 2017-08-23 to 2017-08-31. So for our validation set we could replicate this pattern and take, for example, the date range from 2017-08-01 to 2017-08-15 :slight_smile:
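For concreteness, here is a minimal sketch of that split in pandas (the file and column names are assumptions on my part, not necessarily the actual competition schema):

```python
import pandas as pd

# A rough sketch of the idea -- 'train.csv' and the 'date' column are assumptions,
# adjust to the actual competition files.
df = pd.read_csv('train.csv', parse_dates=['date'])

# Hold out the last two weeks of the training period as a validation set,
# mirroring the (assumed) public/private leaderboard date split.
valid_mask = (df['date'] >= '2017-08-01') & (df['date'] <= '2017-08-15')
train_df = df[~valid_mask]
valid_df = df[valid_mask]
```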

I went through the whole document - was a very good read! Thank you for sharing :slight_smile:

There is also this argument made with regards to proximities:

It follows that the values 1-prox(n,k) are squared distances in a Euclidean space of dimension not greater than the number of cases.

I think I understand the words, but it is mind-boggling if I am reading this correctly. It is literally saying that proximities do not merely approximate a distance, but that 1-prox(n,k) literally is a squared distance between the cases in some space?

Well, nothing too important whether I am reading this right or not - proximities are already such a neat trick - but if that was correct (and the words seem to be saying exactly that) that would be just wow.
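To make prox(n,k) a bit more concrete, here is a small sketch of computing proximities from a scikit-learn forest (my own illustration, not from the docs): prox(n,k) is just the fraction of trees in which cases n and k land in the same leaf.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a forest and record which leaf each sample falls into, per tree.
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
leaves = rf.apply(X)                      # shape: (n_samples, n_trees)

# prox(n, k) = fraction of trees in which samples n and k share a leaf.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# The claim from the quote: 1 - prox behaves as a squared Euclidean
# distance between cases in some (high-dimensional) space.
sq_dist = 1 - prox
```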

You might be right - that is a scheme Kaggle could be using. But I am not sure - if I were Kaggle, and given my limited understanding of this, I would probably make the public leaderboard a random sample from the private one, a sample that covers all the days.

Yes, I think you are spot on wrt replicating the pattern :slight_smile: That is assuming the last two weeks of the train set mimic the test set quite well - e.g. they are not in the holiday season while the test set is, etc. There was also a comment made about the payday being important. I have not started on this competition (and it is going to finish soon :frowning:), but probably a good starting point for constructing a good validation set would be figuring out how it was possible to get into the top 30 with just the mean - what date range that person was looking at and what else they did to the data.
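Roughly what I have in mind by a mean baseline (just a sketch - the column names are my guess at the schema, so treat them as placeholders):

```python
import pandas as pd

# A sketch of a "just the mean" baseline -- the column names
# (date, store_nbr, item_nbr, unit_sales) are assumptions about the schema.
df = pd.read_csv('train.csv', parse_dates=['date'])

# Use only a recent window, e.g. the last two weeks of the train period.
recent = df[df['date'] >= '2017-08-01']

# Predict each (store, item) pair's mean sales over that window.
baseline = (recent.groupby(['store_nbr', 'item_nbr'])['unit_sales']
                  .mean()
                  .rename('prediction')
                  .reset_index())
```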

1 Like

There is still plenty of time - there are two months left, and I just started getting serious about this competition too :slight_smile:.

Yep very good insights :slight_smile:

2 Likes

I’ve just added the lesson 6 video to the top post (uploading now - will be there in ~30 mins)

7 Likes

I just fixed my Kaggle data downloader, guys. Fortunately Jeremy told us about the changes to kaggle-cli in the 3rd video of the DL course :slight_smile:

1 Like

@jeremy at the risk of giving you another “already better implemented tool” hint: there is a good feature-interaction explainer for XGBoost. It was originally made by Faron, a Kaggle competitions master.

I used it in R, but it has been ported to Python now: https://github.com/limexp/xgbfir
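Going from memory of the README (so please double-check the repo - the exact call is an assumption on my part), the Python usage looks roughly like this:

```python
import xgboost as xgb
import xgbfir
from sklearn.datasets import load_iris

# Fit any XGBoost model, then dump ranked feature interactions to a spreadsheet.
iris = load_iris()
model = xgb.XGBClassifier().fit(iris.data, iris.target)
xgbfir.saveXgbFI(model, feature_names=iris.feature_names,
                 OutputXlsxFile='irisFI.xlsx')
```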

I enjoyed the lesson a lot, thanks for sharing it!

3 Likes

@jeremy

I think this is the best lesson, as it includes real-world examples.
I think everyone should take a look at that lesson.

Agreed, this lesson is awesome!

@jeremy How many lessons like this do you plan to release? I know there will be 7 for deep learning, but how many for ML? Again, thanks a lot for doing this. I’m learning so much from you!

3 Likes

What’s the total number of lectures?

Just wanted to know in order to make space in my brain.

Seeing them makes me feel that I will pretty soon run out of memory...

3 Likes

Thank you so much Jeremy!! We appreciate all your help…

1 Like

I think 13 lectures for this one.

4 Likes

I just posted lesson 7 to the top post.

10 Likes

@jeremy I didn’t quite understand the point about the t-distribution and the number 22 in lesson 7, around 10:00. Could you give more details about this, or a link that explains in more detail what it is about?

I don’t know if I’m alone here, but I actually found lesson 7 to be way more technical than the rest (in the sense that there were a lot of mathematical terms and concepts I wasn’t comfortable enough with to follow and understand what you said).

Thanks :slight_smile:

The way I understood it: if you ask a statistician how many observations you need before the Central Limit Theorem is applicable, they might say 30. So it’s a reasonably large number, but you don’t need a lot of data before the method becomes applicable. I read it more as an empirical rule of thumb. Would definitely love to hear more.

It also hit me that the variance of the sample proportion of a binomial, p*(1-p)/n, goes down as the number of observations goes up, since p*(1-p) is at most 1/4. It was not emphasized in the lecture, but after I thought about it for a while, it made a lot of sense. Thanks for these ML lectures.
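A quick simulation of that intuition (my own sketch, not from the lecture) - the spread of the sample proportion shrinks roughly like sqrt(p*(1-p)/n):

```python
import numpy as np

# Simulate sample proportions of a Bernoulli(p) variable for growing n:
# their standard deviation shrinks roughly like sqrt(p * (1 - p) / n).
rng = np.random.default_rng(0)
p = 0.3
for n in [10, 30, 100, 1000]:
    props = rng.binomial(n, p, size=10_000) / n
    print(n, props.std(), np.sqrt(p * (1 - p) / n))
```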

That’s helpful feedback. Perhaps as a community we can try to explain some of the terms and concepts here, and then turn them into a web page? Could you let us know some of the terms or concepts you found tricky to understand?

Great lesson. Again. And yes, I also need a better brain machine for parts of this one (+ lots of Python skill) :grinning:

@jeremy, something useful that I thought I had commented on, but I see now that I had not: my preferred tool for random forests is… XGBoost. (Yes, I do mean random forests.) Performance-wise it is probably the best implementation of random forests, in R at least.

To mimic a random forest with XGBoost you must set the number of parallel trees > 1, plus some other minor tweaks. The only difference is that it will not bootstrap rows, but subsample them without replacement. It can handle quite big datasets, and it probably opens the door to using the tree interpreter (I haven’t used the tree interpreter in RF mode yet, only in boosting mode). And anyway… a very nicely performing RF tool!
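Roughly the settings I mean (a sketch using the Python API - the particular values are just a starting point, not a recipe):

```python
import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'num_parallel_tree': 100,   # grow 100 trees in one "round" -> a forest
    'subsample': 0.8,           # row sampling (without replacement, not a bootstrap)
    'colsample_bynode': 0.8,    # feature sampling at each split, like max_features
    'eta': 1,                   # no shrinkage
    'max_depth': 10,            # deeper trees than the usual boosting default
    'tree_method': 'hist',
    'objective': 'binary:logistic',
}

# A single boosting round of many parallel trees behaves like a random forest.
rf_like = xgb.train(params, dtrain, num_boost_round=1)
```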

3 Likes

I suspect that this might not be easy to explain - if so, no worries and sorry to bother you :slight_smile: - but is there an easy way to go conceptually from random forests to gradient boosting? I tried researching xgboost but never got far, and none of what I read really stuck with me or made a lot of sense :slight_smile:

But then I had the same experience with random forests, and I think I get those now :slight_smile:

@radek, boosting is just another way of aggregating models, usually trees (but it could be other models).

Maybe a good way to visualize it: a forest is “horizontal”. Boosting is “vertical” - you grow a tree, then you grow another one that improves on the residuals of the first one… Boosting is models correcting their mistakes sequentially. I don’t know if that brings you closer to an intuition about it…
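A tiny sketch of that “vertical” idea (my own illustration, using squared-error residuals with plain sklearn trees rather than xgboost):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Start from a constant prediction, then repeatedly fit a small tree
# to the residuals and add a damped version of it to the ensemble.
pred = np.full_like(y, y.mean(), dtype=float)
trees, lr = [], 0.1
for _ in range(100):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    pred += lr * tree.predict(X)
    trees.append(tree)

# Prediction for new data: y.mean() + lr * sum of tree.predict(X_new) over all trees.
```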

2 Likes

It does help - thank you :slight_smile:

1 Like