Another treat! Early access to Intro To Machine Learning videos

Awesome, thank you :slight_smile:

@miguel_perez I am reading the article you linked and man, it's not quite making sense to me yet :slight_smile:

What is categorical encoding? Is this something specific to R? Does it work like: if the category is in a list of categories, go right, else go left?

Quite neat that one-hot encoding seems to perform so poorly! One would think that numerical encoding should be worse - after all, 1, 2, 3 etc. seems to suggest an ordering where one might not exist… and yet. Ah, the magic of trees :slight_smile:

Haha, you had exactly the same question as me, @radek :smiley: I asked it on the KaggleNoobs Slack, where there is a #lauraedia channel in which you can ask questions to Laurae (the author of the article you linked). Here is a capture of the question I asked and the response, in case you don’t want to create a profile on KaggleNoobs:

Hope we can both figure this out :sweat_smile:


Thank you @ekami for sharing this! :slight_smile: So basically R does stuff, and that is all :slight_smile: Fine by me! At least now I am sort of starting to grasp why the encoding of categorical variables can be important / not easy :smiley:

Thanks @miguel_perez for sharing the article - it seems to have done my newb brain and @Ekami’s brain some good :slight_smile:


I can confirm :sweat_smile:

Yeah, this is something I spent too little time on. I spent the last 8 months working on DL and unstructured data (as @jeremy calls it) such as images, and then I had to fail my first freelance ML mission to realize there is a world outside DL, and that working purely with the data (plotting it, understanding it, feature engineering…) is even more important than being able to tune some knobs on a model's hyperparameters or doing architecture engineering (for DL). Now I understand when experienced data scientists say: “We spend 90% of our time working on the data itself and 10% on model tuning”.


Here is what I think I found out (not super sure this is correct):

  • what R does: split on up to 32 classes - basically, at each node it keeps a list, [these cats] -> go left, [those cats] -> go right
  • sklearn for now relies on one-hot encoding, which, as shown in the article from @miguel_perez, takes up a lot of tree depth to split on (also, each split can only send one category one way and all the rest the other way, so the split would have to be super significant for the tree to choose it early)
  • pandas supports a categorical dtype
  • sklearn one-hot encodes categorical variables (I wonder if that is what happens with sklearn-pandas when we pass a categorical column with an ordering to sklearn? Not sure - maybe it uses numerical encoding? See the small sketch of both encodings below this list)
  • a PR is relatively close to being finished that would give sklearn the same ability R has, plus extra superpowers for extremely randomized trees (the 32-category limit for a regular random forest is there because something like 2**31 splits would have to be evaluated, I think; with extremely randomized trees we toss a coin on which categories go left and which go right, so it stays computationally OK)
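
If it helps, here is a minimal sketch of the two encodings mentioned above (the column names and data are made up, and the random forest is only there to show that both encodings feed into sklearn the same way):

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# toy frame with one categorical column (made-up data, just to show the two encodings)
df = pd.DataFrame({"zone": ["a", "b", "c", "a", "b", "c"] * 50,
                   "x": range(300)})
y = [i % 7 for i in range(300)]

# label (integer) encoding: one column of codes, the tree splits on thresholds over the codes
label_encoded = df.assign(zone=df["zone"].astype("category").cat.codes)

# one-hot encoding: one 0/1 column per category, so each split can only isolate one category
one_hot = pd.get_dummies(df, columns=["zone"])

for name, X in [("label", label_encoded), ("one-hot", one_hot)]:
    m = RandomForestRegressor(n_estimators=10).fit(X, y)
    print(name, X.shape)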

@radek, maybe to clarify a bit:

I would not think about this in terms of “R” or “Python” or other frameworks. The way I thought about it, at least originally when posting, was more general: how do tree-based models handle categorical (non-numerical) variables? Specifically, I was trying to answer two questions:

1) What is the best representation of the feature for the model (to keep all the available information in the variable, accuracy…)?
2) What is computationally more efficient?

So you need to know which implementation of a tree-based model you are using - XGBoost, LightGBM, or many others - mostly because some can handle categoricals as such (they handle equalities; they can ask whether category blue == category blue) and others can’t handle categoricals as such and have to encode them as numbers. The fact that R encodes categories as numbers (factors) in the first instance is just an anecdote, because those numbers are the same thing as “label encoding” in Python - nothing really that clever.

All this “categorical encoding” thread is really about is what numbers to give the model when using ensembles of trees that don’t handle equalities for categoricals. Hope that’s a bit clearer.
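
For example - a minimal sketch with made-up data, assuming LightGBM’s sklearn wrapper - LightGBM can consume a pandas “category” column directly and split on it as a categorical, whereas for plain sklearn you would first have to turn that same column into numbers (label or one-hot encoding) yourself:

import pandas as pd
import lightgbm as lgb

# made-up toy data: one categorical column, one numeric column
df = pd.DataFrame({"color": pd.Categorical(["blue", "red", "green"] * 100),
                   "size": list(range(300))})
y = (df["color"] == "blue").astype(int)  # arbitrary target, just to have something to fit

# LightGBM treats the "category" dtype column as a true categorical feature,
# so no manual label encoding or one-hot encoding is needed here
model = lgb.LGBMClassifier(n_estimators=20)
model.fit(df, y)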

@Ekami, I’ve also learnt a lot from @laurae; he plays in another league but is really helpful to us relative noobs :grinning:


Do you guys realize how “out of my league” this thread must appear to beginners of Part 1 v2?
I mean, discussing MS students’ lectures or @laurae’s “tutorials”, wtf :upside_down:

Deffo belongs under the [adv] tag.

Lesson 5. Another great lesson.

Top-down approach since lesson 1, all taking shape… and all useful.

Thanks for sharing it @Jeremy!


That’s funny, because here is the discussion I had with Laurae about this:
Me:

I can’t imagine a feature from one programming language like R (the categorical variables) not existing for other languages like Python. I mean, the category variable R has must be based on some “inner workings” which can probably be ported to Python (and which probably already have been - we just need to find out where. Pandas’ category feature maybe?).

Laurae:

@tuatini It’s not a matter of whether it exists or not, it’s a matter of whether someone tried to code it successfully or not. Most of the stuff is C++ based and nearly all machine learning in R is now done using C++, OpenCL, or CUDA code for years, with most of categorical implementations being backported to C++ instead of R. In Python, there’s still a lot of mixed stuff with Python / C++ / CUDA code (and by far not enough OpenCL also).
It’s also not a matter of how R handles it if it does not exist in Python, it’s a matter of why no one is trying to do it properly (properly = assume non-ordinal, near infinite categories per categorical feature) in Python (answer: handling categorical features for decision trees is slow unless you do it in C++, and even in C++ it is slow)

Me:

I see, thanks for these useful insights @Laurae! I believe this PR https://github.com/scikit-learn/scikit-learn/pull/4899 (opened in June 2015) illustrates what you are talking about :slightly_smiling_face:


I don’t agree with Laurae’s comments here necessarily - if you want to dig deep into this, the best information by far is in this PhD thesis, by one of the sklearn core team: https://github.com/glouppe/phd-thesis/blob/master/thesis.pdf . It’s quite readable and really interesting, for more advanced students.


Found this link quite helpful for a better insight into Random Forests.


I was working on the lesson 2 notebook, which uses the planet data. I was trying to download it directly onto AWS using the curl method mentioned in lesson 1 of the ML course. I found that the planet data has most of its files in .tar.7z format rather than zip. Any guidance on how to tweak the command to download and extract it?

For downloading, you can use kaggle cli (there should be several mentions of it with detailed instructions in the forum).

For extracting, you can do the following:

First extract the 7z file by:
7z e train-jpg.tar.7z

This will give you train-jpg.tar file.

Then untar the file by:
tar xvf train-jpg.tar

Hope it helps!


Thank you. What do the ‘e’ in the first command and ‘xvf’ in the second stand for?

For the 7z command:
e - extract

For the tar command:
x - extract files from the archive
v - verbosely list the files being processed
f - the following argument is the archive file name

You can type man 7z or man tar to see each command’s manual page :slight_smile:


Hi @jeremy. I didn’t quite find an answer to my question in lesson 5, so I’m posting it again here in case someone can answer it:

A little question regarding the machine learning lesson 3 video at 47:20 and lesson 5 at 22:12, where Jeremy says: “Given the test set results from Kaggle and the results we have from our validation sets, we would like these two sets of scores to be as linear as possible, so that we know our validation set reflects the results we might get from the public leaderboard on Kaggle.”
So my question is: by doing so, wouldn’t we be indirectly overfitting the public leaderboard? In other words: if we have a validation set which behaves pretty much the same as the Kaggle public leaderboard, how can we be sure that we are not indirectly tweaking our models to be good only at the public part of the leaderboard?

Thanks :slight_smile:

I think the idea is that the public leaderboard is a sample from the same test set as the private leaderboard. You want your validation set to approximate your real-life performance (since it is the only feedback you can actually see), hence in real life you would like your validation set to be as representative as possible of the situations your model will encounter in the wild.

Here, “real life” is performing well on the leaderboard. You sort of go backwards: since you cannot sample from the real-life data (the test set), you tune your validation set to represent the public leaderboard as closely as possible.

You do not risk overfitting to the leaderboard, since that is not data your algorithm ever sees; rather, you figure out how to construct your validation set (or, more precisely, evaluate which validation-set construction strategy is more useful) so that it represents the test set as well as it can.
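
As a small sketch of that check (the scores below are made up - in practice you would record your own validation score and public LB score for a few submissions and eyeball the relationship):

import matplotlib.pyplot as plt

# made-up (validation score, public LB score) pairs for a handful of submissions
val_scores = [0.912, 0.921, 0.930, 0.941]
lb_scores = [0.905, 0.916, 0.927, 0.938]

plt.scatter(val_scores, lb_scores)
plt.xlabel("validation score")
plt.ylabel("public LB score")
plt.show()

# if the points fall roughly on a straight line, the validation set is tracking the
# leaderboard well; if not, try a different way of constructing the validation set
# (e.g. splitting by date instead of randomly) and check again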

Now, if the private leaderboard is not representative of the public leaderboard, then I don’t think anything can help you :slight_smile: But it would be a bit silly if that were what Kaggle was doing - it would be as if they said: hey, train on identifying these dog breeds, and here you can see your score on the leaderboard (thus giving you feedback on your performance), and then: haha, we will actually evaluate you on this completely different set of data where the dogs are wearing boots and hats. There is nothing preventing Kaggle from doing this, but I think they have an interest in us doing well, and the whole point is to get as good results as possible while preserving sound methodology (so that the results are meaningful, which is apparently not that easy due to leakage).

BTW, I have very little experience with Kaggle (hoping to change that :slight_smile: ) and all of this is just my musing, so it’s hard to say whether I am right or wrong, but maybe some of it will be helpful.

Also, an appropriate picture for reference below - this thread is already awesome as is, but there has never been a situation on the internet where a picture of a cat or a dog could not make matters even better :slight_smile:

  1. Why are we always trying to match our validation set to Kaggle’s private/test sets?

  2. Wouldn’t that mean we might be doing very well in the competition but be unable to generalize our model? (might be a trivial one)

The generalization we care about in a Kaggle competition is generalizing from our train set to the test set :slight_smile: Whether that is useful in real life or not is for Kaggle to make sure of when they devise the competition :slight_smile:

Any leakage or misalignment between the test set and the intended real-life scenarios will chip away at the model’s real-life usefulness.