Another treat! Early access to Intro To Machine Learning videos

abhikbanerjee · March 25, 2018, 9:04pm

Thanks Jeremy, I think I misunderstood. So does set_rf_samples, when called after these lines of code (which use split_vals to create the train and valid sets), only sample from the X_train data set? If that's the case, it makes sense to me.

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

Thanks for taking the time to reply, appreciate it.

jeremy · March 25, 2018, 10:50pm

It samples from whatever dataset you provide to the RF.
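To make that concrete, here is a minimal sketch on synthetic data. It does not use set_rf_samples itself: it swaps in scikit-learn's own max_samples parameter (available from scikit-learn 0.22) as a stand-in for per-tree subsampling, and the split_vals helper below is an assumed reconstruction of the course helper, not copied from the library.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_vals(a, n):
    # Assumed stand-in for the course helper: first n rows train, the rest valid.
    return a[:n].copy(), a[n:].copy()

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=1000)

n_trn = 800
X_train, X_valid = split_vals(X, n_trn)
y_train, y_valid = split_vals(y, n_trn)

# Each tree is built from 200 rows drawn from whatever is passed to fit(),
# here X_train only. The validation rows are never seen during training.
m = RandomForestRegressor(n_estimators=10, max_samples=200, random_state=0)
m.fit(X_train, y_train)
valid_r2 = m.score(X_valid, y_valid)
```

Because only X_train and y_train are handed to fit, the subsampling can only ever draw from the training split.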

sleepy · March 25, 2018, 11:30pm

this material is awesome, thanks!!!

DILIPS · March 28, 2018, 2:54am

Nice and subtle explanations. Thanks @ramesh.

krehbiel21 · March 28, 2018, 6:08am

Hello @jeremy! I have two questions.

I’m almost done with the DL course and was wondering what benefits there are to learning typical machine learning. Are there many types of datasets/problems where classic machine learning will outperform deep learning? If so, what are some examples? Basically, I’m trying to decide if I should take this course after I complete the DL one, or just continue studying DL.
What are your thoughts on reinforcement learning? Do you have much experience with it? Any plans on teaching a course about it?

Thanks for all you do, love the teaching style!

rpathak · March 28, 2018, 6:42am

Hi @krehbiel21, I will try to answer your first point with an example.

This is a snippet from the FaceNet paper: https://arxiv.org/pdf/1503.03832.pdf

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches.

Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.

Here DL and ML are used together, each serving a different purpose, to produce a working solution.

jeremy · March 28, 2018, 2:28pm

You should definitely take this course next - nearly all the concepts are directly applicable to what you’ve learnt and will make you a better DL practitioner.

There’s still a lot of doubt about whether RL is actually doing anything useful. Random search is nearly as good for many of the things it’s been used for. So I’m holding off teaching anything about this until we have some genuine best practices to teach.

jeremy · March 28, 2018, 5:46pm

@parrt and friends have just written a new article on feature importance in random forests. Would love to get your feedback on this draft - let us know if anything is unclear, you spot any mistakes, etc.

Please don’t share on social media yet, until we’ve fixed up any little issues!

gabrielfior · March 28, 2018, 9:19pm

@jeremy Hello Jeremy, I have a question regarding the fast.ai library.
After I use train_cats on a dataframe, such as:

train_cats(df_raw)

I basically substitute the categorical variables by numerical variables.
As far as I understood, the previous categories contained in the dataframe are replaced in place but, if I want to retrieve them later, more precisely to substitute them back into the dataframe, there is no easy way of doing that, since I have no explicit mapping between numerical and categorical values.

Could you give a few pointers on how this can be achieved?

Thanks and congratulations on the excellent course!

DILIPS · March 29, 2018, 4:46am

Thanks @timlee

jeremy · March 29, 2018, 2:38pm

They’re still there - take a look at the data frame and you’ll see them! (We do this in the lesson, in fact).

We do discuss this in some detail in the videos, so maybe try re-watching them and see if you can answer your question - if so, come back here and let us know what you find. If not, tell us what you can about your understanding, and we’ll try to fill in the gaps.

Raymond · March 31, 2018, 2:40pm

Thanks a lot…

gabrielfior · April 3, 2018, 10:05pm

@jeremy Thanks Jeremy, indeed the categories remain present in the dataframe after using train_cats.
I still have trouble when I use the function proc_df, like this:

data = {"pet":["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
       "children":[4., 6, 3, 3, 2, 3, 5, 4],
       "salary":[90, 24, 44, 27, 32, 59, 36, 27]}
df = pd.DataFrame(data)
train_cats(df)
x, y, nas = proc_df(df, 'salary')
x.head()

index	children	pet
0	4.0	1
1	6.0	2
2	3.0	2
3	3.0	3
4	2.0	1

So my question here is: can I somehow get a mapping between the categorical values 1, 2 and 3 back to the original “pet” values (cat, dog, fish, respectively)?

Thanks again,
Gabriel.

jeremy · April 3, 2018, 10:13pm

Yes, they’re in series.cat.codes
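For anyone following along, a minimal pandas-only sketch of recovering the mapping. It assumes train_cats does roughly the equivalent of astype("category") on string columns, and that proc_df stores the codes shifted by one (which is what the 1, 2, 3 values in the output above suggest). The names code_to_pet and restored are made up for this illustration.

```python
import pandas as pd

pets = ["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"]
df = pd.DataFrame({"pet": pets})

# Roughly what train_cats is assumed to do for a string column.
df["pet"] = df["pet"].astype("category")

# The labels live in .cat.categories; the integers live in .cat.codes.
categories = df["pet"].cat.categories
codes = df["pet"].cat.codes + 1  # assumed +1 offset, matching the output above

# Build the code -> label lookup and map the numbers back to labels.
code_to_pet = {i + 1: label for i, label in enumerate(categories)}
restored = codes.map(code_to_pet)
```

The position of each label in .cat.categories is its code, so the lookup is just an enumeration with the same offset applied.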

utksh · April 5, 2018, 2:41pm

In Lesson 1 (Machine Learning), in the part where @jeremy is dealing with the strings in the data (https://youtu.be/CzdWqFTmn0Y?t=3535) (“low, medium, high” etc.), I am trying the same process to convert strings to categories. When I use dataframe.cat.set_categories(), my entire data changes to NaN. And on further changing it to dataframe.cat.categories, all the values change to -1. I have attached the Kaggle link to the data (train.csv), and I am trying to change the ‘GarageType’ column to category.
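One possible cause, offered as a guess since the actual code is not shown: set_categories keeps only values whose label appears in the list you pass, so any label that does not match exactly (spelling, case) becomes NaN with code -1. The sketch below uses a few illustrative GarageType-style values, not the real file.

```python
import pandas as pd

# Illustrative values only; the real column may contain other labels.
s = pd.Series(["Attchd", "Detchd", "Attchd", "BuiltIn"]).astype("category")

# Labels that do not match the data exactly: every value turns into NaN.
mismatched = s.cat.set_categories(["attached", "detached", "builtin"])

# Labels that match the data exactly: values survive, only the order changes.
matched = s.cat.set_categories(["Detchd", "BuiltIn", "Attchd"], ordered=True)

mismatched_codes = mismatched.cat.codes.tolist()
matched_codes = matched.cat.codes.tolist()
```

Checking series.unique() before calling set_categories is a quick way to see the exact labels that need to be listed.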

efpm04013 · April 6, 2018, 12:59pm

Great set of videos. It will be great to have a textbook as well for the course authored by Jeremy & Rachel.

shaman786 · April 7, 2018, 4:38pm

I am new to python and ML so pardon my ignorance. I was looking at the question raised during lesson 1 about automatically parsing the data and finding dates.

Now when I look at the pandas.read_csv documentation, there is an infer_datetime_format parameter. When I try it using:

df_raw1 = pd.read_csv("Train.csv", low_memory=False,
                 infer_datetime_format=True)

It doesn’t parse the saledate correctly.

But, if I try:

df_raw1 = pd.read_csv("Train.csv", low_memory=False, parse_dates=["saledate"],
                 infer_datetime_format=True)

saledate is parsed correctly and this is 10x faster than the default:

df_raw = pd.read_csv("Train.csv", low_memory=False, 
                 parse_dates=["saledate"])

Is there a reason to not use the infer_datetime_format ?
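A small sketch of the behaviour described above, using an in-memory CSV with made-up rows instead of Train.csv. According to the pandas documentation, infer_datetime_format only takes effect for columns that parse_dates already selects, which would explain why the flag on its own leaves saledate unparsed. The flag is left out of the code because pandas 2.0 deprecated it (format inference became the default behaviour).

```python
import io
import pandas as pd

# Made-up rows in the same date style as the course data.
csv_text = (
    "saledate,SalePrice\n"
    "11/16/2006 0:00,66000\n"
    "12/26/2004 0:00,57000\n"
    "10/26/2004 0:00,10000\n"
)

# No parse_dates: the column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates: the column is converted to datetimes.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

raw_is_datetime = pd.api.types.is_datetime64_any_dtype(raw["saledate"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["saledate"])
```

On an older pandas, adding infer_datetime_format=True to the second call is what gives the speed-up reported above.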

shaman786 · April 8, 2018, 3:21pm

Additionally, I have been trying out the stuff taught in Lesson 1 using the Bikesharing Demand dataset too:

I tried this on Google Colab and my results seem strange. Here’s my book on google drive:

shaman786 · April 10, 2018, 2:39pm

@jeremy Hi Jeremy, I had a question on the bagging section of the RF notebook. My understanding is that we need to pass n_estimators for the number of trees. So how does:

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

result in exactly 10 trees later in the example?

Edit: My bad. Looked at the documentation. 10 is default.
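A short sketch for anyone re-running this later: the default of 10 applies to the scikit-learn version current at the time of this post, and scikit-learn 0.22 changed the default to 100. Passing n_estimators explicitly keeps the notebook's behaviour the same on any version. The data here is synthetic, only there to make the snippet runnable.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

# Set the number of trees explicitly instead of relying on the default.
m = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)
m.fit(X, y)

# The fitted trees are stored in estimators_, one entry per tree.
n_trees = len(m.estimators_)
```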

jeremy · April 10, 2018, 4:58pm

Please don’t at-mention me unless it’s a question that no-one else could answer.