Another treat! Early access to Intro To Machine Learning videos


(abhik) #545

Thanks Jeremy, I think I misunderstood. So does that mean set_rf_samples, called after these lines of code (which use split_vals to create the training and validation sets), only samples from the X_train dataset? If that’s the case, it makes sense to me.

df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

Thanks for taking the time to reply, appreciate it.


(Jeremy Howard) #546

It samples from whatever dataset you provide to the RF.
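To make that concrete, here is a toy sketch in plain Python. It is not fastai's set_rf_samples implementation; the helper name sample_rows_for_tree and the sizes are made up for illustration. It only shows that when the rows are split first and only the training rows are handed to the fitting step, every per-tree sample is drawn from those training rows.

```python
import random

random.seed(42)

# Hypothetical row ids: the first n_trn rows are training, the rest validation,
# mirroring what split_vals does in the snippet above.
rows = list(range(1000))
n_trn = 800
train_rows, valid_rows = rows[:n_trn], rows[n_trn:]


def sample_rows_for_tree(fit_rows, n_samples):
    """Draw a random subsample (with replacement) from the rows given to fit."""
    return random.choices(fit_rows, k=n_samples)


# Each "tree" only ever sees rows from whatever was passed in, here train_rows.
per_tree = [sample_rows_for_tree(train_rows, 100) for _ in range(10)]
```

Because valid_rows is never passed to the sampling step, no validation row can end up in any tree's sample.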


(sleepy) #547

this material is awesome, thanks!!!


(DILIP S) #548

Nice and subtle explanations. Thanks @ramesh.


(Matthew Krehbiel) #549

Hello @jeremy! I have two questions.

  1. I’m almost done with the DL course and was wondering what benefits there are to learning typical machine learning. Are there many types of datasets/problems where classic machine learning will outperform deep learning? If so, what are some examples? Basically, I’m trying to decide if I should take this course after I complete the DL one, or just continue studying DL.

  2. What are your thoughts on reinforcement learning? Do you have much experience with it? Any plans on teaching a course about it?

Thanks for all you do, love the teaching style!


(Rahul Pathak) #550

Hi @krehbiel21, I will try to answer your first point with an example:

This is a snippet from the FaceNet paper - https://arxiv.org/pdf/1503.03832.pdf

Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches.

Once this embedding has been produced, then the aforementioned tasks become straight-forward: face verification simply involves thresholding the distance between the two embeddings; recognition becomes a k-NN classification problem; and clustering can be achieved using off-the-shelf techniques such as k-means or agglomerative clustering.

Here DL and ML are used together, each serving a different purpose, to produce a working solution.
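As a rough illustration of the "classic ML on top of embeddings" half of that pipeline, here is a minimal sketch in plain Python. The 2-D embeddings, the names, and the threshold value are all invented for the example; a real system would get its embeddings from the trained network and tune the threshold on held-out data.

```python
import math

# Hypothetical embeddings, standing in for the output of a trained network.
gallery = {"alice": (0.0, 1.0), "bob": (1.0, 0.0)}
query = (0.1, 0.9)

THRESHOLD = 0.5  # assumed value, for illustration only


def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.dist(a, b)


def same_person(a, b, threshold=THRESHOLD):
    """Verification: two faces match if their embeddings are close enough."""
    return distance(a, b) < threshold


def nearest_identity(q, labelled):
    """Recognition as 1-nearest-neighbour over labelled embeddings."""
    return min(labelled, key=lambda name: distance(q, labelled[name]))
```

With these toy values, same_person(query, gallery["alice"]) is True and nearest_identity(query, gallery) returns "alice".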


(Jeremy Howard) #551

You should definitely take this course next - nearly all the concepts are directly applicable to what you’ve learnt and will make you a better DL practitioner.

There’s still a lot of doubt about whether RL is actually doing anything useful. Random search is nearly as good for many of the things it’s been used for. So I’m holding off teaching anything about this until we have some genuine best practices to teach.


(Jeremy Howard) #552

@parrt and friends have just written a new article on feature importance in random forests. Would love to get your feedback on this draft - let us know if anything is unclear, you spot any mistakes, etc.

Please don’t share on social media yet, until we’ve fixed up any little issues!


(Gabriel Fior) #553

@jeremy Hello Jeremy, I have a question regarding the fast.ai library.
After I apply train_cats to a dataframe, such as:

train_cats(df_raw) 

the categorical variables are basically substituted by numerical ones.
As far as I understand, the categories previously contained in the dataframe are replaced in place. If I want to retrieve them later, more precisely to substitute them back into the dataframe, there seems to be no easy way of doing that, since I have no explicit mapping between the numerical and categorical values.

Could you give a few pointers on how this can be achieved?

Thanks and congratulations on the excellent course!


(DILIP S) #554

Thanks @timlee


(Jeremy Howard) #555

They’re still there - take a look at the data frame and you’ll see them! (We do this in the lesson, in fact).

We do discuss this in some detail in the videos, so maybe try re-watching them and see if you can answer your question - if so, come back here and let us know what you find. If not, tell us what you can about your understanding, and we’ll try to fill in the gaps.


(Raymond) #556

Thanks a lot…


(Gabriel Fior) #557

@jeremy Thanks Jeremy, indeed the categories remain present in the dataframe after using train_cats.
I still have trouble when I use the function proc_df, like this:

data = {"pet":["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
       "children":[4., 6, 3, 3, 2, 3, 5, 4],
       "salary":[90, 24, 44, 27, 32, 59, 36, 27]}
df = pd.DataFrame(data)
train_cats(df)
x, y, nas = proc_df(df, 'salary')
x.head()
index children pet
0 4.0 1
1 6.0 2
2 3.0 2
3 3.0 3
4 2.0 1

So my question here is: can I somehow get a mapping between the categorical values 1, 2 and 3 back to the original “pet” values (cat, dog, fish, respectively)?

Thanks again,
Gabriel.


(Jeremy Howard) #558

Yes, the codes are in series.cat.codes, and the labels they correspond to are in series.cat.categories.
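For anyone following along, here is a minimal pandas-only sketch of recovering the labels. It does not call train_cats or proc_df; it just uses a categorical series directly. The +1 shift is an assumption based on the proc_df output shown above, where cat, dog and fish appear as 1, 2 and 3 rather than the 0-based codes pandas stores.

```python
import pandas as pd

# A categorical column like the "pet" column in the example above.
pets = pd.Series(["cat", "dog", "dog", "fish", "cat"]).astype("category")

# pandas stores 0-based integer codes (-1 would mean a missing value).
codes = pets.cat.codes

# Code -> label mapping taken from the category index.
mapping = dict(enumerate(pets.cat.categories))

# Assumption: the numeric column being decoded is shifted by one,
# as in the output shown above (cat=1, dog=2, fish=3).
shifted = codes + 1
mapping_1based = {code + 1: label for code, label in mapping.items()}

# Map the numeric column back to the original labels.
restored = shifted.map(mapping_1based)
```

The key point is that the mapping lives on the original categorical column, so keep that column (or its .cat.categories) around if you need to decode the numbers later.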


(Utkarsh Mishra) #559

In Lesson 1 (Machine Learning), in the part where @jeremy deals with the strings in the data (https://youtu.be/CzdWqFTmn0Y?t=3535) (“low, medium, high” etc.), I am trying the same process to convert strings to categories. When I use dataframe.cat.set_categories(), my entire data changes to NaN. And when I then change it to dataframe.cat.categories, all the values change to -1. I have attached the Kaggle link to the data (train.csv), and I am trying to change the ‘GarageType’ column to a category.
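One possible cause, offered as a guess since the actual code isn't shown: set_categories turns every value that is not in the new category list into NaN (code -1), so the labels passed in must match the existing strings exactly. The sketch below uses illustrative GarageType-like values, not the real file.

```python
import pandas as pd

# Illustrative values only; the real column comes from train.csv.
garage = pd.Series(["Attchd", "Detchd", "Attchd"]).astype("category")

# Labels that do not match the existing strings: every value becomes NaN,
# and the underlying codes all become -1.
bad = garage.cat.set_categories(["attached", "detached"])

# Labels that match exactly: values are kept, only the order changes.
good = garage.cat.set_categories(["Detchd", "Attchd"], ordered=True)
```

Checking garage.unique() before calling set_categories is a quick way to confirm the exact spellings present in the column.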


(Kiran) #560

Great set of videos. It will be great to have a textbook as well for the course authored by Jeremy & Rachel.


#561

I am new to python and ML so pardon my ignorance. I was looking at the question raised during lesson 1 about automatically parsing the data and finding dates.

Now when I look at the pandas.read_csv documentation, there is an infer_datetime_format parameter. When I try it using:

df_raw1 = pd.read_csv("Train.csv", low_memory=False,
                 infer_datetime_format=True)

It doesn’t parse the saledate correctly.

But, if I try:

df_raw1 = pd.read_csv("Train.csv", low_memory=False, parse_dates=["saledate"],
                 infer_datetime_format=True)

saledate is parsed correctly, and this is 10x faster than the default:

df_raw = pd.read_csv("Train.csv", low_memory=False, 
                 parse_dates=["saledate"])

Is there a reason not to use infer_datetime_format?
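On why the first attempt left saledate unparsed: as far as the pandas documentation describes it, infer_datetime_format only takes effect for columns that parse_dates already selects, so on its own it does nothing. The sketch below shows the part that matters, using a tiny in-memory CSV with made-up rows instead of Train.csv, and leaving infer_datetime_format out because newer pandas releases deprecate it.

```python
import io

import pandas as pd

# Two made-up rows standing in for Train.csv.
csv_text = (
    "saledate,SalePrice\n"
    "11/16/2006 00:00,66000\n"
    "03/26/2004 00:00,57000\n"
)

# Without parse_dates, the column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column becomes a datetime column.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

plain_is_datetime = pd.api.types.is_datetime64_any_dtype(plain["saledate"])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(parsed["saledate"])
```

Here plain_is_datetime is False and parsed_is_datetime is True, which matches the behaviour described in the post.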


#562

Additionally, I have been trying out the stuff taught in Lesson 1 using the Bikesharing Demand dataset too:

I tried this on Google Colab and my results seem strange. Here’s my notebook on Google Drive:


#563

@jeremy Hi Jeremy, I have a question on the bagging section of the RF notebook. My understanding is that we need to pass n_estimators to set the number of trees. So how does:

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

result in exactly 10 trees later in the example?

Edit: My bad. I looked at the documentation. 10 is the default.
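One caveat for later readers: the default depends on the scikit-learn version (10 in older releases, 100 from version 0.22 onward), so passing n_estimators explicitly keeps the notebook's behaviour reproducible. A minimal sketch on synthetic data (the data and the random_state are made up for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Set the number of trees explicitly instead of relying on the default.
m = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0)
m.fit(X, y)

# The fitted trees are stored in estimators_, one entry per tree.
n_trees = len(m.estimators_)
```

After fitting, n_trees is 10 regardless of which scikit-learn version is installed.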


(Jeremy Howard) #564

Please don’t at-mention me unless it’s a question that no-one else could answer.