Another treat! Early access to Intro To Machine Learning videos


#795

Knowing how to clean and prepare your datasets for optimal models is one of the things we learn by doing multiple competitions. Jeremy covers some of this in lessons 2 and 3.

But if you really want to start on this competition, you might want to think over why Jeremy took the log of the sale price in his example. My take: it really simplified the explanation and the Python notebook.

In reality, the log price is needed only when we calculate the RMSLE of our validation set. So instead of using the RMSE function:

import math

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

where x and y are log prices, you can actually do:

import math
import numpy as np

def rmsle(x,y): return math.sqrt(((np.log1p(x)-np.log1p(y))**2).mean())

where x and y are normal (non-log) prices.

Similarly, in the Google competition you need the log of user transactions, but only during validation. So run your RF as-is, without logs, and when you check your predictions, compare the log of the sum of user transactions in the test and validation sets.
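To sanity-check the equivalence described above, here is a minimal sketch (the price arrays are made up): RMSLE computed on raw prices matches RMSE computed on log1p-transformed prices.

```python
import math
import numpy as np

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())
def rmsle(x, y): return math.sqrt(((np.log1p(x) - np.log1p(y)) ** 2).mean())

# Hypothetical raw (non-log) prices
preds = np.array([100000., 150000., 200000.])
actuals = np.array([110000., 140000., 210000.])

a = rmsle(preds, actuals)                      # raw prices in, log taken inside
b = rmse(np.log1p(preds), np.log1p(actuals))   # log prices in
# a and b are the same number
```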


#796

Has anybody else encountered this error??


#798

you can check out this article by Rachel on the validation sets
http://www.fast.ai/2017/11/13/validation-sets/


#799

You mean first place when you submitted results to Kaggle? Then congrats, well done!
If not, and you are only comparing RMSE without submitting, you need to be careful to have a good validation set; otherwise your result may be overfitting the training set.

Most likely it's overfitting to the training set. Try hyperparameter tuning to reduce overfitting.


#800

I just completed Lesson 2 and I am curious about the subsampling section of the discussion. Given that bootstrapping already subsamples the data why do we need a separate subsampling? Especially given that Jeremy clarifies that bootstrapping doesn’t work if we use the set_rf_samples function.

My understanding is that subsampling is just another way of getting our model to run and iterate faster.


(Patrick Suzuki) #801

Is this the error that you are getting? ValueError: Number of features of the model must match the input. Model n_features is 250 and input n_features is 80

If so, you need to make sure you run one-hot encoding on your test dataset. If you recall, you ran max_n_cats=7, which created additional columns in your train dataset. Basically, it's telling you that the number of columns doesn't match between your train and test datasets. Check whether you dropped or added columns that are not in the test dataset.

Finally, if the above still doesn't fix the columns, you can use .align() to match columns between the two dataframes. Example below:

```python
train_labels = train['TARGET']
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
```
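As a sketch of how this plays out with one-hot encoding (the 'color' column and the tiny frames below are hypothetical): get_dummies can produce columns in train that never appear in test, and .align() with an inner join reconciles them.

```python
import pandas as pd

# Hypothetical frames: the category 'green' appears only in train
train = pd.DataFrame({'color': ['red', 'blue', 'green'], 'TARGET': [1, 0, 1]})
test = pd.DataFrame({'color': ['red', 'blue']})

train_labels = train['TARGET']
train = pd.get_dummies(train.drop('TARGET', axis=1))  # color_blue, color_green, color_red
test = pd.get_dummies(test)                           # color_blue, color_red

# Inner join keeps only the columns present in both frames
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels
```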

(Patrick Suzuki) #802

I was working on lesson2-rf_interpretation and found there were a few things I needed to do to get ggplot() to work.

First I needed to include… from ggplot import *

After that, I encountered this error when trying to run ggplot()… ImportError: cannot import name 'Timestamp'
According to this SO post, it looks like this was caused by an outdated import statement in the ggplot source code. While I don't know if they are the best possible fixes, following the recommended changes did help me run ggplot successfully, so I wanted to share this for others trying to run ggplot.



(Patrick Suzuki) #803

Looks like we are working on the same stuff!

I read a tutorial for PDPbox, and it seems pdp.pdp_isolate() requires you to pass in the columns of the data frame. For whatever reason the notebook was missing this parameter. (Perhaps PDPbox got updated.) I passed in the additional argument x.columns.values and it worked.

Specifically, I did something like this… p = pdp.pdp_isolate(m, x, x.columns.values, feat)


#804

It worked! Thanks for the help!


(Wayne Nixalo) #805

Pandas SettingWithCopyWarning:

I had this issue and didn’t find it in a search – here for reference.

While working on a separate dataset, a lot of the methods in fastai.structured resulted in SettingWithCopyWarnings, which confused me since everything ran smoothly in Jeremy's lecture.

There are a lot of ways to trigger this warning; in my case the issue was that the processed dataframe I was using was not explicitly initialized as a copy of the raw dataframe.

So something like:

df = df_raw.drop(blahblah, axis=1)

instead of:

df = df_raw.drop(blahblah, axis=1).copy()

In the first case there’s no way for Pandas to know if df is a ‘view’ or a copy of df_raw, and fastai operations like add_datepart and proc_df (and others) will trigger SettingWithCopyWarnings.
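A minimal sketch of the fix (the frame and column names are made up):

```python
import pandas as pd

df_raw = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# The explicit .copy() makes df unambiguously independent of df_raw,
# so later column assignments won't trigger SettingWithCopyWarning
df = df_raw.drop('c', axis=1).copy()
df['a_doubled'] = df['a'] * 2  # safe: df is its own object
```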


A potentially very confusing warning; completely straightforward fix.


#806

In the first lesson, Jeremy says to stop watching the lessons directly from YouTube.

Okay, but he didn’t provide any links to the lessons on fast.ai. Where can I find the lessons he’s referring to?

Thank you.


#807

Could someone explain how to submit results to Kaggle?

It’s a pity that it wasn’t shown in the first lesson. How does one export the test set results to a CSV file with the appropriate columns?

Is there a built-in function to achieve this?
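In case it helps while waiting for an answer: a minimal sketch using pandas, assuming a submission format with Id and SalePrice columns (the IDs and predictions below are made up; check your competition's sample_submission.csv for the actual column names).

```python
import pandas as pd

# Hypothetical test-set IDs and model predictions
test_ids = [1461, 1462, 1463]
preds = [120000.0, 151000.0, 180000.0]

# Build the two-column frame and write it without the index column
submission = pd.DataFrame({'Id': test_ids, 'SalePrice': preds})
submission.to_csv('submission.csv', index=False)
```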


(antoine mercier) #808

My understanding is that Jeremy originally planned to make these videos official on the fast.ai website, but eventually he dropped that plan I guess. So I think the only way is to watch them from YouTube.


(antoine mercier) #809

See this link to lesson 3 of the dl1 course: 00:32:30 Create a submission to Kaggle


(Carlos Vouking) #810

Get them from Courses.


(hector) #811

@fastai1 These videos are meant to be viewed as part of the machine learning MOOC. That course hasn’t been launched yet, and I’m not sure when it will be out. However, early access to the videos is being provided, in the interest of time I guess. Let’s hope it will be released shortly.


(Jeremy Howard (Admin)) #812

You can now discuss these videos on the #ml1 category :slight_smile: We’ll be launching the website for the online course early next week.


#813

When you have several date columns in a dataframe, do you need to pass them all to the date_part function?

If so, is there way to iterate through each of the columns to find which are of the date datatype and convert them to codes?

I have tried this on two columns of my df, but I get an error in return:

for col in sliced_list:
    df_raw.col 
AttributeError: 'DataFrame' object has no attribute 'col'

Or else, if I try:

for col in sliced_list:
    df_raw.columns.col
AttributeError: 'Index' object has no attribute 'col'

Is there an easy way to iterate through a dataframe’s columns?
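One way (a sketch with a made-up frame): inside a loop, columns are accessed as df[col] rather than df.col, which is why the attribute-style access above raises AttributeError.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two date columns and one numeric column
df = pd.DataFrame({
    'saledate': pd.to_datetime(['2020-01-01', '2020-02-01']),
    'listdate': pd.to_datetime(['2019-12-01', '2020-01-15']),
    'price': [100, 200],
})

# df[col] (not df.col) looks up the column named by the loop variable
date_cols = [col for col in df.columns
             if np.issubdtype(df[col].dtype, np.datetime64)]
```

Each name in date_cols can then be passed to add_datepart in turn.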


(Jeremy Howard (Admin)) #814

@fastai1 please use the #ml1 category for discussing this course now.


#815

Hey Jeremy, sorry, but I have already posted several questions on the forum you indicated and haven’t received a response in several days. This page seemed much more responsive to me.

Where exactly should I post to get an answer?