Another treat! Early access to Intro To Machine Learning videos


(Carlos Crespo) #786

I think I found it in this other thread: “How to duplicate training examples to handle class imbalance”.

Copying the link in case it is of interest to anyone else: https://arxiv.org/pdf/1710.05381.pdf

Thanks.


(Ramesh Kumar Singh) #788

Hi Utkarsh,

As I understood it, it’s a fastai function that is used to update the learning rate for the optimisation of weights and biases.


(Ramesh Kumar Singh) #789

I think min_samples_split is there to make sure you perform a split only if the number of samples/rows at the current node is greater than or equal to min_samples_split; with fewer samples, the split won’t happen. So even if max_depth is None, once you have specified min_samples_split the tree is not going to grow any further at a node that contains fewer than min_samples_split samples/rows.
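A minimal sketch of the effect with scikit-learn (random data, just for illustration):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(1000, 5)
y = np.random.rand(1000)

# With min_samples_split=20, any node holding fewer than 20 rows becomes
# a leaf, even though max_depth=None imposes no depth limit
m = RandomForestRegressor(max_depth=None, min_samples_split=20, n_jobs=-1)
m.fit(X, y)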


(Ramesh Kumar Singh) #790

I second @jpramos’ reply. Try naming the parameters you are passing, and hopefully that will resolve it.


(Ramesh Kumar Singh) #791

Hey Sashank,
I think it is simplest to use categorical codes for random forest/ensemble/tree models and to explain the results. If we use categorical codes in linear or logistic regression, the arbitrary values assigned to the codes can interfere with building the model, because they get treated as meaningful numbers. So wherever you are using decision/rule-based models, it’s OK to use either categorical codes or one-hot encoded variables. But when you are using anything other than a rule/decision-based algorithm, I would suggest one-hot encoding.
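A small pandas illustration of the two encodings (hypothetical data):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Integer codes: fine for tree-based models, which only ask threshold
# questions like "is color_code <= 1.5?"
df['color_code'] = df['color'].astype('category').cat.codes

# One-hot encoding: safer for linear/logistic regression, where the
# arbitrary ordering of integer codes would act as a fake numeric feature
one_hot = pd.get_dummies(df['color'], prefix='color')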


#792

Hello,

I finished my first Kaggle competition and got a surprisingly good result for the dataset the first time I ran the model.

However, when I tried to separate the set into a training set and a validation set, I got worse results.

These are the results for the whole set:

[0.06640609395146409, 0.059749794949568814, 0.9724835235033371, 0.9731294015377561]

And these are the scores for the training and validation sets (I used a validation set of 43 rows, roughly 0.02 percent of the whole set):

[0.06867925757638324, 0.15292615297037612, 0.9705674334979815, 0.8239775636830122]

By separating the set into a validation set and a training set, I fell well behind on the leaderboard (around the 75th percentile).
While using the whole set (I hope I didn’t make any mistake), I got to first place on the leaderboard.

Why is there so much difference in the predictions? Is it because my training set is too small to divide it up into two sets?

What’s the conclusion? Should you only divide your set when it is large enough?


(Adrien Lemaire) #793

Hi,

I recently started ML part 1, and for that purpose created a GPU-enabled GCP instance. But I realize that the course (at least lesson 1) only uses the instance’s CPU.
What setting should I tweak to get the lesson 1 notebook to execute cells on the GPU?

Also, following Jeremy’s request at the end of lesson 1, I went to the first Kaggle competition I could find and tried to prepare the data to run a random forest on it, but I’m facing a big issue:
while the video describes a dataset where each row has its own target value, the dataset I’m playing with has several rows per user, and the target to estimate is the log of the sum of a column over all rows grouped by user.

Now, to continue, I can only think of 2 options:

  • recreate a new dataset with only one row per user, the data being merged/averaged/etc. (it feels like we’d lose information this way); I sketch this option below
  • create a new column logTotalRevenue on each row, containing the correct target value (I can do this, although it’s an extremely slow function), but it feels like a random forest cannot work this way
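For option 1, I’m imagining something like the following (df and all column names here are placeholders, not the competition’s actual ones):

import numpy as np

# df is assumed to be the raw competition dataframe; columns are illustrative
agg = df.groupby('user_id').agg({
    'transactionRevenue': 'sum',   # the column whose per-user sum is the target
    'visitNumber': 'max',
    'hits': 'mean',
})
agg['logTotalRevenue'] = np.log1p(agg['transactionRevenue'])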

Can someone give me some pointers on the proper way to apply random forests to this competition?


#794

What do you mean by results from the whole set? If you score the whole set you will have only one RMSE and one score value.

When you have two values, that means there are two datasets: train and validation.

Without the code it is difficult to know what you did, but if you are running Jeremy’s code as-is then the second result happens after you sample the data. Jeremy had taken out 30k rows for faster processing.
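From memory, that part of the lesson notebook looks roughly like this (treat the exact numbers, the proc_df signature, and split_vals as assumptions about the fastai 0.7 notebooks):

from fastai.structured import proc_df

# df_raw is assumed to be the loaded training data;
# subset=30000 samples 30k rows while processing
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice', subset=30000)

# Peel a validation set off the end of the (date-sorted) data
def split_vals(a, n): return a[:n], a[n:]

n_valid = 12000
X_train, X_valid = split_vals(df_trn, len(df_trn) - n_valid)
y_train, y_valid = split_vals(y_trn, len(y_trn) - n_valid)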


#795

Knowing how to clean/create your datasets for optimal models is one of the things we need to learn by doing multiple competitions. Jeremy covers some of these things in lessons 2 and 3.

But if you really want to start on this competition, you might want to think about why Jeremy took the log of the sale price in his example. My take is: it really simplified the explanation and the Python notebook.

In reality, the log price is required only when we need to calculate the RMSLE of our validation set. So instead of using the RMSE function:

import math
def rmse(x, y): return math.sqrt(((x - y)**2).mean())

where x and y are log prices.

You can actually do:

import math
import numpy as np
def rmsle(x, y): return math.sqrt(((np.log1p(x) - np.log1p(y))**2).mean())

where x and y are normal (no log) prices.

Similarly, in the Google competition you need the log of user transactions, but only during validation. So run your random forest as-is, without logs. And when you need to check your predictions, compare the log(sum of user transactions) in the test and validation sets.
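As a rough sketch of that check, assuming placeholder column names (user_id, transactionRevenue) rather than the competition’s actual ones:

import numpy as np

def rmsle_by_user(df_valid, preds):
    d = df_valid.assign(pred=preds)
    # Sum actual and predicted revenue per user, then compare on log scale
    actual = d.groupby('user_id')['transactionRevenue'].sum()
    predicted = d.groupby('user_id')['pred'].sum()
    return np.sqrt(((np.log1p(predicted) - np.log1p(actual))**2).mean())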


#796

Has anybody else encountered this error??


#798

You can check out this article by Rachel on validation sets:
http://www.fast.ai/2017/11/13/validation-sets/


#799

Do you mean first place when you submitted results to Kaggle? Then congrats, well done!
If not, and you are only comparing RMSE rather than submitting your results, you need to be careful to have a good validation set; otherwise you may have a result that is overfitting the training set.

Most likely it’s overfitting to the training set. Try hyperparameter tuning to reduce overfitting.


#800

I just completed lesson 2 and I am curious about the subsampling section of the discussion. Given that bootstrapping already subsamples the data, why do we need a separate subsampling step? Especially given that Jeremy clarifies that bootstrapping doesn’t work if we use the set_rf_samples function.

My understanding is that subsampling is just another way of getting our model to run faster so we can experiment and test.
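For reference, a sketch of how it’s used in the lessons (assuming fastai 0.7, where set_rf_samples lives in the structured module):

from fastai.structured import set_rf_samples, reset_rf_samples
from sklearn.ensemble import RandomForestRegressor

set_rf_samples(20000)            # each tree is now grown on 20k random rows
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
# m.fit(X_train, y_train)        # fit as usual on the full dataframe
reset_rf_samples()               # restore standard bootstrapping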


(Patrick Suzuki) #801

Is this the error that you are getting? ValueError: Number of features of the model must match the input. Model n_features is 250 and input n_features is 80

If so, then you need to make sure you run one-hot encoding on your test dataset as well. If you recall, you ran with max_n_cat=7, which created additional columns in your train dataset. Basically it’s telling you that the number of columns doesn’t match between your train and test datasets. Check whether you dropped or added columns that are not in the test dataset.

Finally, if the above still doesn’t reconcile the columns, you can use .align() to match the columns between the two dataframes. Example below:

# Save the target, keep only the columns present in both frames, then restore it
train_labels = train['TARGET']
train, test = train.align(test, join='inner', axis=1)
train['TARGET'] = train_labels

(Patrick Suzuki) #802

I was working on lesson2-rf_interpretation and found there were a few things I needed to do to get ggplot() to work.

First I needed to include: from ggplot import *

After that I encountered an error when trying to run ggplot(): ImportError: cannot import name 'Timestamp'
According to this SO post, it looks like this is caused by an outdated import statement in the ggplot source code. While I don’t know if they are the best possible fixes, following the recommended changes did help me run ggplot successfully, so I wanted to share this for others trying to run ggplot.



(Patrick Suzuki) #803

Looks like we are working on the same stuff!

I read a tutorial for PDPbox and it seems that pdp.pdp_isolate() requires you to pass in the columns of the data frame. For whatever reason the notebook was missing this parameter (perhaps PDPbox got updated). I passed in the additional argument x.columns.values and it worked.

Specifically, I did something like this: p = pdp.pdp_isolate(m, x, x.columns.values, feat)


#804

It worked! Thanks for the help!


(Wayne Nixalo) #805

Pandas SettingWithCopyWarning:

I had this issue and didn’t find it in a search, so I’m putting it here for reference.

While working on a separate dataset, a lot of the methods in fastai.structured resulted in SettingWithCopyWarnings, which confused me since everything ran smoothly in Jeremy’s lecture.

There are a lot of ways to trigger this warning; in my case the issue was that the processed dataframe I was using was not explicitly initialized as a copy of the raw dataframe.

So something like:

df = df_raw.drop(blahblah)

instead of:

df = df_raw.drop(blahblah).copy()

In the first case there’s no way for Pandas to know whether df is a ‘view’ or a copy of df_raw, and fastai operations like add_datepart and proc_df (among others) will trigger SettingWithCopyWarnings.


A potentially very confusing warning with a completely straightforward fix.


#806

In the first lesson, Jeremy says to stop watching the lessons directly from YouTube.

Okay, but he didn’t provide any links to the lessons on fast.ai. Where can I find the lessons he’s referring to?

Thank you.


#807

Could someone explain how to submit results to Kaggle?

It’s a pity that it wasn’t shown in the first lesson. How does one export the test set results to a CSV file with the appropriate columns?

Is there a built-in function to achieve this?