Another treat! Early access to Intro To Machine Learning videos

I am looking at the code for add_datepart and I am confused. What do these lines actually do?


if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
    fld_dtype = np.datetime64


df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
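A minimal sketch of what the Elapsed line appears to compute, on a toy column (the name saledate is only illustrative): casting a datetime column to int64 gives nanoseconds since 1970-01-01, and integer-dividing by 10 ** 9 turns that into whole seconds, i.e. Unix time. The isinstance check above it seems to be there for timezone-aware columns, whose dtype is a pandas extension type rather than a NumPy one, presumably so a later dtype check still treats them as datetimes.

```python
import numpy as np
import pandas as pd

# Toy dates; the column name 'saledate' is hypothetical.
df = pd.DataFrame({"saledate": pd.to_datetime(["1970-01-02", "2000-01-01"])})

# Force nanosecond resolution so the integer view counts nanoseconds since the epoch.
fld = df["saledate"].astype("datetime64[ns]")

# Nanoseconds since 1970-01-01, integer-divided by 10**9, gives whole seconds.
elapsed = fld.astype(np.int64) // 10 ** 9
print(elapsed.tolist())
```

The result is one number per row that grows with time, which a tree model can split on directly.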


I am trying my first Kaggle Competition after the first lesson. I chose House Prices prediction.

I have tried to replicate all that was done in the first lesson on machine learning, but when I got to fit my model, I got a ValueError: n_estimators must be an integer, got <class 'pandas.core.frame.DataFrame'>.

I don’t understand what I did wrong or what more I should have done: shouldn’t the train_cats function take care of the strings in the dataframe and convert them all to numeric automatically?

You can verify the data types of your df by looking at df.dtypes (it is an attribute, not a method) and checking whether any columns are neither numeric nor categorical.
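Here is a rough sketch of that check on a made-up frame (the column names are only illustrative). It prints the dtypes and then lists any column that is neither numeric nor categorical, which are the ones that would still trip up the model.

```python
import pandas as pd

# Toy frame with one numeric, one plain string and one categorical column.
df = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street": ["Pave", "Grvl"],
    "Zoning": pd.Categorical(["RL", "RM"]),
})

# dtypes is an attribute, so there are no parentheses.
print(df.dtypes)

# Columns that are neither numeric nor categorical still need converting.
leftover = df.select_dtypes(exclude=["number", "category"]).columns.tolist()
print(leftover)
```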


Without seeing your code, it seems like you’re passing the pandas dataframe to the RandomForestRegressor constructor (the first argument is n_estimators and it expects an integer, but you’re giving it a dataframe). Remember that you first create the model, and only then fit it by calling fit on your training features and y_train.
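Something like this, as a minimal sketch on random toy data (the data and the hyperparameter values are made up): the constructor only takes settings, and the data goes to fit.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data, purely illustrative.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

# Step 1: create the model. Only hyperparameters go here, named explicitly.
m = RandomForestRegressor(n_estimators=10, random_state=0)

# Step 2: fit it. This is where the data goes.
m.fit(X_train, y_train)

# R^2 on the training data, just to confirm the model was fitted.
score = m.score(X_train, y_train)
print(score)
```

Naming the arguments (n_estimators=10) makes it much harder to put the data in the wrong slot by accident.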


In lesson 7, minute 17 a paper regarding resampling of unbalanced classes in training sets is mentioned. Could someone please help me find that paper (authors, title, or link)?

Thanks in advance for your help, and for this amazing set of lessons.

I think I found it in this other thread How to duplicate training examples to handle class imbalance.

Copying the link in case it is of interest for anyone else


Hi Utkarsh,

As I understood it, it’s a fastai function which is used to update the learning rate for the optimisation of weights and biases.

I think min_samples_split makes sure a split is only attempted if the number of samples/rows at the current node is greater than or equal to min_samples_split. If there are fewer, the split won’t happen. So if max_depth is None and you have specified min_samples_split, the tree stops growing at any node that contains fewer than min_samples_split samples/rows.
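You can check this on a single tree. A small sketch with made-up data and an arbitrary min_samples_split of 20: after fitting, every node that was actually split should hold at least 20 samples, while leaves are allowed to be smaller.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

# No depth limit, so min_samples_split is the only thing stopping growth.
tree = DecisionTreeRegressor(max_depth=None, min_samples_split=20, random_state=0)
tree.fit(X, y)

t = tree.tree_
# Leaves have no children (children_left == -1); internal nodes were split.
is_internal = t.children_left != -1
smallest_split_node = t.n_node_samples[is_internal].min()
print(smallest_split_node)
```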

I second @jpramos’s reply. Try naming the parameters you are passing and hopefully that will resolve it.

Hey Sashank,
I think it is simplest to use categorical codes for random forests and other ensemble/tree models, and the results stay easy to explain. If we use categorical codes in linear or logistic regression, the numeric values assigned to the codes imply an order and spacing between categories that can interfere with the model. So wherever you are using decision/rule-based models it’s OK to use either categorical codes or one-hot encoded variables. But when you are using anything other than a rule/decision-based algorithm, I would suggest one-hot encoding.
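To make the difference concrete, here is a small sketch on a made-up color column. The codes put the categories on a single numeric axis (fine for tree splits, misleading for a linear model), while one-hot encoding gives each category its own 0/1 column.

```python
import pandas as pd

# Toy column, purely illustrative.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
df["color"] = df["color"].astype("category")

# Categorical codes: one integer per category (categories are sorted alphabetically here).
codes = df["color"].cat.codes
print(codes.tolist())

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)
```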


I finished my first Kaggle competition and got a surprisingly good result for the dataset the first time I ran the model.

However, when I tried to split the set into a training set and a validation set, I got worse results.

These are the results for the whole set:

[0.06640609395146409, 0.059749794949568814, 0.9724835235033371, 0.9731294015377561]

And these are the scores for the training and validation sets (I used a validation set of 43 rows, it’s roughly 0.02 percent of the whole set):

[0.06867925757638324, 0.15292615297037612, 0.9705674334979815, 0.8239775636830122]

By separating the set in a validation set and a training set, I fell well behind in the leaderboard (like in the 75%).
While using the whole set (I hope I didn’t make any mistake), it got me to the first place on the leaderboard.

Why is there so much difference in the predictions? Is it because my training set is too small to divide it up into two sets?

What’s the conclusion? Should you only divide your set when it is large enough?


I recently started ML part 1, and for that purpose created a GPU-enabled GCP instance. But I realize that the course (at least lesson 1) is only using the instance’s CPU.
What setting should I tweak to get the lesson 1 notebook to execute cells on the GPU?

Also, following Jeremy’s request at the end of the lesson 1, I went to the first kaggle competition I could find and tried to prepare the data in order to run Random forest on it, but I’m facing a big issue:
While the video describes a dataset where each row has its own target value, the dataset I’m playing with has several rows per user, and the target to estimate is the log of the sum of a column for all rows grouped by user.

Now, to continue, I can only think of 2 options:

  • recreate a new dataset with only one row per user, data being merged/averaged/etc (feels like we’ll lose information this way)
  • create a new column logTotalRevenue on each row, containing the correct target value (I can have this done although it’s an extremely slow function). But it feels like random forest cannot work this way.

Can someone give me some pointers on the proper way to apply random forests to this competition?
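For what it’s worth, both options can be done with a pandas groupby instead of a slow row-by-row function. This is only a sketch on made-up data: user_id, revenue and hits are hypothetical column names, and using np.log1p (so that users with zero revenue don’t give minus infinity) is my assumption, not something taken from the competition description.

```python
import numpy as np
import pandas as pd

# Toy visit-level data; all column names are hypothetical.
df = pd.DataFrame({
    "user_id": ["a", "a", "b", "c", "c", "c"],
    "revenue": [10.0, 0.0, 0.0, 5.0, 5.0, 20.0],
    "hits": [3, 1, 2, 4, 1, 6],
})

# Option 2: put the per-user target on every row, vectorised.
df["logTotalRevenue"] = np.log1p(df.groupby("user_id")["revenue"].transform("sum"))

# Option 1: collapse to one row per user, keeping a few summary features.
per_user = df.groupby("user_id").agg(
    hits_sum=("hits", "sum"),
    hits_mean=("hits", "mean"),
    visits=("hits", "size"),
    total_revenue=("revenue", "sum"),
)
per_user["logTotalRevenue"] = np.log1p(per_user["total_revenue"])
print(per_user)
```

With option 1 you choose which aggregates to keep (sum, mean, count and so on), so how much information is lost depends on those choices.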

What do you mean by results from whole set? If you try scoring the whole set you will have only 1 RMSE and score value.

When you have two values that means there are two data sets - train and validation.

Without the code it is difficult to know what you did, but if you are running Jeremy’s code as-is then the second result happens after you sample the data. Jeremy had taken out 30k rows for faster processing.

Knowing how to clean/create your datasets for optimum models is one of the things we need to learn by doing multiple competitions. Jeremy covers some of these things in lesson 2 and 3.

But, if you really want to start on this competition you might want to think over why Jeremy had taken log of sale price in his example? My take is - It really simplified the explanation and the python notebook.

In reality, the log price is required only when we need to calculate RMSLE of our validation set. So instead of using the RMSE function:

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

where x and y are log prices.

You can actually do:

def rmsle(x,y): return math.sqrt(((np.log1p(x)-np.log1p(y))**2).mean())

where x and y are normal (no log) prices.

Similarly in the Google competition you need log of user transaction but only during validation. So, run your rf as-is without logs. And when you need to check your predictions compare the log (sum of user transactions) in test and validation sets.
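A quick sketch to show the two functions agree, using made-up prices and predictions. Note that np.log1p is used on both sides here so the comparison is exact.

```python
import math
import numpy as np

def rmse(x, y):
    # Root mean squared error; x and y are expected to be log values already.
    return math.sqrt(((x - y) ** 2).mean())

def rmsle(x, y):
    # Root mean squared log error; x and y are raw (no log) values.
    return math.sqrt(((np.log1p(x) - np.log1p(y)) ** 2).mean())

# Toy actual prices and predictions, purely illustrative.
prices = np.array([100000.0, 250000.0, 40000.0])
preds = np.array([110000.0, 240000.0, 50000.0])

a = rmse(np.log1p(preds), np.log1p(prices))  # take logs first, then RMSE
b = rmsle(preds, prices)                     # RMSLE directly on raw values
print(a, b)
```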


Has anybody else encountered this error??

You can check out this article by Rachel on validation sets.

You mean first place when you submitted your results to Kaggle? Then congrats, well done!
If not, and you are only comparing RMSE without submitting, then you need to be careful to have a good validation set, otherwise you may have a result that is overfitting the training set.

Most likely it’s overfitting to the training set. Try hyperparameter tuning to reduce overfitting.

I just completed Lesson 2 and I am curious about the subsampling section of the discussion. Given that bootstrapping already subsamples the data why do we need a separate subsampling? Especially given that Jeremy clarifies that bootstrapping doesn’t work if we use the set_rf_samples function.

My understanding is that subsampling is just another way of getting our model to run and test faster.
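A rough numpy sketch of the difference in how many rows each tree sees (the row counts are made up, and modelling the subsample as a smaller draw with replacement is my assumption about how set_rf_samples behaves). A plain bootstrap still draws as many rows as the dataset has, so it covers roughly 63% of the distinct rows; a subsample draws far fewer, which is where the speedup comes from.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 100_000

# Plain bootstrap: draw n_rows indices with replacement from n_rows rows.
boot = rng.integers(0, n_rows, size=n_rows)
unique_frac = np.unique(boot).size / n_rows
print(unique_frac)

# Subsampling (assumed behaviour): each tree draws a much smaller sample.
sub = rng.integers(0, n_rows, size=20_000)
print(sub.size, "rows per tree instead of", boot.size)
```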

Is this the error that you are getting? ValueError: Number of features of the model must match the input. Model n_features is 250 and input n_features is 80

If so, then you need to make sure you run one hot encoding on your test dataset as well. If you recall, you ran max_n_cats=7, which created additional columns in your train dataset. Basically it’s telling you that the number of columns does not match between your train and test datasets. Check whether you dropped or added columns that are not in the test dataset.

Finally, if the above still does not fix the columns, you can use .align() to match columns between two dataframes. Example below:

`train_labels = train['TARGET']
train, test = train.align(test, join = 'inner', axis = 1)
train['TARGET'] = train_labels`
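Here is the same idea as a self-contained sketch on toy frames (the column names and values are made up). The train set has a category the test set lacks, so one hot encoding produces an extra column, and the inner align keeps only the columns both frames share.

```python
import pandas as pd

# Toy data: 'zone' has a value "C" that only appears in train.
train = pd.DataFrame({"area": [1, 2, 3], "zone": ["A", "B", "C"], "TARGET": [10, 20, 30]})
test = pd.DataFrame({"area": [4, 5], "zone": ["A", "B"]})

# One hot encode both frames; train ends up with an extra zone_C column.
train = pd.get_dummies(train)
test = pd.get_dummies(test)

# Save the target first, because the inner join drops it (test has no TARGET).
train_labels = train["TARGET"]
train, test = train.align(test, join="inner", axis=1)
train["TARGET"] = train_labels

print(train.columns.tolist())
print(test.columns.tolist())
```

Keep in mind the inner join throws away columns that exist in only one frame (zone_C here), so it is worth checking what got dropped.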

I was working on lesson2-rf-interpretation and found that there were a few things I needed to do to get ggplot() to work.

First I needed to include… from ggplot import *

After that I encountered the error when trying to run ggplot()… ImportError: cannot import name 'Timestamp'
According to this SO post, it looks like this is caused by an outdated import statement in the ggplot source code. While I don’t know if they are the best possible fixes, following their recommended changes did help me run ggplot successfully, so I just wanted to share this for others trying to run ggplot.