Lesson 4 Rossmann dataset training concept issue

Arindam · January 8, 2019, 7:21am

Hi, I have a question about the Rossmann part of this lesson. In the last step of the Jupyter notebook, I see that there are two attempts to fit the model. The first attempt, in the “Sample” section, gets an rmspe of around 0.19 after the first epoch.

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

m.fit(lr, 3, metrics=[exp_rmspe])

[ 0. 0.02479 0.02205 0.19309 ]
[ 1. 0.02044 0.01751 0.18301]
[ 2. 0.01598 0.01571 0.17248]

Then, in the “All” section, similar code is run again, but the rmspe is much lower, around 0.11.

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   0.04, 1, [1000,500], [0.001,0.01], y_range=y_range)
lr = 1e-3

m.fit(lr, 1, metrics=[exp_rmspe])

[ 0. 0.01456 0.01544 0.1148 ]

So I am wondering what causes this. Is it because the model continues training after the Sample section? That seems impossible, since the m variable is reassigned in the All section. Or is it just that the new model got a luckier random initialization, so it fits the data better after the first epoch of training?

Any help on this topic would be very much appreciated.

Thank you.

Buddhi · January 8, 2019, 8:57am

Hey mate :).

Those two sections are meant to be run separately. The Sample section only uses a subset of the training data, hence the worse (higher) rmspe after 1 epoch, while the ‘All’ section uses all the data for training (excluding the validation set, obviously).
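For reference, here is a rough sketch of what a metric like exp_rmspe measures. It assumes the model is trained on the log of sales, so predictions and targets are exponentiated before the percentage error is taken. The function name exp_rmspe_sketch is made up for this post and this is not necessarily the notebook's exact code:

```python
import math

def exp_rmspe_sketch(log_preds, log_targets):
    """Root mean squared percentage error, computed after undoing a log transform.

    Assumption: both inputs are on a log scale, so math.exp recovers the
    original values before the percentage error is taken.
    """
    sq_pct_errors = []
    for p, t in zip(log_preds, log_targets):
        actual = math.exp(t)
        predicted = math.exp(p)
        sq_pct_errors.append(((actual - predicted) / actual) ** 2)
    return math.sqrt(sum(sq_pct_errors) / len(sq_pct_errors))

# Every prediction is 10% too high, so the result is close to 0.1
preds = [math.log(110.0), math.log(220.0)]
targs = [math.log(100.0), math.log(200.0)]
print(exp_rmspe_sketch(preds, targs))
```

Lower is better here, so 0.11 means the predictions are off by roughly 11% on average, against roughly 19% for the sample run.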

Arindam · January 8, 2019, 10:02am

@Buddhi The columns variable is a subset of all the columns.

columns = ["Date", "Store", "Promo", "StateHoliday", "SchoolHoliday"]

But in this case the code seems almost the same, so I am unable to understand how we are taking all of the training data into account rather than the subset, as the variable ‘columns’ is not altered between the two training runs.

Thank you for your time.

Buddhi · January 8, 2019, 10:39am

Hey, not sure if I’m understanding you correctly. When taking a sample, what we do is use only a fraction of the rows of the entire data set, while keeping all the features (columns). So if you go through the code, you come across the place where Jeremy creates a subset:

idxs = get_cv_idxs(n, val_pct=150000/n)
joined_samp = joined.iloc[idxs].set_index("Date")
samp_size = len(joined_samp); samp_size

So this code pretty much says: create a new data set containing 150,000 randomly chosen rows from the entire data set, which is called “joined”.

When you run the Sample section, you use this data set. When running the All section, you have to skip this part of the code as you want to use the entire data set.
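To make the Sample vs. All difference concrete, here is a minimal sketch using only the standard library. The helper names sample_row_indices and choose_training_rows and the use_sample flag are made up for illustration; this is not how get_cv_idxs is implemented in fastai, just the same idea of picking random rows while leaving the columns alone:

```python
import random

def sample_row_indices(n_rows, n_sample, seed=42):
    """Return n_sample distinct row positions chosen at random from range(n_rows)."""
    rng = random.Random(seed)  # fixed seed so the same rows are picked every run
    return sorted(rng.sample(range(n_rows), n_sample))

def choose_training_rows(rows, use_sample, n_sample=150_000, seed=42):
    """Return a random subset of rows (Sample) or every row (All)."""
    if not use_sample:
        return rows
    n_sample = min(n_sample, len(rows))  # cannot sample more rows than exist
    idxs = sample_row_indices(len(rows), n_sample, seed)
    return [rows[i] for i in idxs]

rows = list(range(1000))  # stand-in for the rows of `joined`
sample_rows = choose_training_rows(rows, use_sample=True, n_sample=150)
all_rows = choose_training_rows(rows, use_sample=False)
print(len(sample_rows), len(all_rows))  # 150 1000
```

The model-building code that follows is identical in both cases. The only thing that changes is how many rows it gets to see, which is why the two sections of the notebook look almost the same.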

In general practice, you usually use a subset of the data while developing your model. In this phase you’ll be tweaking your hyper-parameters and constantly retraining the model, and you wouldn’t want to do that with the entire data set, as training takes longer with more data.

Once you’re satisfied with the model you’ve created, go ahead and create a new model using the parameters you picked previously and fit it to the entire data set. This is what Jeremy has done in the code.

Hope that helps

Arindam · January 8, 2019, 12:26pm

Okay, now I get it. I didn’t know that we had to skip that code to train on the full data.
Thank you.