Rossmann 2.0: Trying to Improve Results

I’ve seen lots of scattered discussions on the Rossmann notebook, mostly debugging and learning the basics.

I wanted to open a thread to see what people have tried in terms of improving on the result.

The first obvious step is correcting the mistake Jeremy highlighted, where they dropped all rows where sales = 0.
This isn’t 100% straightforward, because leaving those rows in the dataframe means you start running into divide-by-zero errors in the evaluation function as written.
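
One way around the divide-by-zero problem is to keep the zero-sales rows for training but leave them out of the percentage-error metric, which is also how the Kaggle competition scored submissions. Here is a minimal sketch of that idea. The `rmspe` function and the toy numbers are made up for illustration and are not the notebook's own evaluation function.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error over rows where y_true != 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0  # a percentage error is undefined when the target is 0
    if not mask.any():
        raise ValueError("no non-zero targets to score")
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

# Toy data: the first row has zero sales and is skipped by the metric.
sales = np.array([0.0, 100.0, 200.0])
preds = np.array([5.0, 110.0, 180.0])
score = rmspe(sales, preds)
print(score)
```

Note that a log-transformed target has the same problem, since log(0) is undefined, so the zero rows need handling there too.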

Another step, closely related to the first but one I haven’t seen other people talk about, is building before-and-after counter features for the event of a store being open or closed. The EDA on this dataset shows this is important, so making it easier to see should help the model.

I also wonder why the authors chose to use so few of their features. Before the cat_vars and contin_vars lists are created there are 95 features; the lists then reduce that to 38. Why not keep them all and try L2 regularization or other approaches to prevent overfitting?
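
The before-and-after counters for store closures could be sketched roughly like this with pandas. This is only an illustration under assumptions: `elapsed_days` is a hypothetical helper, the column names `Store`, `Date` and `Open` follow the Kaggle data, and the `Closed`, `AfterClosed` and `BeforeClosed` columns are names I made up.

```python
import numpy as np
import pandas as pd

def elapsed_days(df, flag_col, direction="since", store_col="Store", date_col="Date"):
    """Days since the last (or until the next) row where flag_col is true, per store.

    Rows with no earlier (or later) event in their store get NaN.
    """
    df = df.sort_values([store_col, date_col])
    # Keep the date on event rows only, NaT everywhere else.
    event_date = df[date_col].where(df[flag_col].astype(bool))
    grouped = event_date.groupby(df[store_col])
    if direction == "since":
        return (df[date_col] - grouped.ffill()).dt.days
    return (grouped.bfill() - df[date_col]).dt.days

dates = pd.date_range("2015-01-01", periods=5, freq="D")
df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1, 2, 2],
    "Date": list(dates) + list(dates[:2]),
    "Open": [1, 0, 1, 1, 0, 1, 1],
})
df["Closed"] = df["Open"] == 0
df["AfterClosed"] = elapsed_days(df, "Closed", "since")
df["BeforeClosed"] = elapsed_days(df, "Closed", "until")
print(df)
```

Store 2 never closes in this toy frame, so both counters stay NaN for it; in practice those gaps would need a fill value before going into the model.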

Once the fix to the sales = 0 issue has been implemented, the next best step seems to be following the ML1 approach: run a random forest as quickly as possible, do EDA via the random forest’s feature importance, and see whether there is additional value-added feature engineering to be done. Additionally, as noted in the videos, passing the categorical embeddings to the random forest seems to increase its performance, so more EDA via feature importance is warranted once the embeddings have been added to the data.
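
As a rough illustration of the feature-importance step, here is a small scikit-learn sketch on synthetic data. The feature names and the data are invented, and the hyperparameters are arbitrary; on the real dataframe `X` would be the engineered features (and later the embedding columns) and `y` the log of sales.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
# Synthetic target: only the first two columns carry signal.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
names = ["signal_strong", "signal_weak", "noise_a", "noise_b"]  # made-up names

rf = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, random_state=0)
rf.fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
ranked = sorted(zip(names, rf.feature_importances_), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name:15s} {importance:.3f}")
```

Features that land near the bottom of the ranking are candidates to drop, and the ones near the top are where extra feature engineering is most likely to pay off.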

At this point, I was thinking of trying to implement resnet blocks or other architecture changes so that we could build a deeper network.

Any other ideas? I’d love to team up with someone on this. For better or worse, I’m teaching myself to program because of my interest in machine learning, so I can get much further ahead in ideas and theories than my programming skills allow me to implement (at the moment).


As a side note, I continue to get a strange error when training the sample size network. I get the error and then training continues. This happens every time I run the notebook, haven’t been able to google an answer yet:

Hi Will,

I’m interested in working more on this data set. I just got the bugs I was hitting sorted out.

Did you get anywhere with these improvements?


Maybe this thread shouldn’t be in the 2017 forum?

I’m not working on this data set at the moment, but here are a couple ideas:

  • using denoising autoencoders (DAE), inspired by the winner of the Porto Seguro Kaggle competition. Note that I haven’t seen anyone replicate his solution so far (I am currently trying and haven’t managed it yet), so this is not going to be straightforward.

  • creating another model using LGBM. Tree-based algorithms are currently beating NNs in all the recent tabular data competitions. It would be nice to have a comparison on the Rossmann data.
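
For the denoising autoencoder idea, the input corruption used in the Porto Seguro solution is usually described as "swap noise": a fraction of cells is replaced by values taken from the same column in other rows. Below is a minimal NumPy sketch of just that corruption step. `swap_noise` is a hypothetical helper and the 0.15 rate is only a starting assumption; the autoencoder that learns to reconstruct the clean rows is not shown.

```python
import numpy as np

def swap_noise(X, p=0.15, rng=None):
    """Return a copy of X where each cell is, with probability p,
    replaced by the value from the same column of a randomly chosen row."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    swap = rng.random(X.shape) < p                       # cells to corrupt
    donor_rows = rng.integers(0, n_rows, size=X.shape)   # donor row per cell
    donors = X[donor_rows, np.arange(n_cols)]            # same column, donor row
    return np.where(swap, donors, X)

X = np.arange(20).reshape(5, 4)
noisy = swap_noise(X, p=0.15, rng=np.random.default_rng(0))
print(noisy)
```

Because values only move within a column, each corrupted column keeps its original marginal distribution, which is what makes this kind of noise a reasonable fit for mixed tabular features.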



I found that using yl = yl/max_log_y, as in the original paper, seems to give a better RMSPE without the accompanying overfitting.

max_log_y = np.max(yl)
#y_range = (0, max_log_y*1.2) #don’t use this line
yl = yl/max_log_y #use this ratio instead

But md.get_learner will need to be modified

m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
0.02, 1, [1000,500], [0.001,0.01]) #remove y_range

as well as pred_test

pred_test = pred_test*max_log_y #need to multiply it back
pred_test = np.exp(pred_test)
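
To check that the scaling and its inverse line up, here is a small self-contained round trip on made-up sales figures. It stands in a perfect "model" for the network, so the only thing it verifies is that multiplying by max_log_y and then exponentiating recovers the original sales.

```python
import numpy as np

sales = np.array([1200.0, 5300.0, 8700.0])  # made-up, non-zero sales
yl = np.log(sales)
max_log_y = np.max(yl)
yl_scaled = yl / max_log_y                   # training target, at most 1.0

# Pretend the model predicts the scaled target perfectly.
pred_test = yl_scaled.copy()
pred_test = pred_test * max_log_y            # undo the scaling
pred_test = np.exp(pred_test)                # undo the log
print(pred_test)
```

One thing to watch, assuming exp_rmspe simply exponentiates its inputs: applied to the scaled values it compares exp(yl/max_log_y) rather than actual sales, so its numbers are not directly comparable with a run on the unscaled yl.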

Here’s the training result

On sampled data:
…, 5, metrics=[exp_rmspe], cycle_len=1)
epoch trn_loss val_loss exp_rmspe
0 0.000337 0.000216 0.014604
1 0.000289 0.000188 0.013627
2 0.000252 0.000169 0.01291
3 0.000225 0.000155 0.012359
4 0.000213 0.000139 0.011731

…, 1, metrics=[exp_rmspe], cycle_len=8)
epoch trn_loss val_loss exp_rmspe
0 0.000276 0.000285 0.016703
1 0.000259 0.000164 0.012773
2 0.000215 0.000145 0.011999
3 0.000189 0.000169 0.013083
4 0.000162 0.000132 0.01143
5 0.000145 0.000124 0.011047
6 0.000134 0.00012 0.010845
7 0.000128 0.000109 0.010369

lr = lr/10
…, 1, metrics=[exp_rmspe], cycle_len=4)
epoch trn_loss val_loss exp_rmspe
0 0.000134 0.000115 0.010655
1 0.000125 0.000103 0.010107
2 0.000125 0.000107 0.010284
3 0.000121 0.000108 0.0103

If you train it on the full data the result is even better:
…, 3, metrics=[exp_rmspe])
epoch trn_loss val_loss exp_rmspe
0 0.025221 0.026524 0.171249
1 0.011784 0.016318 0.12995
2 0.008176 0.011945 0.109922

though there’s some overfitting here:
…, 2, metrics=[exp_rmspe], cycle_len=4, cycle_mult=2)
epoch trn_loss val_loss exp_rmspe
0 0.006899 0.010108 0.100926
1 0.00597 0.008739 0.093547
2 0.005956 0.008498 0.092325
3 0.005774 0.008442 0.092023
4 0.005096 0.007497 0.086524
5 0.004815 0.006881 0.082773
6 0.004539 0.006469 0.080078
7 0.004369 0.006257 0.078753
8 0.004157 0.006165 0.078192
9 0.004057 0.0061 0.077783
10 0.004095 0.006074 0.077606
11 0.004098 0.006064 0.077537
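
The number of rows in each table follows from the cycle settings. Assuming the first visible number in each call is the number of cycles, as in the old fastai fit API, and that each cycle is cycle_mult times longer than the previous one, the epoch counts can be checked with this small sketch (`total_epochs` is a helper I made up for the arithmetic):

```python
def total_epochs(n_cycle, cycle_len=1, cycle_mult=1):
    """Total epochs when cycle i lasts cycle_len * cycle_mult**i epochs."""
    return sum(cycle_len * cycle_mult ** i for i in range(n_cycle))

runs = {
    "sample: 5 cycles, cycle_len=1": total_epochs(5, cycle_len=1),
    "sample: 1 cycle, cycle_len=8": total_epochs(1, cycle_len=8),
    "sample: 1 cycle, cycle_len=4": total_epochs(1, cycle_len=4),
    "full: 3 cycles": total_epochs(3),
    "full: 2 cycles, cycle_len=4, cycle_mult=2": total_epochs(2, cycle_len=4, cycle_mult=2),
}
for name, epochs in runs.items():
    print(f"{name}: {epochs} epochs")
```

The last run is 4 + 8 = 12 epochs, which matches the twelve rows (0 to 11) in the final table.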


awesome work