Corporación Favorita Grocery Sales Forecasting

All,

I’ve refactored much of the winning model and spent the last few days analyzing it. Here is the notebook if you want to review it.

I ended up with 403 features, and unfortunately I’m still not able to get great results from training the model.

So I built a Random Forest model to take a deeper look at the relative ranking of the features. Not surprising to anyone who has spent time with this data … only averages of the recent sales data seem to predict very much.

Here are the top 10:

	rank	feature	score
0	sum_14_before	0.454851
1	mean_14_before	0.314681
2	mean_30_before	0.061810
3	sum_30_before	0.054516
4	mean_40_before	0.029909
5	sum_40_before	0.028628
6	promo_14_after	0.008888
7	store_class_dow	0.005143
8	dow	0.005110
9	item_dow	0.004356
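For reference, here is a minimal sketch of how a ranking like the one above can be produced, assuming `rf` is the fitted RandomForestRegressor and `features` is the list of feature column names (both names are placeholders, not the exact notebook code):

```python
import pandas as pd

# Rank features by the forest's impurity-based importances.
# Assumes `rf` is an already-fitted RandomForestRegressor and
# `features` is the list of feature column names (placeholder names).
fi = pd.DataFrame({'feature': features, 'score': rf.feature_importances_})
fi = fi.sort_values('score', ascending=False).reset_index(drop=True)
print(fi.head(10))
```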

I’ve learned a ton working through this dataset … but good results still elude me. Hopefully the work is helpful to others.


Just in case you missed it … look how simple it was to create the RF after first building the NN.

It literally took only two lines of code, and I was able to use all of the same feature and target dataframes.
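If you want to reproduce that step, this is roughly what those two lines look like, assuming `X_train` and `y_train` are the same feature and target dataframes used for the NN (the names and hyperparameters here are placeholders):

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the same feature/target dataframes used for the NN
rf = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
rf.fit(X_train, y_train)
```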

In case anyone is still interested in this competition (I am), I have published a clean notebook with my own work so far: https://github.com/jonas-pettersson/fast-ai/blob/master/Exploration%20and%20Prediction%20for%20Structured%20Data.ipynb
I am still not anywhere near a good result (my best score was 0.614), but I think the notebook can be of help to a newcomer. I am of course also very grateful for any feedback.
You can read my conclusions at the end of the notebook, but here is the short version: it is not sufficient to throw this problem at a deep neural network and hope for the best. I started this exercise without looking at any forums or kernels, just to see how far I would get on my own, based on the Rossmann example from the DL course and everything I learned from the ML course.
Not very far, it turned out. First I had to fix the issue of zero-sales days being missing from the training data. After that, only when I added a “moving average” feature, as suggested by @kevindewalt, did things start to go in the right direction.
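For newcomers, here is roughly what those two fixes can look like. This is only a sketch under assumed column names (`date`, `store_nbr`, `item_nbr`, `unit_sales`) and a 14-day window, not the exact code from my notebook:

```python
import pandas as pd

# Assumes df has columns: date (datetime64), store_nbr, item_nbr, unit_sales.

# 1) Restore the missing zero-sales rows: the raw training data only contains
#    days on which an item actually sold, so reindex every (store, item) pair
#    onto the full date range and fill the gaps with 0.
all_dates = pd.date_range(df['date'].min(), df['date'].max())
full_index = pd.MultiIndex.from_product(
    [df['store_nbr'].unique(), df['item_nbr'].unique(), all_dates],
    names=['store_nbr', 'item_nbr', 'date'])
df = (df.set_index(['store_nbr', 'item_nbr', 'date'])
        .reindex(full_index, fill_value=0)
        .reset_index())

# 2) Add a simple moving-average feature: mean sales over the previous
#    14 days for each (store, item) series, shifted so today is not included.
df = df.sort_values(['store_nbr', 'item_nbr', 'date'])
df['mean_14_before'] = (df.groupby(['store_nbr', 'item_nbr'])['unit_sales']
                          .transform(lambda s: s.shift(1).rolling(14).mean()))
```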
Anyway, I learned a lot, not only about the practical use of the fast.ai library but also about not giving up in the face of frustrating setbacks. Kaggle competitions are a great way to learn because you get feedback via your scores and you can learn from others.
If I get some more time I will continue going through kernels and trying to find out what I can improve, probably by adding more (“engineered”) features. I might also come back once I have understood LSTMs more thoroughly, as many seem to use them. Even though the dataset feels very hard, I think it is a good learning example because it is close to reality, with all the problems that come with that.
Or I might look for some “living” competition with structured data instead…
