5th Place Solution of Favorita Grocery Sales Forecasting

I’ve shared my solution on the competition forum at https://www.kaggle.com/c/favorita-grocery-sales-forecasting/discussion/47556 and my (still unorganized) code on GitHub at https://github.com/LenzDu/Kaggle-Competition-Favorita. I think some people here are still interested in this competition, so I would like to share more details here.

Model Overview

I built 3 models: a gradient boosting model, a CNN+DNN and a seq2seq RNN. The final model was a weighted average of these models, where each model is stabilized by training it multiple times with different random seeds and averaging the runs. Each model on its own would stay in the top 1% of the private leaderboard. I used LightGBM for the gradient boosting model and Keras for the NN models.
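A minimal sketch of the two-level averaging described above: first average the runs of each model over random seeds, then take a weighted average across models. The function names, the toy predictions and the weights are assumptions for illustration only; the actual blend weights are not given here.

```python
import numpy as np

def seed_average(preds_per_seed):
    """Average the predictions of one model trained with different random seeds."""
    return np.mean(np.stack(preds_per_seed, axis=0), axis=0)

def weighted_blend(model_preds, weights):
    """Weighted average across models; weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(model_preds, axis=0)  # (n_models, n_series, horizon)
    return np.tensordot(w, stacked, axes=1)  # (n_series, horizon)

# Toy predictions for 2 series and 16 future days (values are made up).
lgb_runs = [np.full((2, 16), 1.0), np.full((2, 16), 3.0)]  # two seeds
cnn_runs = [np.full((2, 16), 4.0)]
rnn_runs = [np.full((2, 16), 6.0)]

per_model = [seed_average(runs) for runs in (lgb_runs, cnn_runs, rnn_runs)]
blend = weighted_blend(per_model, weights=[0.5, 0.25, 0.25])  # hypothetical weights
```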

LGBM: It is an upgraded version of the public kernels. To construct the training dataset, I split the raw time sequences into sliding time windows of different lengths, and extract features from these windows to predict the upcoming 16 values. Each LGBM model predicts only 1 future value, so I have 16 LGBM models that share the same features but not parameters.
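A rough sketch of this setup, assuming a single daily sales series and using only window means as features. The window lengths are hypothetical, and `fit_linear` is a plain least-squares stand-in for a LightGBM regressor so that the example stays dependency-free. The point is the shape of the problem: one shared feature matrix and one separately fitted model per future day.

```python
import numpy as np

HORIZON = 16  # number of future days to predict

def make_window_sample(sales, t, window_lengths=(3, 7, 14, 28)):
    """Features from windows ending just before index t, targets from t onward."""
    features = np.array([sales[t - w:t].mean() for w in window_lengths])
    targets = sales[t:t + HORIZON]
    return features, targets

def fit_linear(X, y):
    """Stand-in for a LightGBM regressor: least squares with a bias term."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

sales = np.arange(60, dtype=float)  # toy series for one store-item pair
X_row, y_row = make_window_sample(sales, t=40)

# Slide the forecast origin t over the series to build the training set.
starts = range(28, len(sales) - HORIZON + 1)
samples = [make_window_sample(sales, t) for t in starts]
X = np.stack([f for f, _ in samples])  # shared features, shape (n_samples, 4)
Y = np.stack([y for _, y in samples])  # targets, shape (n_samples, 16)

# One model per future day: same features, separate parameters.
models = [fit_linear(X, Y[:, k]) for k in range(HORIZON)]
```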

CNN+DNN: This is a traditional NN model, where the CNN part is a dilated causal convolution inspired by WaveNet, and the DNN part is 2 FC layers connected to the raw sales sequences. These are then concatenated with categorical embeddings and future promotions, and the network directly outputs predictions for the 16 future days.
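For readers new to the term, here is a small NumPy sketch of what a single dilated causal convolution computes. It is not the Keras layer used in the model, only an illustration of the operation: each output depends on the current and earlier inputs only, spaced `dilation` steps apart.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before the start of x."""
    pad = (len(kernel) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])  # left padding keeps the filter causal
    y = np.zeros(len(x))
    for k, w in enumerate(kernel):
        shift = pad - k * dilation
        y += w * padded[shift:shift + len(x)]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = dilated_causal_conv1d(x, kernel=[1.0, 1.0], dilation=2)  # y[t] = x[t] + x[t - 2]
```

Stacking such layers with kernel size 2 and dilations 1, 2, 4, 8 gives a receptive field of 16 time steps, which is why this design covers long histories with few layers.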

RNN: This is a sequence-to-sequence RNN model, where an encoder and a decoder are built for the existing and future values respectively (more details here: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). In my model, the encoder and decoder are both GRUs. The hidden states of the encoder are passed to the decoder through an FC layer connector. This is not a common setup, but it improved the accuracy of my model.

Feature Engineering

For LGB, the mean sales, count of promotions and count of zeros are included for each time period. These features are calculated over different splits of the time periods, e.g. with/without promotion, each weekday, item/store group statistics, etc. Categorical features are included with label encoding.
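A hedged sketch of these window statistics for one series, assuming aligned NumPy arrays of daily sales, promotion flags and weekday indices. The feature names and the choice of 0.0 when a split is empty are assumptions for illustration, not the exact features in the repo.

```python
import numpy as np

def window_features(sales, promo, weekday, t, w):
    """Statistics over the w days ending just before index t."""
    s = sales[t - w:t]
    p = promo[t - w:t].astype(bool)
    d = weekday[t - w:t]
    feats = {
        f"mean_{w}": s.mean(),
        f"promo_count_{w}": int(p.sum()),
        f"zero_count_{w}": int((s == 0).sum()),
        # Same statistic, split by promotion status (0.0 if the split is empty).
        f"mean_promo_{w}": s[p].mean() if p.any() else 0.0,
        f"mean_no_promo_{w}": s[~p].mean() if (~p).any() else 0.0,
    }
    # Same statistic, split by day of week.
    for wd in range(7):
        mask = d == wd
        feats[f"mean_wd{wd}_{w}"] = s[mask].mean() if mask.any() else 0.0
    return feats

sales = np.array([0.0, 2.0, 4.0, 0.0, 6.0, 8.0, 10.0])
promo = np.array([0, 1, 1, 0, 0, 1, 0])
weekday = np.arange(7)
feats = window_features(sales, promo, weekday, t=7, w=7)
```

Group-level features (item or store aggregates) would apply the same function to sales summed over the group.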

For NN, item mean and year-ago/quarter-ago sales are fed as sequence inputs. Categorical features and time features (weekday, day of month) are fed as embeddings.

Training and Validation

For training and validation, only store-item combinations that appear in the test data and have at least one record during 2017 in the training data are included. For both LGB and NN, I use multiple time stamps to construct my dataset. The validation period is 2017.7.26~2017.8.10, and training periods are collected randomly during 2017.1.1~2017.7.5. However, the time sequences used for constructing features extend much earlier. E.g. for a training period 2017.1.1~2017.1.16, data from 2016 are used to construct the features.
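A small sketch of how such training periods could be drawn, under the assumption that the first day of each 16-day target period is sampled uniformly from 2017.1.1~2017.7.5. The function names and the sampling scheme are illustrative; the original code may pick the dates differently.

```python
import random
from datetime import date, timedelta

HORIZON = 16
TRAIN_RANGE = (date(2017, 1, 1), date(2017, 7, 5))
VALID_START = date(2017, 7, 26)

def target_period(start):
    """First and last day of the 16-day period to predict."""
    return start, start + timedelta(days=HORIZON - 1)

def sample_train_starts(n, seed=0):
    """Draw n start dates uniformly from TRAIN_RANGE (assumed sampling scheme)."""
    rng = random.Random(seed)
    span = (TRAIN_RANGE[1] - TRAIN_RANGE[0]).days
    return sorted(TRAIN_RANGE[0] + timedelta(days=rng.randint(0, span)) for _ in range(n))

train_starts = sample_train_starts(8)
train_periods = [target_period(s) for s in train_starts]
valid_period = target_period(VALID_START)
```

With start dates no later than 2017.7.5, every training target period ends by 2017.7.20, so none of them overlaps the validation period.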


@Lingzhi - this is terrifically cool of you to post this :slight_smile:

Could you please tell me why and how you chose this period for the validation set?

I was thinking that the code in the repo might not be that easy to follow and also in general, for people relatively new to machine learning, it might not be easy for us to pinpoint exactly what you did based on your description.

Would you be okay if I were to try to reproduce your solution but in a less complex version (using a smaller number of features, etc)? I would first like to try with the CNN+DNN part. Could I ask you an occasional question along the way? No worries, I will not bother you too much, no hand holding necessary :slight_smile: Just to check that I am on the right track.

Maybe - if you wouldn’t mind - once I recreate your model I could write a blog post detailing what I learned? I think this could be quite useful. Of course, will attribute anything of value to you there - will just contribute my perspective on how someone who has not used a model such as this before can get to an up and running state :slight_smile:

What do you think?


That would be pretty great!

Of course! My code is still quite messy since I haven't organized it yet, but feel free to fork/simplify it! Looking forward to your simplified version.

For the validation, there is a strong weekly seasonality in the dataset, so the first day of the validation period should be a Wednesday, as it is in the test period. 7.26-8.10 is the period closest to the test data that starts on a Wednesday.
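This choice can be checked with a few lines of date arithmetic. The sketch below assumes the test period starts on 2017-08-16 and that the last day with known sales is 2017-08-15 (both taken from the competition data, not stated in this thread); the function name is hypothetical.

```python
from datetime import date, timedelta

HORIZON = 16
TEST_START = date(2017, 8, 16)      # assumed first day of the test period
LAST_TRAIN_DAY = date(2017, 8, 15)  # assumed last day with known sales

def latest_validation_start(test_start, last_train_day, horizon=HORIZON):
    """Step back week by week until the whole 16-day window has known sales."""
    start = test_start - timedelta(days=7)
    while start + timedelta(days=horizon - 1) > last_train_day:
        start -= timedelta(days=7)
    return start

valid_start = latest_validation_start(TEST_START, LAST_TRAIN_DAY)
valid_end = valid_start + timedelta(days=HORIZON - 1)
same_weekday = valid_start.weekday() == TEST_START.weekday()  # Monday is 0, Wednesday is 2
```

Stepping back in whole weeks keeps the weekday aligned with the test period, so each of the 16 forecast steps falls on the same day of the week in validation and test.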


@Lingzhi - wanted to say again I really appreciate you sharing your solution :slight_smile: Having a great time reading your code :slight_smile:

I am not sure I understood it correctly - it didn’t seem to make a lot of difference to your results whether you infer the missing onpromotion items or whether you set all the NAs to False?

Basically I just replaced all the missing promotion values with False. This makes the training set biased. This problem was discussed a lot on the Kaggle forum. I also tried some tricky methods to remedy this, which are detailed in my post in the competition forum.

This bias problem is due to poor data preparation IMO. It’s quite tricky and difficult to handle (that’s why I didn’t mention it here). I think you can just replace the NAs with False to begin with.
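For the simple approach, a minimal pandas sketch is below. The toy frame is made up, `unit_sales` is an illustrative column name, and the nullable `boolean` dtype is an assumption about how the column was loaded.

```python
import pandas as pd

# Toy frame with a missing promotion flag (values are made up).
df = pd.DataFrame({
    "unit_sales": [3.0, 0.0, 5.0],
    "onpromotion": pd.array([True, None, False], dtype="boolean"),
})

# Treat every missing flag as "not on promotion".
df["onpromotion"] = df["onpromotion"].fillna(False).astype(bool)
```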

Terrific Lingzhi - thank you :slight_smile: Knowing this I can now peacefully continue!

@Lingzhi Congratulations! That’s quite the achievement! I’d love to hear a little more about the seq2seq model applied in this way. I’ve seen it for translation as you share in your example but I’ve never seen it used for time series prediction or for regression. It’s quite an interesting application.


Ya, seq2seq models have proved to be very powerful for time series problems in recent Kaggle competitions. One of the most successful examples is Arthur’s 1st place solution for web traffic forecasting here: https://www.kaggle.com/c/web-traffic-time-series-forecasting/discussion/43795

Basically the structure I used here is quite similar to the seq2seq models applied to translation problems. It has an encoder RNN for the input sequence and a decoder RNN for the output sequence, and they are connected by the hidden state. I think seq2seq models may be useful for many different problems whose input and output are both sequences.


Once I get around to recreating a simplified version of the DNN+CNN model I will work on the seq2seq :slight_smile:

I have some other things I need to work on but in order to not over promise and under deliver (sort of prefer it the other way around :wink: ) I will have a simplified pytorch version of the DNN+CNN model ready within the next 2 weeks (along with fully runnable jupyter notebook, some explanation, etc).

Hi Radek, Have you done it? If yes, can you please share?

Thank you


Apologies, I unfortunately have not completed this.