I’ve shared my solution on the competition forum at https://www.kaggle.com/c/favorita-grocery-sales-forecasting/discussion/47556 and my unorganized code on my GitHub: https://github.com/LenzDu/Kaggle-Competition-Favorita. Since some people here are still interested in this competition, I would like to share more details.

**Model Overview**

I built 3 models: a Gradient Boosting model, a CNN+DNN, and a seq2seq RNN. The final model was a weighted average of these models, where each model was stabilized by training multiple times with different random seeds and averaging the predictions. Each model on its own could stay in the top 1% of the private leaderboard. I used LightGBM for the Gradient Boosting model and Keras for the NN models.
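The seed-averaging and weighted-blend idea can be sketched as follows. This is a minimal illustration, not the author's actual code; `train_fn` and the blend weights are hypothetical stand-ins.

```python
import numpy as np

def seed_average(train_fn, seeds):
    # train_fn(seed) -> prediction array for one model trained with that seed;
    # averaging over seeds stabilizes the model's output
    return np.mean([train_fn(s) for s in seeds], axis=0)

def blend(pred_lgb, pred_cnn, pred_rnn, weights=(0.4, 0.3, 0.3)):
    # weighted average of the three models' predictions
    # (weights here are illustrative, not the ones used in the competition)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w[0] * pred_lgb + w[1] * pred_cnn + w[2] * pred_rnn
```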

**LGBM:** This is an upgraded version of the public kernels. To construct the training dataset, I split the raw time sequences into sliding windows of different lengths and extract features from these windows to predict the upcoming 16 values. Each LGBM model predicts only 1 future value, so I have 16 LGBM models that share the same features but not the same parameters.

**CNN+DNN:** This is a traditional NN model, where the CNN part is a dilated causal convolution inspired by WaveNet, and the DNN part is 2 FC layers connected to the raw sales sequences. Their outputs are then concatenated with categorical embeddings and future promotions, and the network directly outputs predictions for the 16 future days.
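A minimal Keras sketch of this architecture, under assumed sizes (sequence length, filter counts, layer widths, and a single categorical input are all illustrative; the actual model surely differs):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn_dnn(seq_len=60, n_cat=10, emb_dim=4, horizon=16):
    sales = keras.Input(shape=(seq_len, 1), name="sales_seq")

    # CNN branch: stacked dilated causal convolutions (WaveNet-style)
    x = sales
    for rate in (1, 2, 4, 8):
        x = layers.Conv1D(16, 2, dilation_rate=rate,
                          padding="causal", activation="relu")(x)
    cnn_out = layers.Flatten()(x)

    # DNN branch: two FC layers on the raw sales sequence
    dnn = layers.Flatten()(sales)
    dnn = layers.Dense(64, activation="relu")(dnn)
    dnn = layers.Dense(64, activation="relu")(dnn)

    # one categorical embedding plus future promotion indicators
    cat = keras.Input(shape=(1,), name="cat_id")
    emb = layers.Flatten()(layers.Embedding(n_cat, emb_dim)(cat))
    promo = keras.Input(shape=(horizon,), name="future_promo")

    # concatenate all branches and map directly to the 16 future days
    merged = layers.Concatenate()([cnn_out, dnn, emb, promo])
    out = layers.Dense(horizon)(merged)
    return keras.Model([sales, cat, promo], out)
```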

**RNN:** This is a sequence-to-sequence RNN model, where an encoder and a decoder are built for the existing and future values separately. (More details here: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). In my model, the encoder and decoder are both GRUs. The hidden states of the encoder are passed to the decoder through an FC-layer connector. This is not a standard setting, but it improved the accuracy of my model.
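The FC connector between encoder and decoder states can be sketched like this in Keras (lengths, unit counts, and input features are assumptions; the decoder here consumes only known future inputs such as promotions):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_seq2seq(enc_len=100, dec_len=16, n_feat=1, dec_feat=1, units=64):
    # encoder GRU over the historical sequence; keep its final hidden state
    enc_in = keras.Input(shape=(enc_len, n_feat))
    _, enc_state = layers.GRU(units, return_state=True)(enc_in)

    # FC connector: transform the encoder state before it seeds the decoder
    dec_init = layers.Dense(units, activation="tanh")(enc_state)

    # decoder GRU over future-known features, initialized from the connector
    dec_in = keras.Input(shape=(dec_len, dec_feat))
    dec_out = layers.GRU(units, return_sequences=True)(
        dec_in, initial_state=dec_init)
    out = layers.TimeDistributed(layers.Dense(1))(dec_out)
    return keras.Model([enc_in, dec_in], out)
```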

**Feature Engineering**

For LGB, the mean sales, count of promotions, and count of zeros are included for each time period. These features are calculated over different splits of the time periods, e.g. with/without promotion, by weekday, item/store group statistics, etc. Categorical features are included with label encoding.
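A simplified sketch of extracting such stats from the tail of a series (window lengths and feature names are hypothetical, and only one example of a "split" stat, mean sales under promotion, is shown):

```python
import numpy as np
import pandas as pd

def window_features(sales, promo, windows=(7, 14, 28, 56)):
    # sales and promo are 1-D arrays aligned in time, most recent day last
    feats = {}
    for w in windows:
        s, p = sales[-w:], promo[-w:]
        feats[f"mean_{w}"] = s.mean()                 # mean sales
        feats[f"promo_count_{w}"] = int(p.sum())      # count of promotions
        feats[f"zero_count_{w}"] = int((s == 0).sum())  # count of zero-sale days
        # example split stat: mean sales on promotion days only
        feats[f"mean_promo_{w}"] = s[p == 1].mean() if p.any() else 0.0
    return pd.Series(feats)
```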

For NN, the item mean and year-ago/quarter-ago sales are fed as sequence inputs. Categorical features and time features (weekday, day of month) are fed as embeddings.
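Aligning lag sequences with the current window might look like the following (a hypothetical helper; the lag offsets of 365 and 91 days and the channel layout are assumptions):

```python
import numpy as np

def lag_sequences(sales, seq_len=30):
    # stack the current window with the same-length windows from
    # one year ago and one quarter ago as channels of shape (seq_len, 3)
    cur = sales[-seq_len:]
    year_ago = sales[-seq_len - 365:-365]
    quarter_ago = sales[-seq_len - 91:-91]
    return np.stack([cur, year_ago, quarter_ago], axis=-1)
```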

**Training and Validation**

For training and validation, only store-item combinations that appear in the test data and have at least one record during 2017 in the training data are included. For both LGB and NN, I use multiple time stamps to construct my dataset. The validation period is 2017.7.26~2017.8.10, and training periods are sampled randomly from 2017.1.1~2017.7.5. However, the time sequences used for constructing features extend much earlier; e.g., for a training period of 2017.1.1~2017.1.16, data from 2016 are used to construct the features.
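The random sampling of 16-day target periods inside that range can be sketched as below (the exact sampling scheme is an assumption; only the date ranges come from the post):

```python
import numpy as np
from datetime import date, timedelta

# fixed validation period from the post: 2017.7.26 ~ 2017.8.10
VALID_PERIOD = (date(2017, 7, 26), date(2017, 8, 10))

def sample_training_periods(n_periods, horizon=16, seed=0):
    # pick random 16-day target periods fully inside 2017.1.1 ~ 2017.7.5;
    # features for each period would be built from data before its start
    rng = np.random.default_rng(seed)
    start, end = date(2017, 1, 1), date(2017, 7, 5)
    last_start = (end - start).days - (horizon - 1)
    offsets = rng.choice(last_start + 1, size=n_periods, replace=False)
    return [(start + timedelta(int(o)),
             start + timedelta(int(o) + horizon - 1)) for o in offsets]
```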