Corporación Favorita Grocery Sales Forecasting

I’m curious about this too.

Errr… I’m pretty sure that if you asked in the comments section on Kaggle, or on KaggleNoobs, someone would find the original culprit and get you an answer.
This is my humble opinion but I agree with it. :innocent:

Looking at the Kaggle discussion boards, it doesn’t seem like many of the teams understand the basics of why it works. My hypothesis:

They found an additional feature, and the simple FFNN designs I was using will give similar results.

Will let you know after I dig into it.

1 Like

Changing his original code as he suggested in https://www.kaggle.com/shixw125/1st-place-nn-model-public-0-507-private-0-513/code#269624

This kernel is based on senkin13’s kernel: https://www.kaggle.com/senkin13/lstm-starter. You can replace model.add(LSTM(512, input_shape=(X_train.shape[1],X_train.shape[2]))) with model.add(Dense(512, input_dim=X_train.shape[1])); I think there is no difference.

Generates the following error (I didn’t try to investigate; I just pasted it in and ran the code).

1 Like

If you comment out the lines below:
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))

and
change input_shape to (X_train.shape[0], X_train.shape[1]); that should work, I think.
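
If it helps, here is a minimal, self-contained sketch of the Dense version along the lines of the quoted kernel comment (toy random data, not the kernel’s actual arrays, loss, or training setup):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# keep the arrays 2-D, i.e. skip the reshape to (samples, 1, n_features)
X_train = np.random.rand(1000, 37).astype('float32')
y_train = np.random.rand(1000, 1).astype('float32')

model = Sequential()
model.add(Dense(512, activation='relu', input_dim=X_train.shape[1]))   # in place of the LSTM layer
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, y_train, batch_size=512, epochs=1, verbose=0)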

Thanks a lot for your help @s.s.o, but it didn’t work (not your fault, just my noobish debugging skills).
It’s pretty late in Stockholm now, maybe 02:00, so I’ll give it a try again tomorrow.
In any case, the basic Jupyter Notebook should be working fine, so I’ll try to post it on GitHub. :+1:

Here’s an edited Jupyter Notebook for the 1st place solution.
This version, running 15 epochs per set at about 40 seconds per epoch on a 1080 Ti, scores 0.519 on the private LB, good for a silver medal.

Or on nbviewer with a direct Download button (upper right corner)

6 Likes

With @radek’s comment, I got the “But of course!” moment about the 16 networks, each one dedicated to forecasting a single day of the 16 days in Test :upside_down_face:

Another thing I found very neat is his careful choice of validation dates: he didn’t simply take the last 16 days before the Test start date (2017-08-16), which would be 2017-07-31 -> 2017-08-15.

He chose instead the latest 16-day Train bracket that most resembled the 16-day Test window, that is 2017-07-26 -> 2017-08-09.

Doing so, he made sure the two sets had the same number of each weekday (e.g. three medium-sales-volume Wednesdays/Thursdays vs. three low-volume Mondays/Tuesdays). It also fully captures the end-of-month weekend: payroll is about to drop, but it is a banking holiday for credit/Visa card payments, so people won’t be charged until the next Monday. A validation window starting on Monday, July 31 would miss the boost of July’s final Friday/Saturday.
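
A quick way to check the weekday mix (my own sketch in pandas, not the winner’s code; I’m only using the window start dates mentioned above, with 16-day spans):

import pandas as pd

test_days = pd.date_range('2017-08-16', periods=16)   # the competition's Test window
val_days  = pd.date_range('2017-07-26', periods=16)   # winner's validation window
naive_val = pd.date_range('2017-07-31', periods=16)   # the "last 16 days of Train" alternative

def weekday_counts(days):
    return pd.Series(days.dayofweek).value_counts().sort_index()

print(weekday_counts(test_days))   # same mix as val_days (both start on a Wednesday)
print(weekday_counts(val_days))
print(weekday_counts(naive_val))   # different mix: starts on a Monday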

There’s true business knowledge in Retail, imho.

2 Likes

I’m working on porting my simple NN from keras to fastai.

I ran into a problem with the apply_cats method. I created the categories on my training set but the test set has some item_nbr values not present in the training set.

I refactored to add an optional null_check. Since this has potential performance implications I left it as an optional check.

Posting here first to see if anyone has comments/feedback before submitting a pull request. The method is looking a bit awkward; there are probably ways to refactor it.

import pandas as pd

def apply_cats(df, trn, null_check=False):
    """Make df's columns categorical, using the categories learned on trn."""
    for n, c in df.items():
        if (n in trn.columns) and (trn[n].dtype.name == 'category'):
            # values of df[n] that are not among trn's categories become NaN
            df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)
            if null_check and df[n].isnull().values.any():
                raise ValueError(f'Target dataframe has null values for column {n}. This can occur if the target dataframe has category values not present in original.')
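
For context, a toy example of the failure mode the check is meant to catch (made-up item numbers; uses apply_cats and pandas as imported above):

trn = pd.DataFrame({'item_nbr': [1, 2, 3]})
trn['item_nbr'] = trn['item_nbr'].astype('category')

test = pd.DataFrame({'item_nbr': [2, 3, 99]})    # 99 never appears in training
apply_cats(test.copy(), trn)                     # silently turns 99 into NaN
apply_cats(test, trn, null_check=True)           # raises ValueError instead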

Today I ported my simple NN from Keras to fastai.

Here is the original Keras version, and here is the fastai version.

(I left out the data prep because it is the same.)

I tried to create the same model, loss functions, embeddings, etc. to see if I could get the same results.

It’s worth noting that this is an unusual dataset, so don’t draw any sweeping conclusions from the comparisons. I only have 4 features and ~20K trainable params. There are 37 million training examples.

Plus I could have made a mistake. :slight_smile: Maybe @jeremy will see something I missed.

Observations

Fastai is much, much easier to use than Keras.

I had to write much less code to prep for training. Building a custom loss function was trivially easy. Fastai has lots of helper functions that just make things easier.

No way I’m going back to Keras.

Unfortunately my loss in the fastai model is about 2x worse than what I achieved in Keras.

Here is what I checked.

  • Feature engineering is the same. I ran the same Python code as before to get the 3 categorical features and 1 continuous feature.
  • The models have the same number of trainable parameters.
  • The models have the same loss function, MAE.
  • I normalized both targets using the same function.

Training the fastai model is 10-20x slower than Keras.

I think Keras is copying the entire dataset to the GPU and then doing the training. Since I’m using gigantic batch sizes it runs very fast in Keras, about 22 secs. Takes >5 mins in fastai.
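
For what it’s worth, here is the kind of thing I have in mind, in plain PyTorch rather than fastai (toy data and shapes, my own sketch): keep the whole dataset resident on the GPU and slice huge batches out of it, so there is no per-batch copying or data-loader overhead.

import torch
import torch.nn as nn

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# toy stand-ins: a few features, lots of rows, everything resident on the GPU
X = torch.rand(1_000_000, 4, device=device)
y = torch.rand(1_000_000, 1, device=device)

model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()        # MAE, same loss as above

bs = 65_536                  # gigantic batches, like the winning kernel
for epoch in range(1):
    perm = torch.randperm(X.shape[0], device=device)
    for i in range(0, X.shape[0], bs):
        idx = perm[i:i + bs]
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        opt.step()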

Questions

  • Anyone have suggestions for speeding up PyTorch for situations like this? Worth noting that the winner used a batch size of >65,000 as well.
  • Any suggestions for getting a lower validation MAE for the fastai version?

A bit of a long shot but the difference in training speed might be due to not using multiple cores for loading the data. Could you please check your CPU saturation using something like htop?

I slightly doubt this is the culprit, but it’s the best idea I’ve got so far.

(another thing would be pinning the memory - I think it might be an argument to one of the methods for dataset creation, maybe check the signatures and see if you can set it to true?)
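
Something along these lines is what I mean, using the standard torch.utils.data.DataLoader arguments (I’m not sure off-hand where fastai exposes them):

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.rand(1000, 4), torch.rand(1000, 1))
dl = DataLoader(ds, batch_size=256, shuffle=True,
                num_workers=4,     # several CPU cores preparing batches in parallel
                pin_memory=True)   # page-locked host memory -> faster copies to the GPU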

Sorry for such far-fetched ideas, but that is all I have at this point… If the loss is twice as high, then whatever else is amiss here might be causing the slowdown as well (some difference in architecture, etc.).

1 Like

A bit of a long shot but the difference in training speed might be due to not using multiple cores for loading the data. Could you please check your CPU saturation using something like htop?

I think that’s it. IIRC training under Keras would max my CPUs. With fastai it barely got above 1%. I’ll try to go back and test.

At the moment I’m digging through this mess looking for other features.

Pardon the rant… but I’m annoyed at the level of cutting-and-pasting by Kagglers who don’t actually understand what the code is doing. The difference in rigor between the fastai community and Kaggle is astounding.

At the moment I’m digging through this mess looking for other features.

Exactly the part of the whole competition that is challenging for me to understand :grinning: Then I sit down and study multi-indexing and grouping in pandas again :grinning:

Also, thank you for sharing the code… I think a bigger-than-usual batch size was somehow not a big problem for time series. It may even score better… I tried it for the Recruit Restaurant Visitor Forecasting challenge as well, and it was not a problem either.

Nope, I was wrong. fastai also pegs CPU. So that’s not it …

Ok, I’ve cracked the code. :sunglasses:

Here is my notebook.

The winner took 3 key steps to generate the features:

  1. Build dataframes indexed on store/item with dates as columns.
  2. Pick a single date and build time-based features on that date.
  3. Pick another date, build the same feature, and concatenate with the previous date.

90% of the code expands on these 3 steps. I created a simple, 1-feature example in the notebook above which illustrates it.
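
In toy form it looks roughly like this (my own reconstruction with made-up names and numbers, not the winner’s code):

import pandas as pd

# long-format sales: one row per store/item/date
sales = pd.DataFrame({
    'store_nbr':  [1, 1, 1, 1, 2, 2, 2, 2],
    'item_nbr':   [10, 10, 10, 10, 10, 10, 10, 10],
    'date':       pd.to_datetime(['2017-07-01', '2017-07-02',
                                  '2017-07-03', '2017-07-04'] * 2),
    'unit_sales': [3, 4, 2, 5, 1, 0, 2, 3],
})

# 1. pivot: rows = store/item, columns = dates
wide = sales.pivot_table(index=['store_nbr', 'item_nbr'],
                         columns='date', values='unit_sales')

# 2. for one anchor date, build a time-window feature from the columns before it
def mean_before(wide, anchor, days):
    cols = pd.date_range(anchor - pd.Timedelta(days=days), periods=days)
    return wide[cols].mean(axis=1).rename('mean_before')

feat_0704 = mean_before(wide, pd.Timestamp('2017-07-04'), 3)

# 3. repeat for another anchor date and concatenate the rows
feat_0703 = mean_before(wide, pd.Timestamp('2017-07-03'), 2)
train_rows = pd.concat([feat_0704, feat_0703], axis=0)

The real code repeats this for many statistics, window lengths, and anchor dates, which is where most of the features come from.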

The target is simply the sales for a store/item combination over some number of subsequent days; 16 days are used in this example because the test set is 16 days long.

Genius or madness?

I can’t decide if I hate or love this approach. There is a lot I don’t like about it. Removing “date” as the organizing principle for the training data makes the entire workflow complex and unintuitive. There also seems to be enormous feature redundancy.

On the positive side, it is more “pythonic” (and much faster) to generate the time-based features by running functions across DataFrame columns rather than looping through date rows as in the Rossman example. Presumably you could follow the same technique for weather data or anything else date-related.

In any case, I’ll have to spend some more time with it to see if I can beat the winning entry with a simpler approach.

An early hypothesis on why it may have won …

It may be that simply training each example against 16 targets (the dates) makes for a more robust example than training against 1 date. The loss can be more accurate because each example captures more information than a single-date example would.

Hopefully if I keep writing about this and showing off the efficiency of the fastai library, @jeremy will weigh in with an opinion. :wink:

4 Likes

Unlikely, since I’m sitting by the pool in Lanai :wink:

3 Likes

I was wrong; it looks like they built 16 models, training each one to predict 1 date. I can’t see why they did this.
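
In other words, something like this (Keras-style toy sketch; I’m guessing at everything except the one-model-per-day structure):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(1000, 403).astype('float32')   # engineered features
Y = np.random.rand(1000, 16).astype('float32')    # sales for each of the next 16 days

models = []
for day in range(16):                 # one network per forecast day...
    m = Sequential()
    m.add(Dense(512, activation='relu', input_dim=X.shape[1]))
    m.add(Dense(1))                   # ...instead of a single Dense(16) head predicting all days
    m.compile(loss='mae', optimizer='adam')
    m.fit(X, Y[:, day], batch_size=512, epochs=1, verbose=0)
    models.append(m)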

All,

I’ve refactored much of the winning model and spent the last few days analyzing. Here is the notebook if you want to review.

I ended up with 403 features and unfortunately I’m still not able to get great results from training it.

So I built a Random Forest model to take a deeper look at the relative ranking of the features. Not surprising to anyone who has spent time with this data… only averages of recent sales seem to predict very much.

Here are the top 10:

   feature          score
0  sum_14_before    0.454851
1  mean_14_before   0.314681
2  mean_30_before   0.061810
3  sum_30_before    0.054516
4  mean_40_before   0.029909
5  sum_40_before    0.028628
6  promo_14_after   0.008888
7  store_class_dow  0.005143
8  dow              0.005110
9  item_dow         0.004356

I’ve learned a ton working through this dataset … but good results still elude me. Hopefully the work is helpful to others.

4 Likes

Just in case you missed it … look how simple it was to create the RF after first building the NN.

It literally took only 2 lines of code. I was able to use all of the same feature and target dataframes.
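
Roughly like this (a sketch with toy data; in the notebook I reuse the real feature/target dataframes):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# toy stand-ins for the engineered feature dataframe and target
X_train = pd.DataFrame(np.random.rand(500, 4),
                       columns=['sum_14_before', 'mean_14_before', 'mean_30_before', 'dow'])
y_train = np.random.rand(500)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X_train, y_train)
print(pd.Series(m.feature_importances_, index=X_train.columns).sort_values(ascending=False))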

In case anyone is still interested in this competition (I am), I have published a clean notebook with my own work so far: (https://github.com/jonas-pettersson/fast-ai/blob/master/Exploration%20and%20Prediction%20for%20Structured%20Data.ipynb)
I am still not anywhere near a good result (my best score was 0.614), but I think the notebook can be of help to a newcomer. I am of course also very grateful for any feedback.
You can read my conclusions at the end of the notebook, but here is the short version: it is not sufficient to throw this problem at a deep neural network and hope for the best. I started this exercise without looking at any forums or kernels, just to see how far I would get on my own based on the Rossman example from the DL course and all I learned from the ML course.
Not very far, it turned out. First I had to repair the issue of zero sales missing from the training data, and after that, only when I added the “moving average” feature, as suggested by @kevindewalt, did things start to go in the right direction.
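
(For reference, the moving average was along these lines, assuming a long-format train dataframe with store_nbr / item_nbr / date / unit_sales columns; the names and window size here are just my example:)

import pandas as pd

train = pd.DataFrame({
    'store_nbr':  [1] * 6,
    'item_nbr':   [10] * 6,
    'date':       pd.date_range('2017-07-01', periods=6),
    'unit_sales': [3, 4, 2, 5, 1, 0],
})

train = train.sort_values(['store_nbr', 'item_nbr', 'date'])
train['ma_7'] = (train.groupby(['store_nbr', 'item_nbr'])['unit_sales']
                      .transform(lambda s: s.rolling(7, min_periods=1).mean()))
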
Anyway, I learned a lot. Not only about the practical use of the fast.ai library but also in not giving up in face of frustrating setbacks. Kaggle competitions are a great way to learn because you get feedback via your scores and you can learn from others.
If I get some more time, I will continue down the path of going through kernels and trying to find out what I can improve, probably adding more (“engineered”) features. I might also come back once I have understood LSTMs more thoroughly, as many seem to use them. Even if the dataset feels very hard, I think it is a good learning example, as it is close to reality with all the problems that come with it.
Or I might look for some “living” competition with structured data instead…

2 Likes