Corporación Favorita Grocery Sales Forecasting


I had a very similar experience… In general it seems that IO is a major, major area to figure out. I have been having issues across multiple projects, to the point where I started doing some research and writing a monster post on it :smiley: (it’s still in the works).

With regard to this specific pandas issue, I solved it by doing this (basically I just used an NVMe drive as swap):

(Kevin Dewalt) #54

Here’s how I handled memory limitations:

  1. Watch dtypes. Convert to the smallest integers that work. Convert booleans to int8. Run periodically.
  2. Keep track of big dataframes, keep deleting them. Especially ones in loops.
  3. Buy more RAM. :slight_smile: I upgraded to 64 GB.
  4. Increase swap space on NVME drive.
  5. Within a tmux pane keep top running. Track CPU and %Mem usage.

I always have top running in one tmux pane and nvidia-smi -l 1 (which I have aliased) running in another. That lets me track system utilization at a glance.

Hope it helps!
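Tips 1 and 2 from the list above could look roughly like this. It is only a minimal sketch: the toy dataframe and the `mem_mb` helper are made up for illustration, and the column names just mimic the Favorita data.

```python
import gc

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the (much larger) training data.
df = pd.DataFrame({
    "store_nbr": np.arange(1000, dtype=np.int64) % 54 + 1,
    "unit_sales": np.linspace(0.0, 100.0, 1000),
    "onpromotion": np.arange(1000) % 2 == 0,
})

def mem_mb(frame):
    """Total memory used by a dataframe, in MB (hypothetical helper)."""
    return frame.memory_usage(deep=True).sum() / 1024 ** 2

before = mem_mb(df)

# Tip 1: shrink dtypes once the value ranges are known.
df["store_nbr"] = df["store_nbr"].astype(np.int8)       # values 1..54 fit in int8
df["unit_sales"] = df["unit_sales"].astype(np.float32)  # half the size of float64
df["onpromotion"] = df["onpromotion"].astype(np.int8)   # booleans to int8

after = mem_mb(df)
print(f"{before:.4f} MB -> {after:.4f} MB")

# Tip 2: drop big intermediates as soon as they are no longer needed.
tmp = df.copy()
del tmp
gc.collect()
```

Calling something like `mem_mb` after each big join makes it easier to spot which step blows up the memory.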

(Jeremy Howard (Admin)) #55

FYI I’ve been having RAM issues for my NLP work recently, so I have started using the chunksize param in pandas when reading the CSV, to process it a chunk at a time. It adds complexity and code, but it’s a good approach for large datasets.
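A minimal sketch of the chunked approach, assuming a per-store sum is what needs to be computed. The in-memory CSV and its columns are made up so the example runs on its own; with a real file you would pass the path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "store_nbr,unit_sales\n" + "\n".join(
    f"{i % 5},{i * 0.5}" for i in range(10)
)

totals = None
# With chunksize set, read_csv yields DataFrames of at most 4 rows each
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    part = chunk.groupby("store_nbr")["unit_sales"].sum()
    # Merge the partial result into the running totals.
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals)
```

The extra complexity is in the merge step: anything that is not a simple sum or count needs some thought about how partial results combine.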

(s.s.o) #56

Pandas also has a nice parameter, 'downcast', for numeric types, e.g. pd.to_numeric(series, downcast='float'). When it is set, the resulting data is downcast to the smallest numerical dtype possible. As explained in the docs, it follows the rules below:

  • ‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)
  • ‘unsigned’: smallest unsigned int dtype (min.: np.uint8)
  • ‘float’: smallest float dtype (min.: np.float32)
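A quick sketch of those three rules on toy series (the values are made up):

```python
import numpy as np
import pandas as pd

small_ints = pd.Series([1, 2, 54], dtype=np.int64)
floats = pd.Series([0.5, 1.25, 3.0], dtype=np.float64)

as_int = pd.to_numeric(small_ints, downcast="integer")    # smallest signed int
as_uint = pd.to_numeric(small_ints, downcast="unsigned")  # smallest unsigned int
as_float = pd.to_numeric(floats, downcast="float")        # float32 at the smallest

print(as_int.dtype, as_uint.dtype, as_float.dtype)
```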

(Eric Perbos-Brinck) #57

Hey @jeremy,

You mentioned in one of the videos that you would post your Fastai notebook for Favorita, AFTER the competition ends, due to regulations and ethics for Kaggle rules :+1:

Any chance you could do so ?

There might be more than my humble self looking for it, especially how you went from training (I think I got that right) to predicting/submitting (I failed that part with Fastai library).

/kudos !

(Eric Perbos-Brinck) #58


If you’re looking to port a top Favorita solution to the Fastai library, as you did with the Rossmann 3rd place solution: one of the 1st place team members in Favorita posted a single Keras+Tensorflow kernel.

I tested it and it works “out of the box” with Keras: it takes about 10 hours to run on a 1080Ti and achieves 0.513 on the Private LB (3rd place).

Here’s the Favorita Private Leaderboard

It took up to 48 GB of RAM though during the Join_Tables & Feature Engineering phase (the swap file helped a lot).

(Kevin Dewalt) #59

Thanks for posting … I hope to dig through this in detail this weekend and port to library if I have time.


There are two useful things to know here. First, what is the shape of X_train?

Second, I wonder what this layer returns, i.e. what the dimensionality of its output is:

model.add(LSTM(512, input_shape=(X_train.shape[1],X_train.shape[2])))

@EricPB, if you would have this already on your computer and it wouldn’t be too much of a problem, would you be so kind and check these two things?

I am thinking that output from model.summary() might also provide some insights.

I was planning to implement @Lingzhi’s model and looked at the code for quite a while where I now think I understand what it does. Am caught up with a lot of other things ATM and the 2nd part of the course is just around the corner… (still crossing my fingers I’ll get in :slight_smile: ).

The cool thing with this kernel is that we could literally copy the code to line 232 and this should give us the dataset… should be a great starting point for messing around with this.

(Eric Perbos-Brinck) #61


X_train.shape: (1340120, 1, 561)


Layer (type)                 Output Shape              Param #   

lstm_16 (LSTM)               (None, 512)               2199552   
batch_normalization_106 (Bat (None, 512)               2048      
dropout_106 (Dropout)        (None, 512)               0         
dense_106 (Dense)            (None, 256)               131328    
p_re_lu_91 (PReLU)           (None, 256)               256       
batch_normalization_107 (Bat (None, 256)               1024      
dropout_107 (Dropout)        (None, 256)               0         
dense_107 (Dense)            (None, 256)               65792     
p_re_lu_92 (PReLU)           (None, 256)               256       
batch_normalization_108 (Bat (None, 256)               1024      
dropout_108 (Dropout)        (None, 256)               0         
dense_108 (Dense)            (None, 128)               32896     
p_re_lu_93 (PReLU)           (None, 128)               128       
batch_normalization_109 (Bat (None, 128)               512       
dropout_109 (Dropout)        (None, 128)               0         
dense_109 (Dense)            (None, 64)                8256      
p_re_lu_94 (PReLU)           (None, 64)                64        
batch_normalization_110 (Bat (None, 64)                256       
dropout_110 (Dropout)        (None, 64)                0         
dense_110 (Dense)            (None, 32)                2080      
p_re_lu_95 (PReLU)           (None, 32)                32        
batch_normalization_111 (Bat (None, 32)                128       
dropout_111 (Dropout)        (None, 32)                0         
dense_111 (Dense)            (None, 16)                528       
p_re_lu_96 (PReLU)           (None, 16)                16        
batch_normalization_112 (Bat (None, 16)                64        
dropout_112 (Dropout)        (None, 16)                0         
dense_112 (Dense)            (None, 1)                 17
Total params: 2,446,257
Trainable params: 2,443,729
Non-trainable params: 2,528
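The parameter counts in this summary can be reproduced with the textbook formulas, which is a handy sanity check on what the first layer contains. A small sketch (the helper names are made up; 561 is the feature dimension from the X_train shape above):

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input/unit pair plus one bias per unit.
    return input_dim * units + units

lstm_total = lstm_params(561, 512)      # matches lstm_16 in the summary
dense_second = dense_params(512, 256)   # matches dense_106 in the summary
dense_instead = dense_params(561, 512)  # what a Dense(512) first layer would have

print(lstm_total, dense_second, dense_instead)
```

So the LSTM first layer carries roughly 7.6 times as many parameters as a Dense(512) layer on the same 561 inputs would.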


Thank you Eric :+1: :slight_smile:

I have no clue why the first layer is an LSTM nor what it does. I do not think his statement in the comments is correct that it is equivalent to a Dense layer.

It’s really cool though. Could it be that the remember / forget gates are still learned? Sort of like we are using the LSTM cell to take a look at the data and keep only the parts that are important?

Well, this is crazy. Don’t think I will be able to wrap my head around this as I don’t want to explore that keras layer further.

One way to find out would be rerunning the training with the Dense layer instead.
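Short of retraining, a NumPy sketch of a single LSTM step gives a hint. It assumes the standard LSTM equations with sigmoid gates and a zero initial hidden and cell state (as far as I know that is what a non-stateful Keras LSTM starts from, but treat it as an assumption). The function name, gate ordering and sizes are only this sketch's conventions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_single_step(x, W, U, b, h0, c0):
    """One standard LSTM step.
    W: (4, units, input_dim), U: (4, units, units), b: (4, units).
    Gate order (i, f, g, o) is just this sketch's convention."""
    i = sigmoid(W[0] @ x + U[0] @ h0 + b[0])  # input gate
    f = sigmoid(W[1] @ x + U[1] @ h0 + b[1])  # forget gate
    g = np.tanh(W[2] @ x + U[2] @ h0 + b[2])  # candidate values
    o = sigmoid(W[3] @ x + U[3] @ h0 + b[3])  # output gate
    c = f * c0 + i * g                        # new cell state
    return o * np.tanh(c)                     # new hidden state

rng = np.random.default_rng(0)
input_dim, units = 561, 8  # units kept small for the sketch
x = rng.normal(size=input_dim)
W = rng.normal(size=(4, units, input_dim)) * 0.05
U = rng.normal(size=(4, units, units))
b = rng.normal(size=(4, units))
zeros = np.zeros(units)

h = lstm_single_step(x, W, U, b, zeros, zeros)

# With h0 = c0 = 0 the recurrent weights U and the forget gate drop out,
# leaving a "gated dense" layer built from the i, g and o parts only.
h_gated = sigmoid(W[3] @ x + b[3]) * np.tanh(
    sigmoid(W[0] @ x + b[0]) * np.tanh(W[2] @ x + b[2])
)

print(np.allclose(h, h_gated))
```

Under those assumptions, with a sequence length of 1 the forget gate and the recurrent weights never influence the output, while the input and output gates still do. That would make the layer a gated transformation of the 561 features rather than a plain Dense layer.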

Either way, this is really helpful :slight_smile: Thanks a lot @EricPB! I also find the part where the 561-element vector is fed into the model quite mind boggling. Intuitively, feeding it something of shape [type_of_data x days_in_train] makes more sense. For instance, I think Lingzhi stacked a vector of unit_sales, a vector of promo_days and some statistics to form something of dimensionality [3 x days_in_train].

(Eric Perbos-Brinck) #63

I’m working (or my PC is :upside_down_face:) on a lighter/faster version, with fewer epochs per step (15 max, so Callbacks probably can’t kick in), to experiment with different parameters, including the “why choose LSTM over Dense in the first layer” question.

I’ll post a revised Jupyter Notebook on GitHub so everyone can experiment as well.


  1. As it is, you’ll still need at least 50 GB of RAM to run it, due to the “Preparing Dataset” cells #24 to #27.
    On my rig, this is not a big issue because I have 32 GB of “real” RAM and allocated a 140 GB swap file on the 1 TB Samsung 960 NVMe, so it’s rather painless. But if you don’t have swap to help, the notebook may crash.

  2. The whole kernel is without any comment, so expect some reverse-engineering work to figure out what is done by each cell.

  3. The current open question for me: why did he choose to run 16 networks (or steps)? Is it related to the duration of the Test set (i.e. 16 days)? If I were to use his template for another project with a Test set of 25 days, would I need 25 networks? Or is it just a coincidence?


From what I was able to understand from looking at the code… he seems to be keeping the X_train constant, it doesn’t change during the training.

But he constructs 16 neural nets, one for each day. He just changes the target value to be predicted but the train sequences stay the same.
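The pattern described here (same inputs, one model per forecast day) can be sketched as follows. To keep it small and runnable, a least-squares fit stands in for each neural net, and the data is random. None of this is the kernel's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_days = 200, 10, 16

# Hypothetical data: one fixed feature matrix, one target column per test day.
X_train = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=(n_features, n_days))
y_train = X_train @ true_w + 0.01 * rng.normal(size=(n_samples, n_days))

models = []
for day in range(n_days):
    # X_train never changes; only the target column does.
    w, *_ = np.linalg.lstsq(X_train, y_train[:, day], rcond=None)
    models.append(w)

# At prediction time each model fills in its own day of the horizon.
X_test = rng.normal(size=(5, n_features))
preds = np.column_stack([X_test @ w for w in models])
print(preds.shape)  # one column per forecast day
```

With this setup the number of models follows directly from the length of the forecast horizon.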

This is very neat in many ways given the data for this competition.

(Jeremy Howard (Admin)) #65

I’m curious about this too.

(Eric Perbos-Brinck) #66

Errr… I’m pretty sure if you asked in the comments section on Kaggle, or on KaggleNoobs, someone will find the original culprit and get you an answer.
This is my humble opinion but I agree with it. :innocent:

(Kevin Dewalt) #67

Looking at the Kaggle discussion boards it doesn’t seem like many of the teams understand the basics of why it is working. My hypothesis:

They found an additional feature and the simple ffnn designs I was using will give similar results.

Will let you know after I dig into it.

(Eric Perbos-Brinck) #68

Changing his original code as he suggested in

This kernel is based on senkin13’s kernel: You can replace model.add(LSTM(512, input_shape=(X_train.shape[1],X_train.shape[2]))) with model.add(Dense(512, input_dim=X_train.shape[1])), I think there is no difference.

generates the following error (I didn’t try to investigate, I just pasted it in and ran the code).

(s.s.o) #69

If you comment out the lines below:
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
X_val = X_val.reshape((X_val.shape[0], 1, X_val.shape[1]))

and change the input shape to input_shape=(X_train.shape[0],X_train.shape[1]), it should work, I think.
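For reference, the shapes involved can be checked with NumPy alone. In Keras, input_shape describes a single sample, without the batch dimension. The sketch below uses a tiny made-up array in place of the real X_train and does not reproduce the error discussed in this thread.

```python
import numpy as np

# Tiny stand-in for X_train before the reshape: (n_samples, n_features).
X_2d = np.zeros((8, 561), dtype=np.float32)

# The reshape from the kernel adds a sequence axis of length 1.
X_3d = X_2d.reshape((X_2d.shape[0], 1, X_2d.shape[1]))

# Per-sample shapes, i.e. everything after the batch dimension.
lstm_input_shape = X_3d.shape[1:]   # what an LSTM first layer would see
dense_input_shape = X_2d.shape[1:]  # what a Dense first layer would see

# After the reshape, shape[1] is the sequence length, not the feature count.
print(X_3d.shape[1], X_2d.shape[1])
print(lstm_input_shape, dense_input_shape)
```

Note that once the reshape has run, X_train.shape[1] is 1 rather than 561, so a Dense layer defined with input_dim=X_train.shape[1] would only line up with the data if the reshape lines are removed first. That may or may not be what went wrong here.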

(Eric Perbos-Brinck) #70

Thanks a lot for your help @s.s.o but it didn’t work (you’re not to blame, it’s just my noobish debugging skills).
It’s pretty late in Stockholm now, maybe 02:00, so I’ll give it a try again tomorrow.
In any case, the basic Jupyter Notebook should be working fine so I’ll try and post it on GitHub, :+1:

(Eric Perbos-Brinck) #71

Here’s an edited Jupyter Notebook for the 1st place solution.
This version, running 15 epochs per set at 40 sec per epoch on a 1080Ti, scores 0.519 on the Private LB, enough for a Silver Medal.

Or on nbviewer with a direct Download button (upper right corner)

(Eric Perbos-Brinck) #72

With @radek’s comment, I got the “But of course !” moment about the 16 networks, each one dedicated to forecasting a single day of the 16 days in Test :upside_down_face:

Another thing I found very neat is his careful choice of validation dates: he didn’t go for the last 16 days before the Test starting date (2017-8-16), which, taken bluntly, would be 2017-7-31 -> 8-15.

He chose instead the latest 16 days’ Train bracket which most resembled the 16 days’ Test, that is 2017-7-26 -> 8-9.

Doing so, he made sure the two sets had the same number of each weekday (e.g. 3 medium-sales-volume Wednesdays/Thursdays, rather than 3 low-volume Mondays/Tuesdays). It also fully captured the end-of-month weekend, where payroll is about to drop but it is a banking holiday for payments with credit/Visa cards, so people won’t be charged until the next Monday (a validation set starting on Monday July 31 would miss the boost of the final Friday/Saturday of July).

There’s true business knowledge in Retail, imho.
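The weekday argument can be checked with the standard library. The sketch counts weekdays over 16 consecutive days from each start date mentioned above. The helper name is made up, and the window length of 16 is taken from the Test set duration.

```python
from collections import Counter
from datetime import date, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def weekday_counts(start, n_days=16):
    """Count how often each weekday occurs in n_days starting at start."""
    days = [start + timedelta(days=i) for i in range(n_days)]
    return Counter(DAY_NAMES[d.weekday()] for d in days)

test_counts = weekday_counts(date(2017, 8, 16))   # Test period start
val_chosen = weekday_counts(date(2017, 7, 26))    # validation start he chose
val_blunt = weekday_counts(date(2017, 7, 31))     # last 16 days before Test

print("test      ", dict(test_counts))
print("val chosen", dict(val_chosen))
print("val blunt ", dict(val_blunt))
```

The chosen window starts on the same weekday as the Test period, so the two weekday distributions match exactly, while the blunt window over-represents Mondays and Tuesdays.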