Grocery Competition Testing and Validation set

tinapeng · November 10, 2017, 7:14pm

Regarding how to choose a proper validation set for the grocery competition: Jeremy (@jeremy) mentioned yesterday that the best approach they have tried so far is ‘same day range 1 month earlier’ or something similar, which gave a good linear fit between validation and test set results.

I don’t understand how this works. My understanding is that we have 30-ish different datasets, one for each day of the month, and we make predictions separately for each day. Is that correct?

jeremy · November 10, 2017, 8:51pm

I’m not sure what you mean by 30 different datasets. There’s only one test CSV file available on the Kaggle website.

tinapeng · November 10, 2017, 8:55pm

I was trying to figure out how to get the ‘same day range 1 month earlier’ validation set that you mentioned for the grocery competition. I thought it meant we split the data by day of the month (1st, 2nd, 3rd, … 30th) into separate datasets and predict each one separately. Is that what you meant by the validation set?

jeremy · November 10, 2017, 9:13pm

I simply meant selecting rows with a date after 15th July and before 1st August (i.e. 16th to 31st July).
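
For readers following along, here is a minimal sketch of that row selection in pandas. It is only an illustration under assumptions not stated in the thread: the year 2017, a column named `date`, the toy `unit_sales` values, and the helper name `split_validation` are all hypothetical. Treating everything on or before 15th July as training data is likewise an assumption.

```python
import pandas as pd


def split_validation(df, start="2017-07-15", end="2017-08-01", date_col="date"):
    """Return (train, valid).

    valid: rows dated strictly after `start` and strictly before `end`.
    train: rows dated on or before `start` (an assumption for this sketch).
    """
    dates = pd.to_datetime(df[date_col])
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    valid_mask = (dates > start) & (dates < end)
    train_mask = dates <= start
    return df[train_mask], df[valid_mask]


# Toy daily data standing in for the real training file (values are made up).
df = pd.DataFrame({"date": pd.date_range("2017-07-10", "2017-08-05", freq="D")})
df["unit_sales"] = range(len(df))

train, valid = split_validation(df)
print(valid["date"].min().date(), valid["date"].max().date(), len(valid))
```

With daily data, the strict comparisons keep 16th July through 31st July in the validation set, which is a single contiguous block of rows rather than one dataset per day of the month.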

tinapeng · November 10, 2017, 9:15pm

Oh… I overthought it. Thank you for clarifying!