Kaggle grocery competition

So, following Jeremy’s advice to simply submit a column of zeros, I did that for the Grocery data challenge. I got position 184 on the leaderboard and, to my surprise, there were a bunch of people who performed worse than guessing all zeros. haha. Anyway, a useful exercise, folks. It’s a way to take the first step on your journey. All I did was download sample_submission.csv, rename it to lame_submission.csv, compress it, and submit via the submit button. Give it a try!


@jeremy Can you tell us better ways to deal with larger datasets for EDA and prediction? For example, the Grocery dataset is 5GB.

I would probably take a random subsample

I would use AWS

But for prediction you would want to process the whole dataset, right?

Yeah, but for exploration subsampling might be a good idea, since it’s hard to plot a very big dataset or look at interactions between variables. And even for trying new things for prediction you may want a subsample: a substantial amount of data will generalize well and, in my opinion, give a good idea of how an approach will behave on the whole dataset.
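A minimal sketch of subsampling for exploration with pandas (the toy DataFrame here stands in for the real train.csv):

```python
import pandas as pd

# Stand-in for the full dataset; in practice this would come from read_csv
df = pd.DataFrame({"store_nbr": range(100), "unit_sales": range(100)})

# 10% random sample for plotting and quick experiments;
# random_state makes the subsample reproducible across runs
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))  # 10
```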


Here are the steps I took to get this done.

Install shuf on your server (it’s part of GNU coreutils), then:

shuf -n 10000 train.csv > rand_train.csv
head -1 train.csv > head.csv
cat head.csv rand_train.csv > whatever_name_you_want.csv

(Note the header file goes first in the cat, so the header ends up at the top.) Hope it helps.


Easier to just use nrows in read_csv() to avoid lots of files lying around.
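For example, with an in-memory CSV standing in for the 5 GB train.csv:

```python
import io
import pandas as pd

# Stand-in for the big train.csv
csv = io.StringIO("id,unit_sales\n" + "\n".join(f"{i},{i * 2}" for i in range(100)))

# nrows reads just the first N data rows -- no sample files lying around
df = pd.read_csv(csv, nrows=10)
print(len(df))  # 10
```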


I was trying to get a random sample; when I tried to do that with pandas it took too long.


Yup I like the ‘shuf’ approach, since it’s nice to be able to switch between different samples easily. In this case, I’d suggest having two files - one which is the last million rows, one which is a random sample of a million of the last ten million rows:

head -1 train.csv > train_headers.csv
tail -n 1000000 train.csv > train_1m.csv
cat train_headers.csv train_1m.csv > train_1m_hdr.csv
tail -n 10000000 train.csv | shuf | head -n 1000000 > train_1m_rnd.csv
cat train_headers.csv train_1m_rnd.csv > train_1m_rnd_hdr.csv

And here’s a really handy kernel that shows how to load the files with minimal memory use - which will make all your processing faster too: https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage
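The core trick in that kernel is declaring narrow dtypes when reading, instead of pandas’ default int64/float64. A minimal sketch (the specific dtypes below are assumptions for illustration, not the kernel’s exact mapping):

```python
import io
import numpy as np
import pandas as pd

csv = io.StringIO("store_nbr,item_nbr,unit_sales\n1,103665,7.0\n2,105574,1.0\n")

# Narrow dtypes declared up front cut memory use substantially
dtypes = {"store_nbr": np.int8, "item_nbr": np.int32, "unit_sales": np.float32}
df = pd.read_csv(csv, dtype=dtypes)
print(df.memory_usage(deep=True).sum())
```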


If you use this approach, you would want to use skiprows too, since otherwise you’d have the first few rows (which are the earliest dates) rather than the last few (which are the most recent dates, and the ones that will be most useful for modeling).


Note that you can get in the top 40 on the leaderboard by calculating per-group means, no machine learning algorithms required: https://www.kaggle.com/ceshine/mean-baseline-lb-59
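The idea in that kernel boils down to: group train by (item, store), take the mean, and join it onto test. A toy sketch (the real kernel also handles dates and a log transform, which this omits):

```python
import pandas as pd

train = pd.DataFrame({
    "item_nbr":   [1, 1, 2, 2],
    "store_nbr":  [9, 9, 9, 9],
    "unit_sales": [2.0, 4.0, 10.0, 10.0],
})
test = pd.DataFrame({"item_nbr": [1, 2, 3], "store_nbr": [9, 9, 9]})

# Per-(item, store) mean as the prediction; unseen pairs fall back to 0
means = train.groupby(["item_nbr", "store_nbr"])["unit_sales"].mean()
pred = test.join(means, on=["item_nbr", "store_nbr"]).fillna({"unit_sales": 0})
print(pred["unit_sales"].tolist())  # [3.0, 10.0, 0.0]
```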

And you can get in the top 100 with a single (long) line of code!

(pd.read_csv("../input/test.csv", usecols=[0, 2, 3])
   .set_index(['item_nbr', 'store_nbr'])
   .join(pd.read_csv('../input/train.csv', usecols=[0, 2, 3, 4],
                     converters={'unit_sales': lambda u: float(u) if float(u) > 0 else 0},
                     skiprows=range(1, 124035460))
           .groupby(['item_nbr', 'store_nbr'])['unit_sales']
           .median().to_frame('unit_sales'),
         how='left')
   .fillna(0)
   .to_csv('median.csv', float_format='%.2f', index=None))

I got to 162 with the log of the mean unit_sales. haha. About to submit actual random forest predictions.

Something for people to think about before Tuesday: how can you improve on the per-group means method shown in the kernel linked above? No clever math or ML needed - just a little common sense! :slight_smile:

If you can figure out a way, try it out - you might be able to break into the top 30 of this competition without too much work…


doh! missed that shuf in there! good one!

BTW from the Kaggle kernels, here’s how to do a random sample in pandas:

import random
import pandas as pd

n = 125497040  # number of records in the file (excludes header)
s = 10000      # desired sample size
select = sorted(random.sample(range(1, n + 1), s))
skip = tuple(set(range(1, n + 1)) - set(select))  # 1-based rows to skip; row 0 is the header
df_train = pd.read_csv("../input/train.csv", skiprows=skip)

You could use a variant of that to get a random sample of the last few rows too.
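One way to sketch that variant: restrict the sampled rows to the tail of the file, and skip everything else. The sizes below are toy values standing in for the real row counts:

```python
import io
import random
import pandas as pd

n = 100     # data rows in the file (excludes header)
tail = 30   # only sample from the last 30 rows (the most recent dates)
s = 10      # desired sample size

csv = io.StringIO("id,unit_sales\n" + "\n".join(f"{i},{i}" for i in range(n)))

random.seed(0)
# 1-based data-row numbers drawn only from the tail of the file
keep = set(random.sample(range(n - tail + 1, n + 1), s))
skip = [i for i in range(1, n + 1) if i not in keep]  # row 0 is the header
df = pd.read_csv(csv, skiprows=skip)
print(len(df))  # 10
```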


I’ve been using df.sample(frac=0.1). Is that the same thing?

“A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts donating water and other first need products which greatly affected supermarket sales for several weeks after the earthquake.”

Would excluding outliers from the calculations help?

Edit: skiprows already excludes this period

Yes, except that if you don’t have enough RAM to load the dataset in the first place, you can’t use that approach.

Why does my job get killed instantly on AWS even though I’m only taking a sample of the train data, and I’m using a t2.medium with 30 GB of storage? Should I upgrade the storage, or could there be some other problem?