Kaggle grocery competition

So, following Jeremy’s advice to simply submit a column of zeros, I did that for the Grocery data challenge. I got position 184 on the leaderboard and, to my surprise, there were a bunch of people who performed worse than guessing all zeros. haha. Anyway, a useful exercise, folks. It’s a way to take the first step on your journey. All I did was download sample_submission.csv, rename it to lame_submission.csv, compress it, and submit via the submit button. Give it a try!


@jeremy Can you tell us better ways to deal with larger datasets for EDA and prediction? For example, the Grocery dataset is 5GB.

I would probably take a random subsample

I would use AWS

But for prediction you would want to process the whole dataset, right?

Yeah, but for exploration subsampling might be a good idea, since it’s hard to plot a very big dataset or look at interactions between variables. And even for trying new things for prediction you may want a subsample: a substantial amount of data will generalize well and, in my opinion, give a good idea of how an approach will behave on the whole dataset.
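A minimal sketch of subsampling for exploration with pandas (the toy DataFrame here stands in for the real train.csv):

```python
import pandas as pd

# Stand-in for the full dataset; in practice this would come from read_csv
df = pd.DataFrame({"store_nbr": range(100), "unit_sales": range(100)})

# 10% random sample for plotting and quick experiments;
# random_state makes the subsample reproducible across runs
sample = df.sample(frac=0.1, random_state=42)
print(len(sample))  # 10
```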


Here are the steps I took to get this done.

Install shuf on your server (it’s part of GNU coreutils), then:

shuf -n 10000 train.csv > rand_train.csv
head -1 train.csv > head.csv
cat head.csv rand_train.csv > whatever_name_you_want.csv

(Note the header file goes first in the cat, so the header ends up at the top.) Hope it helps.


Easier to just use nrows in read_csv() to avoid lots of files lying around.
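For example, with an in-memory CSV standing in for the 5 GB train.csv:

```python
import io
import pandas as pd

# Stand-in for the big train.csv
csv = io.StringIO("id,unit_sales\n" + "\n".join(f"{i},{i * 2}" for i in range(100)))

# nrows reads just the first N data rows -- no sample files lying around
df = pd.read_csv(csv, nrows=10)
print(len(df))  # 10
```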


I was trying to get a random sample; when I tried to do that with pandas it took too long.


Yup I like the ‘shuf’ approach, since it’s nice to be able to switch between different samples easily. In this case, I’d suggest having two files - one which is the last million rows, one which is a random sample of a million of the last ten million rows:

head -1 train.csv > train_headers.csv
tail -n 1000000 train.csv > train_1m.csv
cat train_headers.csv train_1m.csv > train_1m_hdr.csv
tail -n 10000000 train.csv | shuf | head -n 1000000 > train_1m_rnd.csv
cat train_headers.csv train_1m_rnd.csv > train_1m_rnd_hdr.csv

And here’s a really handy kernel that shows how to load the files with minimal memory use - which will make all your processing faster too: https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage
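The core trick in that kernel is declaring narrow dtypes when reading, instead of pandas’ default int64/float64. A minimal sketch (the specific dtypes below are assumptions for illustration, not the kernel’s exact mapping):

```python
import io
import numpy as np
import pandas as pd

csv = io.StringIO("store_nbr,item_nbr,unit_sales\n1,103665,7.0\n2,105574,1.0\n")

# Narrow dtypes declared up front cut memory use substantially
dtypes = {"store_nbr": np.int8, "item_nbr": np.int32, "unit_sales": np.float32}
df = pd.read_csv(csv, dtype=dtypes)
print(df.memory_usage(deep=True).sum())
```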


If you use this approach, you would want to use skiprows too, since otherwise you’d have the first few rows (which are the earliest dates) rather than the last few (which are the most recent dates, and the ones that will be most useful for modeling).


Note that you can get in the top 40 on the leaderboard by calculating per-group means, no machine learning algorithms required: https://www.kaggle.com/ceshine/mean-baseline-lb-59
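The idea in that kernel boils down to: group train by (item, store), take the mean, and join it onto test. A toy sketch (the real kernel also handles dates and a log transform, which this omits):

```python
import pandas as pd

train = pd.DataFrame({
    "item_nbr":   [1, 1, 2, 2],
    "store_nbr":  [9, 9, 9, 9],
    "unit_sales": [2.0, 4.0, 10.0, 10.0],
})
test = pd.DataFrame({"item_nbr": [1, 2, 3], "store_nbr": [9, 9, 9]})

# Per-(item, store) mean as the prediction; unseen pairs fall back to 0
means = train.groupby(["item_nbr", "store_nbr"])["unit_sales"].mean()
pred = test.join(means, on=["item_nbr", "store_nbr"]).fillna({"unit_sales": 0})
print(pred["unit_sales"].tolist())  # [3.0, 10.0, 0.0]
```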

And you can get in the top 100 with a single (long) line of code!

(pd.read_csv("../input/test.csv", usecols=[0, 2, 3])
   .set_index(['item_nbr', 'store_nbr'])
   .join(pd.read_csv('../input/train.csv', usecols=[0, 2, 3, 4],
                     converters={'unit_sales': lambda u: float(u) if float(u) > 0 else 0},
                     skiprows=range(1, 124035460))
           .groupby(['item_nbr', 'store_nbr'])['unit_sales']
           .median().to_frame('unit_sales'),
         how='left')
   .fillna(0)
   .to_csv('median.csv', float_format='%.2f', index=None))

I got to 162 with the log of the mean unit_sales. haha. About to submit actual random forest predictions.

Something for people to think about before Tuesday: how can you improve on the per-group means method shown in the kernel linked above? No clever math or ML needed - just a little common sense! :slight_smile:

If you can figure out a way, try it out - you might be able to break into the top 30 of this competition without too much work…


doh! missed that shuf in there! good one!

BTW from the Kaggle kernels, here’s how to do a random sample in pandas:

import random
import pandas as pd

n = 125497040  # number of records in the file (excludes header)
s = 10000      # desired sample size
select = sorted(random.sample(range(1, n + 1), s))
skip = tuple(set(range(1, n + 1)) - set(select))  # 1-based rows to skip; row 0 is the header
df_train = pd.read_csv("../input/train.csv", skiprows=skip)

You could use a variant of that to get a random sample of the last few rows too.
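One way to sketch that variant: restrict the sampled rows to the tail of the file, and skip everything else. The sizes below are toy values standing in for the real row counts:

```python
import io
import random
import pandas as pd

n = 100     # data rows in the file (excludes header)
tail = 30   # only sample from the last 30 rows (the most recent dates)
s = 10      # desired sample size

csv = io.StringIO("id,unit_sales\n" + "\n".join(f"{i},{i}" for i in range(n)))

random.seed(0)
# 1-based data-row numbers drawn only from the tail of the file
keep = set(random.sample(range(n - tail + 1, n + 1), s))
skip = [i for i in range(1, n + 1) if i not in keep]  # row 0 is the header
df = pd.read_csv(csv, skiprows=skip)
print(len(df))  # 10
```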


I’ve been using df.sample(frac=0.1). Is that the same thing?

“A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts donating water and other first need products which greatly affected supermarket sales for several weeks after the earthquake.”

Would excluding outliers from the calculations help?

Edit: skiprows already excludes this period

Yes, except that if you don’t have enough RAM to load the dataset in the first place, you can’t use that approach.

Why does my job get killed instantly on AWS even though I’m only taking a sample of the train data, and I’m using a t2.medium with 30 GB of storage? Should I upgrade the storage, or could there be some other problem?