Kaggle grocery competition

Here are the steps I took to get this done.

Install shuf on your server, then:

tail -n +2 train.csv | shuf -n 10000 > rand_train.csv   # sample 10,000 rows, skipping the header line
head -1 train.csv > head.csv                            # save the header row
cat head.csv rand_train.csv > sample.csv                # header first, then the sampled rows (name the file whatever you want)
hope it helps.


It's easier to just use nrows in read_csv() and avoid lots of files lying around.
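For example (a minimal sketch; the file path is just an assumption):

import pandas as pd
# Read only the first 10,000 data rows; the header row still comes along for free
df_small = pd.read_csv('train.csv', nrows=10000)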


I was trying to get a random sample. When I tried to do that with pandas, it took too long.


Yup I like the ‘shuf’ approach, since it’s nice to be able to switch between different samples easily. In this case, I’d suggest having two files - one which is the last million rows, one which is a random sample of a million of the last ten million rows:

head -1 train.csv > train_headers.csv                                   # save the header row
tail -n 1000000 train.csv > train_1m.csv                                # last 1M rows (the most recent dates)
cat train_headers.csv train_1m.csv > train_1m_hdr.csv
tail -n 10000000 train.csv | shuf | head -n 1000000 > train_1m_rnd.csv  # random 1M of the last 10M rows
cat train_headers.csv train_1m_rnd.csv > train_1m_rnd_hdr.csv

And here’s a really handy kernel that shows how to load the files with minimal memory use - which will make all your processing faster too: https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage
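The kernel's exact code isn't reproduced here, but the core idea is to tell pandas the smallest dtype that will hold each column instead of letting everything default to 64-bit types. A rough sketch (the column names and dtypes are assumptions based on this competition's files):

import pandas as pd
# Assumed column names/dtypes for train.csv - adjust to match the actual file
dtypes = {'id': 'int32', 'store_nbr': 'int8', 'item_nbr': 'int32',
          'unit_sales': 'float32', 'onpromotion': 'object'}
train = pd.read_csv('train.csv', dtype=dtypes, parse_dates=['date'])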


If you use this approach, you would want to use skiprows, since otherwise you'd have the first few rows (which are the earliest dates) rather than the last few (which are the most recent dates, and the ones that will be most useful for modeling).
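For example, to keep only the last million rows with read_csv (a sketch; the total row count is the one quoted later in this thread):

import pandas as pd
n = 125497040        # total data rows in train.csv (excluding the header)
keep_last = 1000000  # how many of the most recent rows to keep
# Skip rows 1..(n - keep_last); row 0 is the header, so it is kept automatically
df = pd.read_csv('train.csv', skiprows=range(1, n - keep_last + 1))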


Note that you can get in the top 40 on the leaderboard by calculating per-group means, no machine learning algorithms required: https://www.kaggle.com/ceshine/mean-baseline-lb-59

And you can get in the top 100 with a single (long) line of code!

pd.read_csv("../input/test.csv", usecols=[0,2,3] ).set_index(
 ['item_nbr','store_nbr'] ).join( 
 pd.read_csv('../input/train.csv', usecols=[0,2,3,4],
 converters={'unit_sales':lambda u: float(u) if float(u)>0 else 0},
 skiprows=range(1,124035460) ).groupby( ['item_nbr','store_nbr'] )
 ['unit_sales'].median().to_frame('unit_sales'), how='left').fillna(0)
 .to_csv('median.csv', float_format='%.2f', index=None) 

I got to 162 with the log of the mean unit_sales, haha. About to submit actual random forest predictions.

Something for people to think about before Tuesday: how can you improve on the per-group means method shown in the kernel linked above? No clever math or ML needed - just a little common sense! 🙂

If you can figure out a way, try it out - you might be able to break into the top 30 of this competition without too much work…


Doh! Missed that shuf in there! Good one!

BTW, from the Kaggle kernels, here's how to do a random sample in pandas:

import random
import pandas as pd

n = 125497040  # number of records in file (excludes header)
s = 10000      # desired sample size
select = sorted(random.sample(range(1, n + 1), s))  # row numbers to keep
skip = tuple(set(range(1, n + 1)) - set(select))     # everything else gets skipped
df_train = pd.read_csv("../input/train.csv", skiprows=skip)

You could use a variant of that to get a random sample of the last few rows too.
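For example, restricting the sample to the most recent rows might look something like this (a sketch reusing the row count above; skiprows also accepts a callable, which avoids building a huge tuple):

import random
import pandas as pd

n = 125497040    # data rows in train.csv (excluding the header)
last = 10000000  # only sample from the most recent 10M rows
s = 10000        # desired sample size
select = set(random.sample(range(n - last + 1, n + 1), s))
# Keep the header (row 0) plus the selected recent rows, skip everything else
df_train = pd.read_csv("../input/train.csv",
                       skiprows=lambda i: i != 0 and i not in select)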


I’ve been using df.sample(frac=0.1). Is that the same thing?

“A magnitude 7.8 earthquake struck Ecuador on April 16, 2016. People rallied in relief efforts donating water and other first need products which greatly affected supermarket sales for several weeks after the earthquake.”

Would excluding outliers from the calculations help?

Edit: skiprows already excludes this period

Yes, except if you don’t have enough RAM to load the dataset in the first place, you can’t use that approach.

Why does my job get killed instantly on AWS even though I'm only taking a sample of the train data? I'm using a t2.medium with 30 GB of storage. Should I upgrade the storage, or could there be some other problem?

The issue is RAM, not hard drive space.

t2.medium has 4 GB of RAM. You're probably trying to load the 5 GB train.csv file all at once into a dataframe. Since it can't fit in RAM, that is most likely what's getting your job killed.
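One common workaround (a sketch, not necessarily what the linked thread suggests) is to stream the file in chunks so only a slice of it is ever in memory, sampling from each chunk as you go:

import pandas as pd
# Read 1M rows at a time and keep a 1% random sample of each chunk
chunks = pd.read_csv('train.csv', chunksize=1000000)
sample = pd.concat(chunk.sample(frac=0.01) for chunk in chunks)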

Read the entire thread linked below - solutions to this problem are posted here:


Part of becoming a programmer is learning to say something other than “it doesn’t work”, which is what you are saying when you say “job gets killed instantly.” You need to be specific as in “I get the following error when I run this specific program.”

I will be sure to take care of this in the future.

I am using a random forest to practice what was taught in class, using the Grocery dataset. I was wondering whether the predictions coming from a random forest are bounded between the minimum and maximum of the unit_sales per category fed into the model. So if there is an increasing/decreasing trend in unit_sales, I doubt the random forest will be able to capture that in its predictions.

Well spotted - yes, they are bounded, and it's a major issue to consider. Although since in this case it's only two weeks ahead to predict, I doubt it's a problem here.
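As a quick illustration of that boundedness (a toy sketch with scikit-learn, not code from the course):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit on a perfectly increasing trend: y = x for x in 0..99
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Predictions beyond the training range stay near max(y); the trend is not extrapolated
print(rf.predict([[150], [200]]))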
