Cool. The reason for the warning is because column 5 (onpromotion) wasn’t able to be made a bool. You can see it had to turn it into an object - that’s because there were some missing values, and IIRC Pandas doesn’t like missing values in bool columns.
When I use ‘object’ for that column, I get no error. After reading, I do:
It takes <2 mins to read and parse the whole file, and then <5 secs to save it in feather format, which isn’t bad at all! After reading it once from CSV, from then on you can read straight from feather format in a few seconds.
Yes good point, it would be best to try it as ternary. The kernel that takes averages treats NA as False and gets good results, but it would be better to test yourself of course.
BTW I forgot that onpromotion is now read as a string. So I just edited my post above to add:
Did you fix your issue @jmcarpenter2?
I am also using t2.medium on AWS and see the same issue as yours @jmcarpenter2. i.e. Jupyter kernel dies and reboot when the csv is being loaded.
I also tried @parrt’s suggestion to append 32GB virtual memory to my instance. However the issue is not gone. Then I change low_memory=True. I see the following error from Jupyter after a couple more minutes.
/home/ubuntu/anaconda3/envs/fastai/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
note that that is just a warning. please post your read_csv() line. print sales.describe() and that will tell you last column looks boolean but has NA values so it becomes an “object” (i.e., a string). set the dtype arg as I do above.
Anyone looking at the groceries competition should probably read all of forums.fast.ai/t/kaggle-grocery-competition/6801 and the kernels linked from there, since there’s a lot of good info now available. Post replies in that thread if you have any questions, issues, etc after reading what’s there.
The trick here is that you don’t need all of the data. The successful model so far used maybe the last few weeks of training data. Investigate skiprows argument to read_csv.
Hi, I find a little problem.
When I fillna with False, I think the False is a bool type.
However when I map {'False' : False, 'True' : True}, I think the first ‘False’ is a string, so the last step False will return to be NaN.
And finally when I run df_all.onpromotion = df_all.onpromotion.astype(bool) I find the NaN turn out to be True.
This is my code: