Latest anaconda3 dies loading 4.9G file

That is all the code, minus the imports and the lowmem setting, which I include below. I got the types and other argument ideas from that link and similar ones on Kaggle.

This python3 loadtrain.py fails:

import pandas as pd
import sys

lowmem = False
if len(sys.argv)>1 and sys.argv[1]=='-lowmem':
    lowmem = True

types = {'id':'int32', 'data':'datetime64', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True # supposed to make it faster sometimes
                    ,low_memory=lowmem # fails if false on Mac OS Sierra due to readline bug?
                    # ,nrows=500000 #Actually it's easier to do head from command-line
                   )
print(sales.info())

With low_memory=False it always dies a miserable death. Changing onpromotion to object doesn’t change anything.

The problem I had was that I misspelled date as data in types. When I spelled it correctly, I got a warning not to specify date formats and use parse_dates, which I was already doing. I still get a warning with this code:

import pandas as pd
import sys

lowmem = False
if len(sys.argv)>1 and sys.argv[1]=='-lowmem':
    lowmem = True

types = {'id':'int32', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True,
                    low_memory=lowmem
                   )
print(sales.info())
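A misspelled key in the dtype dict, like the data/date mix-up above, can be caught before the big load by reading only the header. This is a minimal sketch, assuming a tiny in-memory stand-in for train.csv (the sample row is made up; only the column names come from the thread):

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for ../data/train.csv; only the header matters here.
csv_text = (
    "id,date,store_nbr,item_nbr,unit_sales,onpromotion\n"
    "0,2013-01-01,25,103665,7.0,\n"
)

# Same idea as the first script, including the 'data' typo.
types = {'id': 'int32', 'data': 'datetime64', 'store_nbr': 'int16'}

# nrows=0 parses just the header line, so this stays cheap even for a multi-GB file.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Any dtype key that is not a real column name is probably a typo.
unknown = sorted(set(types) - set(header))
print(unknown)
```

With the real file you would pass the path instead of the StringIO object; anything left in unknown is worth fixing before the full read.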

Using the same args and these types from that webpage:

# From https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage
types = {
    'id': 'int64',
    'item_nbr': 'int32',
    'store_nbr': 'int8',
    'unit_sales': 'float32',
    'onpromotion': bool,
}

I still get the warning.

In summary, that message appears no matter what I do. Here’s another variation that doesn’t specify the low_memory argument:

types = {'id':'int32', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True
                   )

I get

sys:1: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.

Whatever, it loads and I can move on to the next thing that dies for memory reasons. haha

Oh right you have lowmem=False. That’s not what you want when manually setting dtypes. Set it to True, and it’ll all work nicely.

Yeah, it has been working with low mem True but I still get the warning (not the malloc/free error). No biggie.

Cool. The reason for the warning is because column 5 (onpromotion) wasn’t able to be made a bool. You can see it had to turn it into an object - that’s because there were some missing values, and IIRC Pandas doesn’t like missing values in bool columns.
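To see what the column looks like when it is read as object, here is a small sketch on made-up data (the three rows are hypothetical; only the column name onpromotion comes from the thread). The missing entry stays NaN and the other values arrive as strings:

```python
import io
import pandas as pd

# Hypothetical sample: one missing onpromotion value, one False, one True.
csv_text = (
    "id,onpromotion\n"
    "0,\n"
    "1,False\n"
    "2,True\n"
)

# Asking for object keeps the raw text and leaves the empty field as NaN.
df = pd.read_csv(io.StringIO(csv_text),
                 dtype={'id': 'int32', 'onpromotion': object})

print(df.onpromotion.dtype)          # dtype of the column as loaded
print(df.onpromotion.isna().sum())   # how many entries are missing
print(df.onpromotion.tolist())       # the raw values
```

The exact behaviour when you ask for bool on a column with missing values seems to depend on the pandas version, so this sketch only shows the object route.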

When I use ‘object’ for that column, I get no error. After reading, I do:

df_all.onpromotion.fillna(False, inplace = True)
df_all.onpromotion = df_all.onpromotion.map({'False' : False, 'True' : True})
df_all.onpromotion = df_all.onpromotion.astype(bool)

It takes <2 mins to read and parse the whole file, and then <5 secs to save it in feather format, which isn’t bad at all! After reading it once from CSV, from then on you can read straight from feather format in a few seconds.

excellent advice. thanks! Yeah, i did see that and changed to object but didn’t remember whether warning went away.

Actually, don’t we want a ternary value for onpromotion: true, false, and NA? I currently have it as a category, then changed it to cat.codes+1 to get 0, 1, 2.

Yes good point, it would be best to try it as ternary. The kernel that takes averages treats NA as False and gets good results, but it would be better to test yourself of course.
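The cat.codes+1 idea can be sketched like this, assuming onpromotion was read as strings with NaN for missing values (the four sample values are made up). pandas gives missing entries the code -1, so adding 1 maps NA to 0, 'False' to 1 and 'True' to 2:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the column as read with dtype object.
onpromotion = pd.Series(['True', np.nan, 'False', 'True'], dtype=object)

# Categories are sorted ('False', 'True'); NaN is not a category and gets code -1.
codes = onpromotion.astype('category').cat.codes + 1

print(codes.tolist())
```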

BTW I forgot that onpromotion is now read as a string. So I just edited my post above to add:

df_all.onpromotion = df_all.onpromotion.map({'False' : False, 'True' : True})

omg, i’m having so much fun! Should have tried these challenges ages ago.

Be warned - highly addictive!

Too late!! Now I’m thinking lolviz needs a DataFrame viewer for when I’m not in ipython.

Did you fix your issue, @jmcarpenter2?
I am also using a t2.medium on AWS and see the same issue as you, @jmcarpenter2: the Jupyter kernel dies and restarts while the CSV is being loaded.

I also tried @parrt’s suggestion to add 32GB of virtual memory to my instance, but the issue did not go away. Then I changed to low_memory=True, and after a couple more minutes I saw the following message from Jupyter:

/home/ubuntu/anaconda3/envs/fastai/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Any clues?

Look at this comment - it addresses your issue:

I also wrote a kernel dealing with this issue; it shows the same solution as in the comment I linked to earlier, along with some explanation. You can check it out here: https://www.kaggle.com/kunalkotian/simple-solution-for-loading-train-csv-w-o-crashing

Note that that is just a warning. Please post your read_csv() line. Run print(sales.info()) and it will show that the last column, which looks boolean but has NA values, has become an “object” (i.e., a string). Set the dtype arg as I do above.

Thank you!

Anyone looking at the groceries competition should probably read all of forums.fast.ai/t/kaggle-grocery-competition/6801 and the kernels linked from there, since there’s a lot of good info now available. Post replies in that thread if you have any questions, issues, etc after reading what’s there.

I looked through the thread, but I am still having trouble loading the whole file after trying to specify dtype.
Here’s my code

It throws “MemoryError”

Can anyone help?
Thanks!

The trick here is that you don’t need all of the data. The successful model so far used maybe the last few weeks of training data. Investigate skiprows argument to read_csv.
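For anyone wondering what the skiprows suggestion looks like, here is a rough sketch on a made-up 10-row file (the data, total_rows and keep_last are all hypothetical). Passing a range that starts at 1 keeps the header line and skips the oldest data rows:

```python
import io
import pandas as pd

# Hypothetical 10-row stand-in for train.csv, oldest rows first.
lines = ["id,date,unit_sales"] + [
    f"{i},2017-08-{i + 1:02d},{i}.0" for i in range(10)
]
csv_text = "\n".join(lines) + "\n"

total_rows = 10   # number of data rows in the file (assumed known, e.g. from wc -l)
keep_last = 3     # how many of the most recent rows to load

# Line 0 is the header; skip data lines 1 .. total_rows - keep_last.
recent = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=range(1, total_rows - keep_last + 1),
    parse_dates=['date'],
)
print(recent)
```

On the real file you would need the actual row count and a cutoff that matches the weeks you want; the numbers here are only for illustration.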

Okay. THANKS! That makes my life easier.

Hi, I found a small problem.
When I fillna with False, the fill value is a bool.
However, the map {'False' : False, 'True' : True} only has the string ‘False’ as a key, so the bool False inserted by fillna is not matched and becomes NaN.
Finally, when I run df_all.onpromotion = df_all.onpromotion.astype(bool), the NaN turns into True.
This is my code, with False: False added to the map:

df_all.onpromotion.fillna(False, inplace=True)
df_all.onpromotion = df_all.onpromotion.map({'False': False, 'True': True,False:False})
df_all.onpromotion = df_all.onpromotion.astype(bool)

Sorry for my poor English.
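The pitfall described here can be reproduced on a three-value sample (made up for illustration). The sketch compares the fillna-then-map order with mapping first and filling afterwards, plus a direct string comparison as another option; none of this is claimed to be what the kernels do:

```python
import numpy as np
import pandas as pd

# Hypothetical sample of onpromotion as read with dtype object.
raw = pd.Series(['True', np.nan, 'False'], dtype=object)
mapping = {'False': False, 'True': True}

# fillna first: the filled-in bool False is not a key in the string-keyed map,
# so map() turns it back into NaN, and astype(bool) then converts NaN to True.
buggy = raw.fillna(False).map(mapping).astype(bool)

# map first, then fill: missing values end up False as intended.
fixed = raw.map(mapping).fillna(False).astype(bool)

# Alternative: compare against the string directly; NaN compares unequal, so it becomes False.
alt = raw == 'True'

print(buggy.tolist())
print(fixed.tolist())
print(alt.tolist())
```

Adding False: False to the map, as in the post above, gets the same result as the map-then-fill order.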