Latest anaconda3 dies loading 4.9G file

parrt · October 28, 2017, 5:22pm

That is all the code, minus the imports and the lowmem setting; the full script follows. I got the types and other argument ideas from that link and similar ones on Kaggle.

This python3 loadtrain.py fails:

import pandas as pd
import sys

lowmem = False
if len(sys.argv)>1 and sys.argv[1]=='-lowmem':
    lowmem = True

types = {'id':'int32', 'data':'datetime64', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True # supposed to make it faster sometimes
                    ,low_memory=lowmem # fails if false on Mac OS Sierra due to readline bug?
                    # ,nrows=500000 #Actually it's easier to do head from command-line
                   )
print(sales.info())

With low_memory=False it always dies a miserable death. Changing onpromotion to object doesn’t change anything.

The problem I had was that I misspelled date as data in types. When I spelled it correctly, I got a warning telling me not to specify a date type in dtype and to use parse_dates instead, which I was already doing. I still get a warning with this code:

import pandas as pd
import sys

lowmem = False
if len(sys.argv)>1 and sys.argv[1]=='-lowmem':
    lowmem = True

types = {'id':'int32', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True,
                    low_memory=lowmem
                   )
print(sales.info())
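
The misspelled key described above (data instead of date) can be caught before the full load. A minimal sketch, assuming a hypothetical helper named unknown_dtype_keys and a tiny in-memory sample standing in for train.csv: it reads only the header row and reports any dtype keys that match no column.

```python
import io
import pandas as pd

def unknown_dtype_keys(csv_source, dtypes):
    """Return the dtype keys that match no column in the CSV header (hypothetical helper)."""
    header = pd.read_csv(csv_source, nrows=0)  # nrows=0 parses the header line only
    return sorted(set(dtypes) - set(header.columns))

# Made-up sample with the same column names as train.csv
sample = io.StringIO(
    "id,date,store_nbr,item_nbr,unit_sales,onpromotion\n"
    "0,2013-01-01,25,103665,7.0,\n"
)
types = {'id': 'int32', 'data': 'datetime64', 'store_nbr': 'int16'}

typos = unknown_dtype_keys(sample, types)
print(typos)
```

With the real file, the same call would take the path '../data/train.csv' in place of the in-memory sample.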

Using the same args and these types from that webpage:

# From https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage
types = {
    'id': 'int64',
    'item_nbr': 'int32',
    'store_nbr': 'int8',
    'unit_sales': 'float32',
    'onpromotion': bool,
}

I still get the warning.

In summary, that message appears no matter what I do. Here’s another variation that doesn’t specify the low_memory argument:

types = {'id':'int32', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True
                   )

I get

sys:1: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.

Whatever, it loads and I can move on to the next thing that dies for memory reasons. haha

jeremy · October 28, 2017, 5:55pm

Oh right you have lowmem=False. That’s not what you want when manually setting dtypes. Set it to True, and it’ll all work nicely.

parrt · October 28, 2017, 6:01pm

Yeah, it has been working with low mem True but I still get the warning (not the malloc/free error). No biggie.

jeremy · October 28, 2017, 6:26pm

Cool. The reason for the warning is that column 5 (onpromotion) couldn’t be made a bool. You can see it had to be turned into an object: there were some missing values, and IIRC Pandas doesn’t like missing values in bool columns.
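
A small illustration of this, using a made-up three-row sample rather than the real train.csv: when no dtype is forced, a True/False column that contains a missing value comes back as object instead of bool.

```python
import io
import pandas as pd

# Made-up sample: the first row has no onpromotion value
csv_text = "id,onpromotion\n0,\n1,True\n2,False\n"

df = pd.read_csv(io.StringIO(csv_text))
promo_dtype = df['onpromotion'].dtype            # dtype pandas chose for the column
n_missing = int(df['onpromotion'].isna().sum())  # number of missing entries
print(promo_dtype, n_missing)
```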

When I use ‘object’ for that column, I get no error. After reading, I do:

df_all.onpromotion.fillna(False, inplace = True)
df_all.onpromotion = df_all.onpromotion.map({'False' : False, 'True' : True})
df_all.onpromotion = df_all.onpromotion.astype(bool)

It takes <2 mins to read and parse the whole file, and then <5 secs to save it in feather format, which isn’t bad at all! After reading it once from CSV, from then on you can read straight from feather format in a few seconds.

parrt · October 28, 2017, 6:27pm

excellent advice. thanks! Yeah, i did see that and changed to object but didn’t remember whether warning went away.

parrt · October 28, 2017, 6:33pm

actually, don’t we want a trinary for onpromotion? true, false, and na? i currently have as cat then changed to cat.codes+1 to get 0, 1, 2
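
A minimal sketch of that cat.codes+1 encoding on a made-up Series (the values are illustrative, not taken from train.csv). Missing values get category code -1, so adding 1 maps NA to 0, 'False' to 1 and 'True' to 2.

```python
import numpy as np
import pandas as pd

# onpromotion as read with dtype object: strings plus missing values
onpromotion = pd.Series(['True', np.nan, 'False', 'True'], dtype=object)

# Categories sort as ['False', 'True']; NaN gets code -1
codes = onpromotion.astype('category').cat.codes + 1
print(codes.tolist())  # [2, 0, 1, 2]
```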

jeremy · October 28, 2017, 6:35pm

Yes good point, it would be best to try it as ternary. The kernel that takes averages treats NA as False and gets good results, but it would be better to test yourself of course.

BTW I forgot that onpromotion is now read as a string. So I just edited my post above to add:

df_all.onpromotion = df_all.onpromotion.map({'False' : False, 'True' : True})

parrt · October 28, 2017, 6:36pm

omg, i’m having so much fun! Should have tried these challenges ages ago.

jeremy · October 28, 2017, 6:38pm

Be warned - highly addictive!

parrt · October 28, 2017, 6:40pm

too late!! Now I’m thinking lolviz needs a DataFrame viewer for when I’m not in ipython.

Sichen · October 30, 2017, 6:04am

Did you fix your issue @jmcarpenter2?
I am also using a t2.medium on AWS and see the same issue as you, @jmcarpenter2: the Jupyter kernel dies and restarts while the CSV is being loaded.

I also tried @parrt’s suggestion to add 32GB of virtual memory to my instance, but the issue is still there. Then I changed to low_memory=True, and after a couple more minutes I saw the following message from Jupyter:

/home/ubuntu/anaconda3/envs/fastai/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Any clues?

kunalkotian · October 30, 2017, 8:52am

Look at this comment - it addresses your issue:

kunalkotian · October 30, 2017, 8:58am

I also wrote a kernel dealing with this issue; it shows the same solution as in the comment I linked to earlier, along with some explanation. You can check it out here: https://www.kaggle.com/kunalkotian/simple-solution-for-loading-train-csv-w-o-crashing

parrt · October 30, 2017, 2:50pm

Note that that is just a warning, not an error. Please post your read_csv() line. Print sales.describe() and it will show that the last column looks boolean but has NA values, so it becomes an “object” (i.e., a string). Set the dtype arg as I do above.

Sichen · October 30, 2017, 4:35pm

Thank you!

jeremy · October 30, 2017, 4:39pm

Anyone looking at the groceries competition should probably read all of forums.fast.ai/t/kaggle-grocery-competition/6801 and the kernels linked from there, since there’s a lot of good info now available. Post replies in that thread if you have any questions, issues, etc after reading what’s there.

fliu23 · November 2, 2017, 3:30am

I looked through the thread, but I am still having trouble loading the whole file, even after specifying dtype.
Here’s my code:

It throws “MemoryError”

Can anyone help?
Thanks!

parrt · November 2, 2017, 7:15pm

The trick here is that you don’t need all of the data. The successful model so far used maybe the last few weeks of training data. Investigate the skiprows argument to read_csv.
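
A minimal sketch of the skiprows idea, assuming the total number of data rows is already known (for example from a line count of the file) and using a made-up ten-row sample in place of train.csv. Skipping starts at line 1 so that the header on line 0 is kept.

```python
import io
import pandas as pd

# Made-up sample: a header plus ten data rows with ids 0..9
csv_text = "id,unit_sales\n" + "".join(f"{i},{i * 1.5}\n" for i in range(10))

n_total = 10  # assumed known number of data rows
n_keep = 3    # how many of the most recent rows to load

# Skip data lines 1..(n_total - n_keep); line 0 is the header and is kept
tail = pd.read_csv(io.StringIO(csv_text),
                   skiprows=range(1, n_total - n_keep + 1))
print(tail['id'].tolist())  # [7, 8, 9]
```

Since only the kept rows are materialized into the DataFrame, this can be combined with the dtype settings discussed earlier in the thread.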

fliu23 · November 2, 2017, 7:46pm

Okay, thanks! That makes my life easier.

carl985 · May 30, 2020, 2:55am

Hi, I found a small problem.
When I fillna with False, the fill value is a bool.
However, the map {'False' : False, 'True' : True} uses the string ‘False’ as its key, so the bool False produced by the previous step has no match and becomes NaN.
Finally, when I run df_all.onpromotion = df_all.onpromotion.astype(bool), the NaN turns into True.
This is my code, with False:False added to the map:

df_all.onpromotion.fillna(False, inplace=True)
df_all.onpromotion = df_all.onpromotion.map({'False': False, 'True': True,False:False})
df_all.onpromotion = df_all.onpromotion.astype(bool)
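
A small reproduction of the effect described above, on a made-up Series rather than the real data. The variant shown at the end, which fills with the string 'False' before mapping, is just one possible alternative and is not taken from the thread.

```python
import numpy as np
import pandas as pd

mapping = {'False': False, 'True': True}

# onpromotion as read with dtype object: strings plus a missing value
raw = pd.Series(['True', np.nan, 'False'], dtype=object)

# Original order: fillna inserts the bool False, which is not a key in the
# string-keyed map, so it becomes NaN, and astype(bool) turns NaN into True
broken = raw.fillna(False).map(mapping).astype(bool)

# Alternative: fill with the string 'False' so every value has a key in the map
fixed = raw.fillna('False').map(mapping).astype(bool)

print(broken.tolist())  # [True, True, False]
print(fixed.tolist())   # [True, False, False]
```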

Sorry for my poor English.