Latest anaconda3 dies loading 4.9G file

Python 3.6 is not amazing me so far. This CSV load kills the interpreter itself with a memory problem!!

import pandas as pd

sales = pd.read_csv('../data/train.csv', parse_dates=['date'],
                    low_memory=False)
python avg.py 
python(56943,0x7fffe21b03c0) malloc: *** mach_vm_map(size=18446744071562067968) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
python(56943,0x7fffe21b03c0) malloc: *** error for object 0x106652ac1: pointer being freed was not allocated
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Just a heads up to people. It makes it kind of hard to use Python 3 for machine learning on a real problem, but…

Hmm… seems to be a Mac thing, at least on the Sierra version of the operating system. See https://github.com/xonsh/xonsh/issues/2078. They claim the GNU readline library is the issue, but I worked around it in the meantime by not passing low_memory=False. It takes 2 minutes 30 seconds to load the 4.9G. That seems slow.

I think it may just be that you’re less familiar with the details of this environment, compared to what you’re used to. Pandas is about as fast as any other data frame reader of equivalent functionality in my testing - if you find something better please do share what you come across. (I know there are very basic CSV readers that are faster, but they don’t construct real data frames, just parse out very basic structures).

In this case, your problem is using ‘low_memory=False’, when in fact you do have a shortage of memory compared to the file size!

Most practitioners I know use Python 3 or R for machine learning on real problems…

I have 32G of RAM. Even at 4 bytes per character, reading everything into memory at once should only be about 20G (the file is 4.9G, and 4.9G × 4 ≈ 20G). It appears to be a bug in Mac OS Sierra.

I thought that I had seen read_csv go faster, which is why I said it seemed slow. It definitely seems very fast to me in general, but I guess I’ve never tried to load a table this big in one go, particularly given py3 might use int for chars.

After loading chars (hopefully line by line) into memory, I get about 9G for the frame per getsizeof but 6G per pandas:

print("getsizeof in bytes: %d" % sys.getsizeof(sales))
print("size in bytes: %d" % sales.memory_usage(index=True).sum())
getsizeof in bytes: 9,067,029,472
size in bytes: 6,023,858,000
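
One likely reason for the gap: memory_usage() without deep=True doesn't count the Python string objects held in object-dtype columns, whereas getsizeof on a DataFrame (as far as I know) does include them. A quick check, as a sketch assuming the same sales frame as above:

# deep=True also measures the Python objects inside object-dtype columns,
# unlike the shallow default measurement used above.
print("deep size in bytes: %d" % sales.memory_usage(index=True, deep=True).sum())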

Python 3 uses Unicode strings by default, plus has some overhead for object management. So the initial load will be memory intensive. Once you’ve converted to pandas categories, things will be much better (unless most of your strings are unique). If you can’t load all at once to create the categories, you could:

  • Do it column by column (see the sketch after this list), or
  • Use Dask (which is the main package for handling data that’s too big for RAM), or
  • Pre-process at the command line into integer codes, or
  • Other tricks we’ll learn in class :slight_smile:
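
For the column-by-column option, a minimal sketch (assuming the same ../data/train.csv path and the item_nbr column used elsewhere in this thread):

import pandas as pd

# Read just one column at a time to keep peak memory low, then build the
# categories / integer codes you will use when loading the full table.
col = pd.read_csv('../data/train.csv', usecols=['item_nbr'])['item_nbr']
item_categories = col.astype('category').cat.categories
print("%d distinct item_nbr values" % len(item_categories))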

We’ll be talking about large file handling in some detail next week, FYI.


Groovy, making progress. I have merged the tables for the grocery contest and am about to shove them into an RF (after following your data prep procedures).

I too am failing to load these data into memory (even with low_memory=True). The kernel of my jupyter notebook, which I am connected to through tunneling into my t2.medium AWS instance, keeps crashing/restarting on me every time I try to load the ~5GB train.csv data into memory. I believe that it is due to the fact that t2.medium only has 4GB memory.

Should I launch a larger instance in order to do any processing of these data or wait to work on this competition until we discuss large file handling next week? Thanks for any advice you can provide.

Also note that even 4 bytes/char gives only 20G, so reading the entire thing as a string should work. Python 2:

import sys

with open('/Users/parrt/github/kaggle-grocery/data/train.csv') as f:
    s = f.read()
    print("bytes %d" % sys.getsizeof(s))

It gets 4.9G, so no problem with 32G of RAM:

time python load.py
bytes 4997452325

real	0m3.318s
user	0m0.009s
sys	0m3.300s

Dang it. Python 3 doesn’t accept that. My environment must be hosed. I get:

    s = f.read()
OSError: [Errno 22] Invalid argument

Boooo! The Py3 docs say it's fine. Naturally, stepping into the Python source code is a fool's errand; nothing makes sense (from PyCharm anyway). Bah!

Ah! No doubt the same error I warn students about: my file was load.py, so I thought that was it. Nope, same error with myload.py.

Virtual memory should make that possible but slow. Hmm…

Interesting. It appears to be a size-limit issue. 4G is more than a signed 32-bit int can hold, so maybe I downloaded 32-bit Anaconda? Nope, 64-bit. f.read(3_000_000_000) fails but f.read(2_000_000_000) works. I'll try mmap().
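
In the meantime, one workaround is to read the file in sub-2GB pieces and join them at the end. A sketch, assuming the relative ../data/train.csv path used elsewhere in the thread (the 1GB chunk size is an arbitrary choice):

import sys

# Work around the apparent ~2GB f.read() cap by reading in smaller pieces.
CHUNK = 1_000_000_000  # arbitrary, just needs to stay under the cap

parts = []
with open('../data/train.csv') as f:
    while True:
        part = f.read(CHUNK)
        if not part:
            break
        parts.append(part)
s = ''.join(parts)
print("bytes %d" % sys.getsizeof(s))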

Here’s how to handle it in <4GB, from the Kaggle forum (which is a great place to look for tips!):

Also, here's how to create, in bash, two files: one containing the last (most recent) million rows, and one containing a million randomly selected rows from the last 10 million rows. I always recommend doing as much work on a subset as possible.

head -1 train.csv > train_headers.csv
tail -n 1000000 train.csv > train_1m.csv
cat train_headers.csv train_1m.csv > train_1m_hdr.csv
tail -n 10000000 train.csv | shuf | head -n 1000000 > train_1m_rnd.csv
cat train_headers.csv train_1m_rnd.csv > train_1m_rnd_hdr.csv

Don’t forget folks, use man to learn how the above commands work, if you don’t already know.


pandas read_csv lets you read a chunk at a time, so no need to read the whole thing as a string, or use mmap.
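
For reference, a minimal sketch of chunked reading (the chunk size of one million rows is arbitrary):

import pandas as pd

# read_csv with chunksize returns an iterator of DataFrames instead of
# loading everything into memory at once.
rows = 0
for chunk in pd.read_csv('../data/train.csv', parse_dates=['date'],
                         chunksize=1_000_000):
    rows += len(chunk)  # replace with real per-chunk work
print("total rows: %d" % rows)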

Yup. I was just trying to show that it's not a memory problem by loading it all in at once, meaning pandas should be able to handle it with that flag and not crash.

To understand why pandas has some overhead, you may find some of the comments here interesting: http://wesmckinney.com/blog/apache-arrow-pandas-internals/ . It's from the author of pandas and Feather, and explains how things will be changing in the future.

(None of this need discourage anyone from using pandas or python 3 in practice, since there are many ways to handle large datasets effectively, and we’ll be studying some of them in this course. However we won’t be able to get into a lot of detail, since it’s a big field - there’s a whole Stanford course dedicated to it: https://lagunita.stanford.edu/courses/course-v1:ComputerScience+MMDS+SelfPaced/about . Highly recommended!)


More data: Python 2 and 3 both crash without the low-memory arg, and both take about 2 min to load the full data set. Note I still get a warning despite giving dtype and low_memory:

sys:1: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125497040 entries, 0 to 125497039
Data columns (total 6 columns):
id             int32
date           datetime64[ns]
store_nbr      int16
item_nbr       int32
unit_sales     float32
onpromotion    object
dtypes: datetime64[ns](1), float32(1), int16(1), int32(2), object(1)
memory usage: 3.5+ GB
None
types = {'id':'int32', 'data':'datetime64', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}
sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True, # supposed to make it faster sometimes
                   low_memory=lowmem # fails on Mac OS Sierra due to readline bug?
                    # ,nrows=500000 #Actually it's easier to do head from command-line
                   )

Any ideas why I get the same warning?

With dtype set you shouldn’t set low_memory. It’s hard to see what’s going on since you’ve just pasted a fragment of your code above, but your dtype dictionary isn’t working right. Try using the code from https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage . If that doesn’t work for you, then we’ll need to look at whether something is going odd on your Mac…

That is all the code minus the imports and setting lowmem; I'm including the full script next. I got the types and other argument ideas from that link and similar ones on Kaggle.

This fails when run as python3 loadtrain.py:

import pandas as pd
import sys

lowmem = False
if len(sys.argv)>1 and sys.argv[1]=='-lowmem':
    lowmem = True

types = {'id':'int32', 'data':'datetime64', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True # supposed to make it faster sometimes
                    ,low_memory=lowmem # fails if false on Mac OS Sierra due to readline bug?
                    # ,nrows=500000 #Actually it's easier to do head from command-line
                   )
print(sales.info())

With low_memory=False it always dies a miserable death. Changing onpromotion to object doesn’t change anything.

The problem I had was that I misspelled date as data in types. When I spelled it correctly, I got a warning not to pass the date format via dtype and to use parse_dates instead, which I was already doing. I still get a warning with this code:

import pandas as pd
import sys

lowmem = False
if len(sys.argv)>1 and sys.argv[1]=='-lowmem':
    lowmem = True

types = {'id':'int32', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True,
                    low_memory=lowmem
                   )
print(sales.info())

Using the same args and these types from that webpage:

# From https://www.kaggle.com/heyt0ny/read-data-for-low-memory-usage
types = {
    'id': 'int64',
    'item_nbr': 'int32',
    'store_nbr': 'int8',
    'unit_sales': 'float32',
    'onpromotion': bool,
}

I still get the warning.

In summary, that message appears no matter what I do. Here's another variation that doesn't specify the low_memory argument:

types = {'id':'int32', 'store_nbr':'int16', 'item_nbr':'int32',
         'unit_sales':'float32', 'onpromotion':'bool'}

sales = pd.read_csv('../data/train.csv',
                    dtype=types,
                    parse_dates=['date'],
                    infer_datetime_format=True
                   )

I get

sys:1: DtypeWarning: Columns (5) have mixed types. Specify dtype option on import or set low_memory=False.

Whatever, it loads and I can move on to the next thing that dies for memory reasons. haha

Oh right you have lowmem=False. That’s not what you want when manually setting dtypes. Set it to True, and it’ll all work nicely.

Yeah, it has been working with low_memory=True, but I still get the warning (not the malloc/free error). No biggie.
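
For anyone who hits the same DtypeWarning later: a quick way to see what the mixed types in the flagged column actually are after the load. This is a diagnostic sketch, not from this thread, assuming the sales frame from the snippet above:

# Inspect which Python types ended up in the column the warning points at
# (index 5, onpromotion here); missing values typically show up as float NaN
# mixed in with the real values.
print(sales.iloc[:, 5].map(type).value_counts())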