60 GB of Text to classify

I’m trying to tokenize 60 GB of files.
Do you have any samples of batch tokenization?
I’m trying to build the corpus and dictionary for those files.

Have a look at the IMDb notebook from part 2 and the notes.

Chunking the data and using multiple cores could help reduce the required time.

I’m about to try tokenizing a smaller file (<5 GB) myself, so we can exchange some good approaches here! :slight_smile:

Best regards
Michael

Michael,
Thanks for the reply.
I’m doing that already.
The issue here is loading the 60 GB corpus, which consists of 120K files.
I need to load that corpus to create the tokens and do the conversion to training data.
That’s just the corpus.
After that I will need to do the other calculations.

There should be a way to have a RAM/disk-backed object (like a database) that still has the numpy/pandas capabilities and accessibility.

Have you thought about writing a custom tokenizer class that lazily reads each file from the disk as it’s needed?
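Something along these lines, just as a rough sketch (the class name, directory layout, and the plain whitespace tokenizer are only placeholders; swap in whatever tokenization you actually use):

```python
from collections import Counter
from pathlib import Path

class LazyCorpusTokenizer:
    """Yields tokenized lines one file at a time, so the full 60 GB
    never has to sit in RAM at once."""

    def __init__(self, corpus_dir, tokenize_fn=str.split):
        self.files = sorted(Path(corpus_dir).glob("**/*.txt"))
        self.tokenize_fn = tokenize_fn  # swap in spacy/fastai tokenization here

    def __iter__(self):
        for path in self.files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    yield self.tokenize_fn(line)

# e.g. building a vocabulary without ever loading the whole corpus:
counts = Counter()
for tokens in LazyCorpusTokenizer("data/corpus"):
    counts.update(tokens)
vocab = [tok for tok, _ in counts.most_common(60_000)]
```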

If you want to use pandas specifically, it has a chunksize argument in the pd.read_* functions, which turns the call into an iterator that pulls in one chunk at a time.
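For example (file and column names are just assumptions about your data):

```python
import pandas as pd

# chunksize turns read_csv into an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv("corpus.csv", chunksize=100_000):
    texts = chunk["text"].tolist()
    # tokenize/process this chunk here and write the result to disk,
    # so it can be garbage-collected before the next chunk arrives
```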


If you haven’t seen dask, it mimics the numpy and pandas APIs and is meant for this situation. I haven’t tried it yet, though.
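Roughly like this, if the files can be read as CSVs (paths and column names are made up):

```python
import dask.dataframe as dd

# One logical dataframe backed by many files on disk; operations are
# lazy and only run chunk-by-chunk when .compute() is called.
ddf = dd.read_csv("data/corpus/*.csv")
mean_len = ddf["text"].str.len().mean().compute()
print(mean_len)
```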


Just a caveat: the Dask folks themselves point out that it’s not a full replacement for pandas, as it doesn’t have all of the same capabilities.


Dear Gerardo,

I guess you can only try what has been mentioned so far:

  • pandas with chunking and multiple CPUs (this is already implemented in the IMDb notebook; see the sketch after this list)
  • dask
  • get an online machine capable of handling the huge data files with the above options (i.e. with a lot of RAM).
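For the first option, something like this rough sketch of chunked reading plus a worker pool (file name, column, worker count, and the whitespace tokenizer are all placeholders):

```python
import multiprocessing as mp
from collections import Counter

import pandas as pd

def tokenize_chunk(chunk):
    # stand-in tokenizer; replace with spacy/fastai tokenization
    return Counter(tok for text in chunk["text"] for tok in text.split())

if __name__ == "__main__":
    reader = pd.read_csv("corpus.csv", chunksize=50_000)
    counts = Counter()
    with mp.Pool(processes=4) as pool:
        # each worker tokenizes one chunk while the others keep working
        for chunk_counts in pool.imap(tokenize_chunk, reader):
            counts.update(chunk_counts)
    print(counts.most_common(20))
```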

Here are some links I was going through when I had a similar problem:



(However, in the end, properly setting up the pandas data types did the trick for me on that project.)
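For reference, the dtype trick looks roughly like this (the column names and types are invented here; the point is that explicit, narrower dtypes shrink the in-memory size):

```python
import pandas as pd

# Explicit dtypes (e.g. category instead of object, float32 instead of
# float64) can cut the in-memory size of a dataframe considerably.
dtypes = {"label": "category", "score": "float32"}
df = pd.read_csv("corpus.csv", dtype=dtypes)
print(df.memory_usage(deep=True))
```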

Since you have 120k files totaling 60 GB, you could try using dask to load them into one dask dataframe, save it, and then load it again with pandas, including chunking and multiple CPUs (see the sketch below).
I guess it will be tricky if you need to apply functions that depend on the entire data set.
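That pipeline could look something like this sketch (all paths, glob patterns, and the CSV assumption are guesses about your data):

```python
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

# 1) Combine the 120k small files into one logical dask dataframe.
ddf = dd.read_csv("data/corpus/*.csv")

# 2) Write it back out as a manageable number of larger part files.
ddf.to_csv("data/combined/part-*.csv", index=False)

# 3) Reload with plain pandas, one part file (and chunk) at a time.
for part in sorted(Path("data/combined").glob("part-*.csv")):
    for chunk in pd.read_csv(part, chunksize=100_000):
        pass  # tokenize this chunk (possibly in a worker process) and save it
```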

I’m currently trying to figure out how to get my 3 GB data set through a GPU with 8 GB of RAM, as I’m getting a CUDA out-of-memory error.
You will very likely run into the same problem with your 60 GB corpus.
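The usual workaround is to never move the whole data set to the GPU at all and only transfer one batch at a time, then shrink the batch size if memory still runs out. A toy sketch in plain PyTorch (random tensors stand in for the real tokenized data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the tokenized corpus: 100k sequences of length 70.
data = TensorDataset(torch.randint(0, 60_000, (100_000, 70)),
                     torch.randint(0, 2, (100_000,)))

# Only one small batch lives on the GPU at any time; the batch size is
# the first thing to reduce when CUDA runs out of memory.
loader = DataLoader(data, batch_size=32, shuffle=True)
for xb, yb in loader:
    xb, yb = xb.cuda(), yb.cuda()
    # forward/backward pass goes here
    break
```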

Best regards
Michael
