Have you thought about writing a custom tokenizer class that lazily reads each file from the disk as it’s needed?
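A rough sketch of what that could look like (names and the whitespace "tokenization" here are just placeholders, not a real tokenizer):

```python
from pathlib import Path

class LazyTokenizer:
    """Iterates over a corpus one file at a time.

    Nothing touches the disk until iteration starts, so only the
    current file's contents are ever held in memory.
    """

    def __init__(self, paths):
        self.paths = [Path(p) for p in paths]

    def __iter__(self):
        for path in self.paths:
            # Read only when this file's turn comes up.
            text = path.read_text()
            # Placeholder: split on whitespace; swap in your real tokenizer.
            yield text.split()
```

Because `__iter__` is a generator, you can feed the instance straight into a training loop and each file is read and tokenized just in time.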
If you want to use pandas specifically, the pd.read_* functions accept a chunksize argument, which makes them return an iterator that yields one chunk at a time instead of loading the whole file into memory.
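For example, with pd.read_csv (using an in-memory StringIO here to stand in for a large file on disk):

```python
import io
import pandas as pd

# Stand-in for a big CSV on disk; you'd pass a filename in practice.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n7,8\n")

# chunksize=2 returns an iterator of DataFrames, two rows each,
# so only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["a"].sum()

print(total)  # -> 16 (1 + 3 + 5 + 7)
```

Each chunk is an ordinary DataFrame, so any per-chunk aggregation or filtering works as usual; you just combine the partial results at the end.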