I have a large amount of data spread across multiple text files. I run the spaCy tokenizer over this data and use pickle.dump to write the resulting list of tokens (it is a single list that I'm dumping, essentially) to a file.
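For reference, a minimal sketch of what I'm doing now (the file contents are placeholders, and I've swapped the spaCy tokenizer for a simple split just to keep the snippet self-contained):

```python
import pickle

def tokenize(text):
    # stand-in for the spaCy tokenizer I actually use
    return text.split()

tokens = []
for text in ["first file contents", "second file contents"]:
    tokens.extend(tokenize(text))

# one big dump of the whole token list
with open("tokens.pkl", "wb") as f:
    pickle.dump(tokens, f)
```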
After this, my code has to read the dumped data back, split it into batches, and dump those batches into bs separate files. How can I serially load the list of tokens with pickle so I can batch them without loading the whole list into memory at once? In other words, I'd like pickle to give me chunks of my list rather than the entire thing.
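Something like the following is what I have in mind: dumping the list in chunks up front (pickle allows several dump calls into the same file), then reading them back one chunk at a time with repeated pickle.load calls until EOFError. The chunk size and data here are just toy values. Is this a reasonable approach, or is there a better way?

```python
import pickle

tokens = list(range(10))  # toy stand-in for my token list
chunk_size = 4

# write one pickle record per chunk into the same file
with open("tokens_chunked.pkl", "wb") as f:
    for i in range(0, len(tokens), chunk_size):
        pickle.dump(tokens[i:i + chunk_size], f)

# read the chunks back one at a time; only one chunk
# is ever held in memory here
loaded = []
with open("tokens_chunked.pkl", "rb") as f:
    while True:
        try:
            chunk = pickle.load(f)  # reads exactly one dumped chunk
        except EOFError:
            break
        loaded.append(chunk)
```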
Thanks a lot