NLP - recommended data storage format to store large tokenized text

Assuming you are using fast.ai, the TextDataBunch will contain your preprocessed (tokenized and numericalized) text. You can then persist it with the supplied save method and reload it with the load_data function, which stores the data in a compact binary format rather than re-running tokenization each time.

You may need to experiment with the different factory methods for creating the TextDataBunch, although from_folder sounds like the most relevant for your case. A chunksize parameter can also be passed so that the Tokenizer and Numericalizer (the processors) work through a large corpus in batches instead of loading it all into memory at once.
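If you are not on fast.ai, the underlying idea still applies: numericalize the tokens against a vocabulary and store integer ids in a binary format, which is much smaller and faster to reload than raw strings. Here is a minimal stdlib-only sketch of that approach (the vocabulary, documents, and file name are made up for illustration; fast.ai's save/load_data handle this, plus the vocab bookkeeping, for you):

```python
import pickle

# Hypothetical vocabulary and tokenized corpus, standing in for the
# output of a tokenizer; index 0 is reserved for unknown tokens.
vocab = ["xxunk", "the", "cat", "sat"]
stoi = {tok: i for i, tok in enumerate(vocab)}

docs = [["the", "cat", "sat"], ["the", "cat"]]

# Numericalize: store integer ids instead of strings -- far more compact.
ids = [[stoi.get(tok, 0) for tok in doc] for doc in docs]

# Persist vocab + ids together so the corpus can be fully reconstructed.
with open("corpus.pkl", "wb") as f:
    pickle.dump({"vocab": vocab, "ids": ids}, f)

with open("corpus.pkl", "rb") as f:
    data = pickle.load(f)

# Round-trip back to tokens to verify nothing was lost.
restored = [[data["vocab"][i] for i in doc] for doc in data["ids"]]
assert restored == docs
```

For genuinely large corpora you would write the ids in chunks (mirroring the chunksize idea above) rather than building one giant list in memory.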
