I am currently working with the Amazon reviews dataset (https://www.kaggle.com/bittlingmayer/amazonreviews). The training file is about 1.5 GB, and spaCy throws a string-size error when I try to tokenize it in one go. To work around that, I split it into one file per review and ended up with around 3 million files. As expected, reading and tokenizing all of those files takes more than an hour, and I eventually killed the kernel because I couldn't wait on it any longer.
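In case it helps describe what I mean: since each review in that dataset sits on its own line, one alternative to 3 million separate files might be to stream the original file lazily and group lines into batches. A minimal sketch, assuming newline-separated reviews (the function name and batch size are just illustrative):

```python
from itertools import islice

def batched_lines(path, batch_size=1000):
    """Yield lists of up to batch_size lines, without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                return
            yield [line.rstrip("\n") for line in batch]
```

Each yielded batch could then be tokenized and discarded before reading the next one, so memory stays bounded regardless of file size.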
I am wondering if there is a batch process for tokenizing the data, which would also let me see how many files have been tokenized so far. I thought of iterating over the reviews in chunks, but I'm not sure how to feed them to the LanguageModel dataset in parts. Any suggestions? If no such thing exists currently, would it be a good addition to the fastai library?
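For what it's worth, spaCy itself can tokenize a stream in batches via `nlp.pipe`, which makes it straightforward to print a running count. A rough sketch, assuming a blank English tokenizer (no tagger/parser, so it stays fast) and an iterable of review strings; the function name and reporting interval are my own, not part of any library:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; no pipeline components, so it stays fast

def tokenize_stream(texts, batch_size=1000, report_every=10000):
    """Tokenize an iterable of strings in batches, reporting progress."""
    tokens = []
    for i, doc in enumerate(nlp.pipe(texts, batch_size=batch_size), start=1):
        tokens.append([t.text for t in doc])
        if i % report_every == 0:
            print(f"tokenized {i} reviews")
    return tokens
```

The remaining question would be how to hand these tokenized chunks to the LanguageModel dataset incrementally, which is the part I don't see an obvious hook for.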