How to tokenize large strings?


I am currently working with the Amazon reviews dataset, which has a training file of 1.5GB. spaCy throws a string-size error when I try to tokenize the whole file at once, so I broke it into independent files, one per review, and ended up with around 3 million files. As expected, reading and tokenizing those files took more than an hour; I finally killed the kernel because I couldn't wait on it any longer.

I am wondering if there is a batch process for tokenizing the data, which would also let me see how many files have been tokenized so far. I thought of iterating over the reviews in chunks, but I'm not sure how to feed them to the LanguageModel Dataset in parts. Any suggestions? If no such thing exists currently, would it be a good addition to the fastai library?


That sounds like a very interesting problem. I'm aware that the current library isn't well suited to huge datasets, but I haven't needed to tackle one myself yet, so I haven't written anything.

So yes if you can make it work on that dataset, that would be a nice addition!


Thanks @jeremy! I shall look into it.