How to tokenize large strings?


I am currently working with the Amazon reviews dataset, which has a training file of 1.5GB. spaCy throws a string-size error when I try to tokenize the whole file at once, so I broke it into independent files, one per review, and ended up with around 3 million files. As expected, reading and tokenizing those files took more than an hour; I finally killed the kernel because I couldn't wait on it any longer.

I am wondering if there is a batch process for tokenizing the data, which would also let me see how many files have been tokenized so far. I thought of iterating over the reviews in chunks, but I'm not sure how to feed them to the LanguageModel Dataset in parts. Any suggestions? If no such thing exists currently, would it be a good addition to the fastai library?


That sounds like a very interesting problem. I'm aware that the current library isn't well suited to huge datasets, but I haven't needed to tackle one myself yet, so I haven't written anything.

So yes if you can make it work on that dataset, that would be a nice addition!


Thanks @jeremy! I shall look into it.