Adding SentencePieceTokenizer to fastai.text.data

I am also using sentencepiece to build a language model, but I took a different approach: I built my own custom tokenizer class, and then use the sentencepiece vocabulary to create the `Vocab()` object as needed.

I believe sentencepiece is meant to be applied to the raw data (or data as close to raw as you can get) rather than to already-tokenized text, so I applied it to the raw data for the Wiki103 use case. Can anyone confirm that this is correct?

I wrote this short gist to show how I am doing it end-to-end. The resulting DataBunch can then be fed into a language model and used in the canonical ways. Note that I don't handle post_rules in this example, though I think that would be easy to add.

I agree that having a method to easily switch between spaCy and SentencePiece (or other tokenizers) would be great. Working on the PR is probably the best way to get there, but in the meantime I hope this is helpful for anyone who wants to try it out now.