Adding SentencePieceTokenizer to fastai.text.data


(Julian Eisenschlos) #1

Hi everybody,

I am working on a dataset where I think I would benefit from SentencePiece tokenization, as happened in the PolEval challenge and many other problems. I have also seen the approach recommended multiple times in the course and on the forum.

Using the Python bindings for the library, I attempted an implementation that stays as consistent as possible with the current tokenization strategies, but I needed to create both a Tokenizer and a Vocab object simultaneously. This is what I got so far.

I am using it to train and fine-tune my language model and it seems to be working correctly. Any feedback on the implementation? Do you think it’s a valuable addition? If that’s the case I can add a couple of tests and send the PR.

All feedback is welcome. For instance, I had doubts about considering xxfld a special token or allowing SentencePiece to split it at will.


SentencePiece + ULMFit
(Jeremy Howard (Admin)) #2

Sounds like a great project! I’d suggest holding off on a PR for 5 weeks, since the course is running and we’re changing a lot. If you can keep your notebook up to date with the library, then we can look to merge the ideas once we have a little time.


(Fred Guth) #3

Does this replace spaCy?


(Julian Eisenschlos) #4

Correct, the idea was to create a plug & play replacement for the spaCy tokenizer.


(Fred Guth) #5

My only complaint about spaCy is that it takes around 1 GB of space and is, by far, the largest dependency in fastai. Because of spaCy it is impossible to automatically add fastai models to AWS Lambda. Do you know how large SentencePiece is?


(Kaspar Lund) #6

Hi, I looked at and worked with your PR, and I wonder how the fastai custom tokens are handled:
TK_MAJ,TK_UP,TK_REP,TK_WREP = 'xxmaj','xxup','xxrep','xxwrep'

Shouldn’t we pass the argument --user_defined_symbols=[TK_MAJ,TK_UP,TK_REP,TK_WREP] (formatted appropriately) so that sentencepiece doesn’t try to tokenize them? See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc

Other comments

  1. You set BOS = -1. Shouldn’t it be --bos_id=1?
  2. Do you know what FLD = 'xxfld' is used for?
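
A minimal sketch of how the `--user_defined_symbols` suggestion above could be passed to the SentencePiece trainer through the Python bindings. The helper names `build_train_args` and `train`, the corpus file name `corpus.txt`, the model prefix `m`, and the vocabulary size of 8000 are all assumptions made for illustration; only the flag names come from SentencePiece itself.

```python
from typing import Sequence

# fastai's marker tokens, as quoted in the post above.
TK_MAJ, TK_UP, TK_REP, TK_WREP = "xxmaj", "xxup", "xxrep", "xxwrep"


def build_train_args(corpus: str, model_prefix: str, vocab_size: int,
                     symbols: Sequence[str]) -> str:
    """Assemble a SentencePiece training flag string (hypothetical helper)."""
    for s in symbols:
        # The flag takes a comma-separated list, so symbols cannot contain
        # commas or spaces without breaking the argument string.
        if "," in s or " " in s:
            raise ValueError(f"symbol not usable in a flag string: {s!r}")
    return (f"--input={corpus} --model_prefix={model_prefix} "
            f"--vocab_size={vocab_size} "
            f"--user_defined_symbols={','.join(symbols)}")


def train(args: str) -> None:
    """Run the trainer; imported lazily so the helper above works without it."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.Train(args)


# File name, prefix and vocab size are placeholders, not values from the thread.
args = build_train_args("corpus.txt", "m", 8000,
                        [TK_MAJ, TK_UP, TK_REP, TK_WREP])
print(args)
```

Calling `train(args)` would then write `m.model` and `m.vocab`, and the listed symbols should each be kept as a single piece rather than being split. Whether that actually helps downstream is the open question raised in this thread.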

(Julian Eisenschlos) #7

Hi @Kaspar, right now I am not giving a special value to those tokens; I am letting sentencepiece use them as they are or split them. I expect it to have little impact, but if you get evidence that adding that flag is better, I’d love to know. It should be easy to add that as a configurable param.

About the comments: I don’t think fastai uses any explicit BOS token (last I checked); 0 is UNK and 1 is PAD. It does, however, use xxfld <COLUMN_NUMBER> at the beginning of each column from your tabular data. I took the same approach there, allowing sentencepiece to split it as convenient. I observed in my tests that it split it in two: xx + fld.
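
To make the id layout described above concrete, here is a rough sketch of the trainer flags that would mirror it: UNK at 0, PAD at 1, and BOS/EOS disabled by setting them to -1. The flag names are SentencePiece's own; the `id_flags` helper and the choice to disable EOS as well are assumptions for illustration, not something confirmed in this thread.

```python
from typing import Dict

# Assumed layout, following the post above: 0 = UNK, 1 = PAD, no BOS.
# Disabling EOS too is an extra assumption made for this sketch.
SPECIAL_IDS: Dict[str, int] = {"unk_id": 0, "pad_id": 1, "bos_id": -1, "eos_id": -1}


def id_flags(ids: Dict[str, int]) -> str:
    """Turn an id layout into trainer flags (hypothetical helper)."""
    enabled = [v for v in ids.values() if v >= 0]
    # Two enabled special tokens cannot share the same id.
    if len(enabled) != len(set(enabled)):
        raise ValueError("special token ids must be unique")
    return " ".join(f"--{name}={value}" for name, value in ids.items())


flags = id_flags(SPECIAL_IDS)
print(flags)
```

These flags would be appended to the rest of the training arguments. A value of -1 tells the trainer not to reserve that token at all.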

Once the fastai text.transform and text.data APIs stabilize, I can work on the PR to make this readily accessible. I have already spotted some changes, with rules becoming pre and post rules. Post rules of course won’t make sense in this setup.


(Kaspar Lund) #8

OK, I have created this issue at sentencepiece in order to make the integration easier:

Until then I have modified your tokenizer member function to do the conversion in fastai:

import sentencepiece as spm
from fastai import text
from fastai.text import *  # assuming fastai v1: BaseTokenizer, PathOrStr, Path, List, Collection

class SentencepieceTokenizer(BaseTokenizer):
    def __init__(self, path:PathOrStr, cache_name:str='tmp'):
        self.tok = spm.SentencePieceProcessor()
        self.tok.Load(str(Path(path) / cache_name / 'm.model'))
        self.vocab_ = SentencepieceTokenizer.loadvocab_(path, cache_name)

    def tokenizer(self, t:str) -> List[str]:
        # Get the pieces and replace SentencePiece's unk with fastai's unk.
        return [text.transform.UNK if piece == "<unk>" else piece for piece in self.tok.EncodeAsPieces(t)]

    def add_special_cases(self, toks:Collection[str]):
        # This should have been done when training SentencePiece.
        pass

I hope this goes in the right direction. Will report back later.


(Bobak Farzin) #9

I am also trying to use sentencepiece to build a language model. I took a different approach: I built my own custom tokenizer class and then, using the sentencepiece vocab, I create the Vocab() object as needed.

I believe that sentencepiece is to be used on the raw data (or as raw as you can get it) rather than on tokenized data. So I have applied it to the raw data for the Wiki103 use case. Can anyone confirm that is true?

I wrote this short gist to try to show how I am doing it end-to-end. The DataBunch can then be fed into a language model and used in the canonical ways. Notably, I don’t handle post_rules in this example (but I think that could be easy to add).

I agree, having a method to easily switch between spaCy and SentencePiece (or other tokenizers) would be great. Maybe working on the PR is the best way to do that, but for those who want to try it out now, I hope this is helpful.
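
As a rough illustration of the "use the sentencepiece vocab to create the Vocab() object" step described above (not the code from the gist), the sketch below reads the `.vocab` file that SentencePiece writes next to the model, where each line is a piece and its score separated by a tab. The helper name `load_sp_pieces` and the tiny handwritten demo file are assumptions.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def load_sp_pieces(vocab_path: Union[str, Path]) -> List[str]:
    """Read pieces, in id order, from a SentencePiece .vocab file (hypothetical helper)."""
    pieces = []
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line is "<piece>\t<score>"; only the piece is needed here.
            pieces.append(line.split("\t")[0])
    return pieces


# Demo on a tiny handwritten file standing in for a real m.vocab.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "m.vocab"
    demo.write_text("<unk>\t0\n▁the\t-3.2\n▁token\t-7.5\n", encoding="utf-8")
    itos = load_sp_pieces(demo)

print(itos)
```

The resulting list is in SentencePiece id order, so it could serve as the `itos` behind a fastai Vocab. How the Vocab is constructed from it depends on the fastai version in use.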


(Kaspar Lund) #10

Yes! If by raw you mean using the default unigram model in sentencepiece and feeding it the wiki with no pretokenization.

For me the difficult part has been to grasp how to integrate the control and user-defined tokens with sentencepiece’s unigram model. I think I have got it right now, but I still need to verify the integration by checking performance on IMDB classification. Have a look here: https://github.com/kasparlund/nlp/blob/ce69c7c5912cbc16107c430db2bf4880b0e9ac0b/fastai_sentencepiece.py#L92


(Bobak Farzin) #11

Your code looks reasonable to me. I would think you can just proceed with trying it out on a small amount of data and confirm that you get what you expect from the SP tokenizer.

Another ‘gotcha’ I discovered today is that the default .load() method will not load your custom tokenizer as the processor. So, if you save and then load, you will not be able to predict() unless you specify your processor. Have a look at the code below, I think that will get you on your way once you are at that point.

from fastai.text.data import _get_processor
my_processor = _get_processor(mycust_tok,sp_vocab)
data = TextLMDataBunch.load(path=PATH,cache_name='sp_tokenizer',processor=my_processor)