Adding SentencePieceTokenizer to fastai.text.data


(Julian Eisenschlos) #1

Hi everybody,

I am working on a dataset where I think I would benefit from SentencePiece tokenization, as it helped in the PolEval challenge and many other problems. I have also seen the approach recommended multiple times in the course and on the forum.

Using the Python bindings for the library, I attempted an implementation that stays as consistent as possible with the current tokenization strategies, but I needed to create both a Tokenizer and a Vocab object simultaneously. This is what I have so far.
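
For context, here is a rough sketch of the general wiring (placeholders, not the exact code in my notebook): train a SentencePiece model on the already pre-processed texts, then build a fastai Vocab from its piece table so that numericalization stays consistent on both sides.

import sentencepiece as spm
from fastai.text import Vocab

# Train the subword model; 'texts.txt', 'm' and the vocab size are placeholders.
# The id flags mirror fastai's convention (0 = UNK, 1 = PAD) and disable BOS/EOS.
spm.SentencePieceTrainer.Train(
    '--input=texts.txt --model_prefix=m --vocab_size=30000 '
    '--unk_id=0 --pad_id=1 --bos_id=-1 --eos_id=-1')

# Load the trained model and build a fastai Vocab from its piece table,
# so the ids used by SentencePiece and by fastai stay in sync.
sp = spm.SentencePieceProcessor()
sp.Load('m.model')
itos = [sp.IdToPiece(i) for i in range(sp.GetPieceSize())]
vocab = Vocab(itos)

# Tokenize with SentencePiece, numericalize with either library.
pieces = sp.EncodeAsPieces('xxfld 1 This is a sample sentence.')
ids = [sp.PieceToId(p) for p in pieces]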

I am using it to train and fine-tune my language model and it seems to be working correctly. Any feedback on the implementation? Do you think it’s a valuable addition? If that’s the case I can add a couple of tests and send the PR.

All feedback is welcome. For instance, I had doubts about whether to treat xxfld as a special token or to allow SentencePiece to split it at will.


SentencePiece + ULMFit
(Jeremy Howard (Admin)) #2

Sounds like a great project! I’d suggest holding off on a PR for 5 weeks, since the course is running and we’re changing a lot. If you can keep your notebook up to date with the library, then we can look to merge the ideas once we have a little time.


(Fred Guth) #3

Does this substitute spaCy?


(Julian Eisenschlos) #4

Correct, the idea was to create a plug-and-play replacement for the spaCy tokenizer.


(Fred Guth) #5

My only complaint about spaCy is that it takes around 1 GB of space and is, by far, the largest dependency in fastai. Because of spaCy it is impossible to automatically deploy fastai models to AWS Lambda. Do you know how large SentencePiece is?


(Kaspar Lund) #6

Hi, I looked at and worked with your PR and wonder how the fastai custom tokens are handled:
TK_MAJ, TK_UP, TK_REP, TK_WREP = 'xxmaj', 'xxup', 'xxrep', 'xxwrep'

Shouldn’t we pass the argument --user_defined_symbols=[TK_MAJ,TK_UP,TK_REP,TK_WREP] (formatted appropriately) so that SentencePiece doesn’t try to tokenize them? See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc (a rough sketch of such a training call follows the questions below).

Other comments

  1. You set BOS = -1. Shouldn’t it be --bos_id=1?
  2. Do you know what FLD = 'xxfld' is used for?
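
For reference, a hedged sketch of the kind of training call I have in mind (file names and sizes are placeholders, not values from the PR):

import sentencepiece as spm

TK_MAJ, TK_UP, TK_REP, TK_WREP = 'xxmaj', 'xxup', 'xxrep', 'xxwrep'

# --user_defined_symbols takes a comma-separated list; those pieces are then
# kept whole instead of being split into subwords.
spm.SentencePieceTrainer.Train(
    '--input=texts.txt --model_prefix=m --vocab_size=30000 '
    '--user_defined_symbols=' + ','.join([TK_MAJ, TK_UP, TK_REP, TK_WREP]))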

(Julian Eisenschlos) #7

Hi @Kaspar, right now I am not giving a special value to those tokens; I am letting SentencePiece use them as they are or split them. I expect this to have little impact, but if you find evidence that adding that flag is better, I’d love to know. It should be easy to add it as a configurable param.

About the comments: I don’t think fastai uses any explicit BOS token (last I checked); 0 is UNK and 1 is PAD.
It does use xxfld <COLUMN_NUMBER> at the beginning of each text column from your tabular data. I took the same approach there, allowing SentencePiece to split it as convenient; in my tests it split it into two pieces: xx + fld.
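
To illustrate (hedged; the exact pieces depend on the trained model):

sp = spm.SentencePieceProcessor()
sp.Load('m.model')  # placeholder path
print(sp.EncodeAsPieces('xxfld 1 some text'))
# in my runs this came out roughly as ['▁xx', 'fld', '▁1', '▁some', '▁text']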

Once the fastai text.transform and text.data APIs stabilize I can work on the PR to make this readily accessible. I already spotted some changes, with rules becoming pre- and post-rules; post-rules of course won’t make sense in this setup.


(Kaspar Lund) #8

OK, I have created this issue at sentencepiece in order to make the integration easier:

Until then, I have modified your tokenizer member function to do the conversion in fastai:
import sentencepiece as spm
from pathlib import Path
from typing import Collection, List
from fastai import text
from fastai.core import PathOrStr
from fastai.text.transform import BaseTokenizer

class SentencepieceTokenizer(BaseTokenizer):
    def __init__(self, path:PathOrStr, cache_name:str='tmp'):
        # load the trained SentencePiece model and the matching vocab
        self.tok = spm.SentencePieceProcessor()
        self.tok.Load(str(Path(path) / cache_name / 'm.model'))
        self.vocab_ = SentencepieceTokenizer.loadvocab_(path, cache_name)  # helper from your notebook

    def tokenizer(self, t:str) -> List[str]:
        # get the pieces and replace SentencePiece's <unk> with fastai's UNK token
        return [text.transform.UNK if piece == "<unk>" else piece
                for piece in self.tok.EncodeAsPieces(t)]

    def add_special_cases(self, toks:Collection[str]):
        # special cases should have been handled when training SentencePiece
        pass
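
A quick way to sanity-check it (the path is a placeholder and assumes tmp/m.model exists under it):

sp_tok = SentencepieceTokenizer('data/mydataset')
print(sp_tok.tokenizer('xxfld 1 Hello world'))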

Hope this goes in the right direction. Will report back later