I am working on a dataset where I think I would benefit from SentencePiece tokenization, as was the case in the PolEval challenge and many other problems. I have also seen the approach recommended multiple times in the course and on the forum.
Using the Python bindings for the library, I attempted an implementation, trying to keep it as consistent as possible with the current tokenization strategies; however, I needed to create both a Tokenizer and a Vocab object simultaneously. This is what I have so far.
I am using it to train and fine-tune my language model, and it seems to be working correctly. Any feedback on the implementation? Do you think it would be a valuable addition? If so, I can add a couple of tests and send a PR.
All feedback is welcome. For instance, I was unsure whether to treat xxfld as a special token or to let SentencePiece split it at will.