Can anyone recommend a library for WordPiece tokenization?

(Ariyan Hasan) #1

I want to build a vocabulary based on WordPieces instead of words. Can anyone explain the process of building a WordPiece vocabulary from a set of sentences, or recommend a library that can do this?
Thanks

(Daniel Armstrong) #2

You might want to try SentencePiece

(Daniel Armstrong) #3

Here is one example of creating WordPiece tokens:
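
To make the vocabulary-building step concrete, here is a toy sketch in plain Python. It assumes the pair-scoring rule commonly described for WordPiece (pair frequency divided by the product of the two piece frequencies) and marks non-initial pieces with `##`. The function name `train_wordpiece_vocab`, the lowercasing, and the whitespace splitting are my own assumptions for illustration; this is not the trainer that was used for BERT.

```python
from collections import Counter


def train_wordpiece_vocab(sentences, vocab_size):
    """Toy WordPiece-style trainer (illustrative only, not a reference implementation)."""
    # Assumption: lowercase and split on whitespace as a stand-in for real pre-tokenization.
    word_freqs = Counter(w for s in sentences for w in s.lower().split())
    # Start from characters; every non-initial piece carries the "##" prefix.
    splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}
    vocab = {"[UNK]"} | {p for pieces in splits.values() for p in pieces}

    while len(vocab) < vocab_size:
        piece_freqs, pair_freqs = Counter(), Counter()
        for word, pieces in splits.items():
            freq = word_freqs[word]
            for piece in pieces:
                piece_freqs[piece] += freq
            for pair in zip(pieces, pieces[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break  # every word is already a single piece

        # Score = freq(pair) / (freq(first) * freq(second)).
        # Iterating over sorted pairs makes tie-breaking deterministic.
        best = max(
            sorted(pair_freqs),
            key=lambda p: pair_freqs[p] / (piece_freqs[p[0]] * piece_freqs[p[1]]),
        )
        merged = best[0] + best[1][2:]  # drop the "##" of the second piece
        vocab.add(merged)

        # Apply the chosen merge to every word.
        for word, pieces in list(splits.items()):
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[word] = out
    return sorted(vocab)


vocab = train_wordpiece_vocab(["hug hug pug", "hugs bun"], vocab_size=12)
print(vocab)
```

For real corpora you would want a proper library rather than this loop, since it recounts all pairs on every merge and will be slow on anything large.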

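From what I can tell, SentencePiece produces subword units similar to the WordPiece tokens used to train BERT. Strictly speaking they are not the same thing: BERT uses WordPiece, while SentencePiece implements BPE and unigram language model segmentation.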
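
For reference, once a WordPiece vocabulary exists, BERT-style tokenizers apply it to each word with a greedy longest-match-first lookup. Below is a rough sketch of that idea in plain Python. The function name `wordpiece_tokenize` and the tiny hand-written vocabulary are assumptions made up for this example, not part of any library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece-style tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical vocabulary, written by hand for illustration.
vocab = {"[UNK]", "un", "##aff", "##able", "##a", "##ff"}
print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']
```

Note that a word is mapped to `[UNK]` as a whole if any part of it cannot be matched, which is why the base vocabulary normally includes every single character seen in training.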

Here is the SentencePiece paper, which you might find interesting:
https://www.aclweb.org/anthology/D18-2012
