A predictably random and flexible approach to using SentencePiece for your LM tasks

See my notebook for preparing wiki data for SentencePiece and LM here.

This notebook is heavily inspired by @Kaspar and his excellent work here. I’m hoping both to get feedback on the code and the reasoning behind what I’m doing, and to offer another approach folks can consider when preparing their data for LM tasks with SentencePiece.

Objective: Create a more standardized and predictably random way of preparing text data for SentencePiece training that, in addition to being fastai friendly, results in a vocab/training set that is comparable across models.

My thinking is this: How can you really compare results between two models if folks are using completely different techniques to structure the .txt files their SentencePiece vocab is trained on?

Notes & Assumptions:

  1. I’m assuming that any special tokens needed for training SentencePiece must exist in the .txt files. As such, I’ve updated the “spm_rules” to include rules for both pre- and post-tokenization. This allows us to insert, if desired, tokens like TK_MAJ and TK_UP (a sketch of what these rules might look like follows this list).

  2. I’ve also added the ability to include BOS, EOS, and FLD tokens in the .txt files as desired, since they can’t be inserted later.

  3. Added the ability to incorporate multiple columns (or JSON attributes in the case of the wiki files) into the resulting .txt files. This could be really useful for other datasets where several columns contain textual data that you want SP to learn from.

  4. Included both the raw and tokenized data for each column in the big .csv file.

  5. By including token counts for each text column you want to train against in the .csv file, users can use that data however they wish when preparing their training/validation data for LM training (the split sketch after this list uses these counts to filter out short articles).

  6. Notice also that I sort the wiki files before randomization, and that there is an optional random seed. This is done so that the files are processed in a predictable way (sketched after this list).

  7. In the .csv I include the .txt file that every article landed in. This way, if you want to create a random training/validation split and train your SPM only on the training set, you can use this information to pull the right .txt files for SP and eliminate potential data leakage (see the split sketch after this list).

  8. Made the Tokenizer/Detokenizer optional. Created a base WikiTokenizer class you can subclass to use Moses, spaCy, or whatever you like. If a tokenizer is provided it is used; if not, the text is split on ’ ’ (a minimal sketch follows this list).
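
To make items 1 and 2 more concrete, here is a minimal sketch of the kind of pre/post-tokenization rules I have in mind. The token strings follow fastai-style naming, but the `spm_pre_rules`/`spm_post_rules` lists and the `apply_spm_rules` helper are illustrative stand-ins, not the notebook’s exact code:

```python
import re

# Token strings following fastai-style naming (assumed; the notebook may pull
# these constants from fastai's text.transform instead).
BOS, EOS, FLD = 'xxbos', 'xxeos', 'xxfld'
TK_MAJ, TK_UP = 'xxmaj', 'xxup'

def fixup_text(text):
    "Pre-tokenization rule: collapse whitespace (stand-in for any raw-text cleanup)."
    return re.sub(r'\s+', ' ', text).strip()

def replace_all_caps(tokens):
    "Post-tokenization rule: mark ALL-CAPS words with TK_UP and lowercase them."
    out = []
    for t in tokens:
        out += [TK_UP, t.lower()] if (t.isupper() and len(t) > 1) else [t]
    return out

def deal_caps(tokens):
    "Post-tokenization rule: mark Capitalized words with TK_MAJ and lowercase them."
    out = []
    for t in tokens:
        out += [TK_MAJ, t.lower()] if (t.istitle() and len(t) > 1) else [t]
    return out

# Hypothetical containers mirroring the "spm_rules" idea: rules applied to the
# raw string before tokenization, and rules applied to the token list afterwards.
spm_pre_rules  = [fixup_text]
spm_post_rules = [replace_all_caps, deal_caps]

def apply_spm_rules(text, tokenize=str.split):
    "Build one line of the SentencePiece .txt file, special tokens included."
    for rule in spm_pre_rules:  text = rule(text)
    tokens = tokenize(text)
    for rule in spm_post_rules: tokens = rule(tokens)
    return f'{BOS} {FLD} 1 ' + ' '.join(tokens)
```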
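
And here is roughly what I mean in item 6 by sorting before randomization; the function name, the `wiki_*` glob pattern, and the signature are my own placeholders for whatever the notebook actually does:

```python
import random
from pathlib import Path

def get_wiki_files(path, seed=None):
    "Return wiki extract files in a predictable order: sort first, then (optionally) shuffle with a fixed seed."
    files = sorted(Path(path).glob('**/wiki_*'))   # deterministic baseline ordering
    if seed is not None:
        random.Random(seed).shuffle(files)         # reproducible "random" ordering
    return files
```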
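
For items 5 and 7, something like the following is what I have in mind for a leakage-free split. The column names (`txt_file`, `text_num_tokens`), the file name, and the threshold are hypothetical stand-ins for whatever ends up in the big .csv:

```python
import pandas as pd

df = pd.read_csv('wiki_articles.csv')        # one row per article (hypothetical file name)
df = df[df['text_num_tokens'] >= 100]        # e.g. drop very short articles using the stored token counts

# Split at the .txt-file level rather than the article level, so no file that
# feeds SentencePiece training also contributes articles to the validation set.
txt_files = pd.Series(df['txt_file'].unique())
valid_set = set(txt_files.sample(frac=0.1, random_state=42))

train_df = df[~df['txt_file'].isin(valid_set)]
valid_df = df[df['txt_file'].isin(valid_set)]

# Only the training .txt files get handed to SentencePiece, e.g.:
# spm.SentencePieceTrainer.Train(f'--input={",".join(sorted(set(train_df["txt_file"])))} '
#                                f'--model_prefix=wiki_sp --vocab_size=30000')
```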
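
Finally, for item 8, a minimal sketch of the base tokenizer idea. The class and method names here are approximations of the interface, and the Moses subclass assumes the sacremoses package is installed:

```python
class BaseWikiTokenizer:
    "Minimal interface: subclass and override to plug in Moses, spaCy, etc."
    def tokenize(self, text):
        raise NotImplementedError
    def detokenize(self, tokens):
        raise NotImplementedError

class MosesWikiTokenizer(BaseWikiTokenizer):
    "Example subclass wrapping sacremoses."
    def __init__(self, lang='en'):
        from sacremoses import MosesTokenizer, MosesDetokenizer
        self.tok, self.detok = MosesTokenizer(lang=lang), MosesDetokenizer(lang=lang)
    def tokenize(self, text):
        return self.tok.tokenize(text, escape=False)
    def detokenize(self, tokens):
        return self.detok.detokenize(tokens)

def tokenize_text(text, tokenizer=None):
    "Use the tokenizer if one was provided; otherwise fall back to splitting on ' '."
    return tokenizer.tokenize(text) if tokenizer is not None else text.split(' ')
```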

Questions:

  1. What incorrect assumptions am I making?

  2. What improvements can be made?

  3. Will this work for other languages (especially languages like Chinese, Japanese, etc…)?

Anyways, I’d love to get feedback and/or even collaborate on making this better if anyone is interested.
