I noticed that SentencePieceTokenizer has no property for passing `--input_sentence_size` to SentencePieceTrainer.Train().
Is there a reason for this? Allowing users to set it might help with the out-of-memory issues related to SentencePiece: https://github.com/google/sentencepiece/issues/341
Of course, a corpus that is too large might still be an issue down the road when training the model…
What do you think?