I noticed that SentencePieceTokenizer has no property for passing `--input_sentence_size` to SentencePieceTrainer.Train().
Is there a reason for this? Allowing users to set it might help with the out-of-memory issues related to SentencePiece: https://github.com/google/sentencepiece/issues/341
Of course, a corpus that is too large might still be an issue down the road when training the model…
What do you think?