I am trying to train a language model from scratch.
I managed to train a tokenizer and save the tmp folder (with the model and vocab files) to my Google Drive, but I was logged out of Colab before I could save the dataloaders.
To use the pretrained tokenizer instead of retraining it, would I just do something like one of these?
```python
TextBlock.from_folder(path, vocab='spm.vocab', is_lm=True, tok=None)
# or
TextBlock.from_folder(path, vocab=None, tok=SubwordTokenizer())
```