I am trying to train a language model from scratch.
I managed to train a tokenizer and save the tmp folder (with the model and vocab files) to my Google Drive, but I was logged out of Colab before I could save the dataloaders.
To use the pretrained tokenizer instead of retraining it, would I just do something like one of these?
```python
TextBlock.from_folder(path, vocab='spm.vocab', is_lm=True, tok=None)
# or
TextBlock.from_folder(path, vocab=None, tok=SubwordTokenizer())
```