No method to save DataLoaders

wjs20 · November 6, 2020, 1:44pm

I created a DataLoaders object for use in a language model, but I was logged out of Colab and lost it.

Is there any way of saving the DataLoaders object? It took a long time to create, so it would be nice if I could just load a saved version back in.

I saved the tmp folder with the .vocab and .model files in it to my Drive. How can I reinstantiate my dls with these files instead of retraining it?

Thanks

muellerzr · November 6, 2020, 3:06pm

Just do a torch.save(learn.dls, myfname) and learn.dls = torch.load(myfname).

wjs20 · November 6, 2020, 3:41pm

Would I have to run this step beforehand?

learn = language_model_learner(dls, AWD_LSTM, pretrained=False)

muellerzr · November 6, 2020, 3:57pm

This would be how I would go about it:

learn.export(myModel)
torch.save(learn.dls, 'myfname')

learn = load_learner(myModel)
learn.dls = torch.load('myfname')

wjs20 · November 6, 2020, 4:02pm

Ok, thanks. I’m trying to figure out the best way to go about this on Google Colab, where I periodically get logged out for inactivity! It would be nice to have milestones saved to my Google Drive along the way, so I can load them back in without retraining if I get logged out. How do you usually manage the process?
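To make the milestone idea concrete, here is a rough sketch of the pattern I have in mind: load a saved object if it exists, otherwise build it once and save it. This is only an illustration. The helper names save_milestone and load_or_build are made up, and it uses the standard-library pickle as a stand-in for the torch.save / torch.load calls suggested above. I have not checked that a real DataLoaders object round-trips this way.

```python
import pickle
from pathlib import Path


def save_milestone(obj, ckpt_dir, name):
    """Pickle obj to ckpt_dir/<name>.pkl (hypothetical helper).

    The data is written to a temporary file first and then renamed, so an
    interrupted session does not leave a half-written milestone behind.
    """
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"{name}.pkl"
    tmp_path = ckpt_dir / f"{name}.pkl.tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(obj, f)
    tmp_path.replace(final_path)
    return final_path


def load_or_build(ckpt_dir, name, build_fn):
    """Return the saved milestone if present, else build it and save it.

    build_fn is any zero-argument callable that does the slow work,
    for example creating the dataloaders.
    """
    path = Path(ckpt_dir) / f"{name}.pkl"
    if path.exists():
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    save_milestone(obj, ckpt_dir, name)
    return obj
```

On Colab, ckpt_dir would point at a folder inside the mounted Drive (the exact path depends on how Drive is mounted), so the slow step only runs the first time.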

muellerzr · November 6, 2020, 4:03pm

I don’t normally train language models in Colab. If I do, I keep it active in another tab (I also have Pro, so that helps some). For cases like that, where I don’t want to look at it all day long, I normally use Paperspace (or train on my own GPU).

wjs20 · November 6, 2020, 4:08pm

When I created my dataloaders object, the tokenizer created a tmp folder with .model and .vocab files in it, which I managed to move into my Google Drive. If I recreate my objects by running the notebook again, how do I get my dataloaders to use these files instead of retraining the tokenizer? Could I just do

prot_lm = DataBlock(blocks=TextBlock.from_df('Sequence', is_lm=True, tok='tmp/spm.vocab'),
                    get_x=ColReader(0),
                    splitter=RandomSplitter(0.1))

Instead of

prot_lm = DataBlock(blocks=TextBlock.from_df('Sequence', is_lm=True, tok=SentencePieceTokenizer()),
                    get_x=ColReader(0),
                    splitter=RandomSplitter(0.1))

Thanks!