What to do when training takes forever?

Hi. After doing the Part 1 course on the default GCP setup, I am now working on my own image classification projects. My problem is that training takes forever: about 60 minutes per epoch, which makes it really hard to experiment and have fun. I am sure many of you have reached this point. My question is: what should I do next? I have been taking a look at this, but it is not clear to me, for instance, where the learner is being saved after training. I would just need a gentler introduction.

Jeremy always recommends using only a small fraction of the dataset while you are experimenting. Try cutting down the number of samples; once you have played around a bit with the data, you can leave training running overnight on the full dataset.
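For example, if your images live in class-named folders, you could sample a fraction of the file list before building the DataBunch. A rough sketch, not from the thread: the 10% fraction, image size and batch size are arbitrary choices.

from fastai.vision import *
import random

path = Path('path/to/your/images')                          # expects path/class_name/image.jpg
fnames = get_image_files(path, recurse=True)
subset = random.sample(fnames, max(1, len(fnames) // 10))   # keep roughly 10% of the images
labels = [f.parent.name for f in subset]                    # label = parent folder name

data = ImageDataBunch.from_lists(path, subset, labels, valid_pct=0.2,
                                 ds_tfms=get_transforms(), size=224,
                                 bs=64).normalize(imagenet_stats)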

As for saving the model: by default the learner saves models into a models subfolder under the data path (the folder your training images live in). You can change this by setting a new path, for example learn.path = Path('path/to/folder'), and the learner will save and load from that path afterwards.
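A small runnable sketch of that with fastai v1; the dataset is just MNIST_SAMPLE for illustration and the target folder name is made up.

from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)                          # tiny sample dataset, just for illustration
data = ImageDataBunch.from_folder(path, bs=64).normalize(imagenet_stats)
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

learn.path = Path('/home/jupyter/my_models')                  # hypothetical folder to keep weights in
(learn.path/learn.model_dir).mkdir(parents=True, exist_ok=True)   # model_dir defaults to 'models'
learn.save('stage-1')                                         # writes /home/jupyter/my_models/models/stage-1.pth
learn.load('stage-1')                                         # loads from the same place later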

Thanks @eljas1. My problem is: even if I use an early stopping callback, how can I then switch off the GCP VM instance so that it does not keep charging me during the night while doing nothing?

Regarding your second point, this is Jeremy's code:

from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
import argparse
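# when launched with python -m torch.distributed.launch, each spawned process gets its own --local_rank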
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)
learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)

He is not calling learn.save() anywhere… is the model still being saved?

I run all my code in Google Colab, since they give you a free GPU and I wanted to skip the hassle of managing paid instances. Note that Colab instances reset after about 10 hours and their files are deleted, so anything you want to keep must be saved to Google Drive. So unfortunately I cannot give advice on paid instances.
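In practice that means mounting Drive and pointing the learner there. A small sketch: google.colab.drive.mount is the standard Colab call, the folder name is made up, and the commented learn lines assume you already have a fastai Learner.

from google.colab import drive
from pathlib import Path

drive.mount('/content/drive')                              # authorize access to your Google Drive

save_dir = Path('/content/drive/My Drive/fastai_models')   # hypothetical folder on Drive
save_dir.mkdir(parents=True, exist_ok=True)

# point an existing learner at it so learn.save() writes to Drive and survives the reset:
# learn.path = save_dir
# learn.save('colab_checkpoint')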

The models are not saved to disk automatically. You need to call learn.save('some_model_name') whenever you want to save it.
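For example, you could append a save to the end of Jeremy's script above and reload the weights in a later session; the model name here is made up, and data / wrn_22 are the same objects as in that script.

learn.save('wrn22_cifar')                          # writes path/models/wrn22_cifar.pth

# later, in a fresh session: rebuild the same learner, then load the weights back
learn = Learner(data, wrn_22(), metrics=accuracy)
learn.load('wrn22_cifar')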

I’ll further add that you can then find those saved weights at

path/'models'/'some_model_name.pth'

@dreambeats you mean after saving them, correct?

But as far as I understood, @eljas1, Colab has only one GPU (a Tesla K80)… so you cannot do parallel training, correct?

Yes, after saving them, as @eljas1 mentioned.

Seems right.
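With a single GPU you simply drop the distributed pieces from Jeremy's script above; everything else (data, model, hyperparameters) stays the same. A minimal single-GPU sketch:

from fastai.vision import *
from fastai.vision.models.wrn import wrn_22

path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)

# no argparse, init_process_group or to_distributed() needed on one GPU
learn = Learner(data, wrn_22(), metrics=accuracy)
learn.fit_one_cycle(10, 3e-3, wd=0.4, div_factor=10, pct_start=0.5)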