Keep training an already exported model

Hey! I’m training an image classifier with my own data on Google Colab (fastai version 2.3.0), and I have 2 main problems:

  1. I have too much data to upload to Google Drive all at once (I connect Colab to Google Drive), so I had to split it into 11 randomly distributed folders.
  2. Each epoch takes around 8 hours to run, so (even with Colab Pro) I can only do 1 epoch at a time before I get disconnected, and I need a way to save after that and load again to do the next epoch or the next folder.

I think it’s possible to resume an epoch even if the runtime is disconnected (Resume training with fit_one_cycle), but what I really need is a reliable way to train, save, load, and train all over again so I can iterate over my multiple folders.

Here is my code:

import os
from functools import partial
from fastai.vision.all import *

os.chdir('/content/drive/MyDrive/Data1')

path = '/content/drive/MyDrive/Data1/train'
xray = DataBlock(blocks=(ImageBlock, CategoryBlock),
                 get_items=get_image_files,
                 splitter=GrandparentSplitter(valid_name='val'),
                 get_y=parent_label)

path = '/content/drive/MyDrive/Data1'
categ = os.listdir(os.path.join(path, "train"))
print(categ)

path_anno = path + '/annotations'

pre_path_img = []
def path_helper():
    for category in categ:
        pre_path_img.append(str(path + '/' + category))
    path_img = ''
    for pre in pre_path_img:
        path_img = path_img + ', ' + str(pre)
    return path_img

path_img = path_helper()

dls = ImageDataLoaders.from_folder(path, train='train', test='test', valid='val', bs=8)
dls.train_ds.items[:3]
dls.valid.show_batch(max_n=8, nrows=2)

learn = cnn_learner(dls, resnet50, metrics=error_rate)
learn.fine_tune(1)
learn.export(fname='export1.pkl')

#THIS IS WHERE PROBLEMS BEGIN

learn.load("/content/drive/MyDrive/Data1/models/export1.pkl")
learn.fine_tune(1)
learn.export(fname='export2.pkl')

The first time I ran learn.fine_tune(1) it worked OK, and the model was exported as export1.pkl. However, when I try to load export1.pkl I get this error:

FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Data1/models/export1.pkl.pth'

From what I can tell, it expects the model to be saved as a .pth file.

Seeing this, I tried the same thing with an older project in which I trained everything in one go. The trained model was saved as "/content/drive/MyDrive/model1.pth", and I tried:

learn = cnn_learner(dls, resnet50, pretrained=True, metrics=error_rate)#.to_fp16()
learn.load("/content/drive/MyDrive/model1")

And I get the following error:
RuntimeError: Error(s) in loading state_dict for Sequential:
Missing key(s) in state_dict: “0.0.weight”, “0.1.weight”, “0.1.bias”, “0.1.running_mean”, “0.1.running_var”, “0.4.0.conv1.weight”, “0.4.0.bn1.weight”, “0.4.0.bn1.bias”, “0.4.0.bn1.running_mean”, “0.4.0.bn1.running_var”, “0.4.0.conv2.weight”, “0.4.0.bn2.weight”, “0.4.0.bn2.bias”, “0.4.0.bn2.running_mean”, “0.4.0.bn2.running_var”, “0.4.0.conv3.weight”, “0.4.0.bn3.weight”, “0.4.0.bn3.bias”, “0.4.0.bn3.running_mean”, “0.4.0.bn3.running_var”, “0.4.0.downsample.0.weight”, “0.4.0.downsample.1.weight”, “0.4.0.downsample.1.bias”, “0.4.0.downsample.1.running_mean”, “0.4.0.downsample.1.running_var”, “0.4.1.conv1.weight”, “0.4.1.bn1.weight”, “0.4.1.bn1.bias”, “0.4.1.bn1.running_mean”, “0.4.1.bn1.running_var”, “0.4.1.conv2.weight”, “0.4.1.bn2.weight”, “0.4.1.bn2.bias”, “0.4.1.bn2.running_mean”, “0.4.1.bn2.running_var”, “0.4.1.conv3.weight”, “0.4.1.bn3.weight”, “0.4.1.bn3.bias”, “0.4.1.bn3.running_mean”, “0.4.1.bn3.running_var”, “0.4.2.conv1.weight”, “0.4.2.bn1.weight”, “0.4.2.bn1.bias”, “0.4.2.bn1.running_mean”, “0.4.2.bn1.running_var”, “0.4.2.conv2.weight”, “0.4.2.bn2.weight”, “0.4.2.bn2.bias”, “0.4.2.bn2.running_mean”, “0.4.2.bn2.running_var”, “0.4.2.conv3.weight”, “0.4.2.bn3.weight”, “0.4.2.bn3.bias”, “0.4.2.bn3.running_mean”, “0.4.2.bn3.running_var”, “0.5.0.conv1.weight”, “0.5.0.bn1.weight”, “0.5.0.bn1.bias”, “0.5.0.bn1.running_mean”, “0.5.0.bn1.running_var”, “0.5.0.conv2.weight”, “0.5.0.bn2.weight”, “0.5.0.bn2.bias”, “0.5.0.bn2.running_mean”, “0.5.0.bn2.running_var”, "0.5.0.conv3…
Unexpected key(s) in state_dict: “i_h.weight”, “rnn.weight_ih_l0”, “rnn.weight_hh_l0”, “rnn.bias_ih_l0”, “rnn.bias_hh_l0”, “rnn.weight_ih_l1”, “rnn.weight_hh_l1”, “rnn.bias_ih_l1”, “rnn.bias_hh_l1”, “h_o.weight”, “h_o.bias”.

I know that I probably just don’t understand what’s happening behind the scenes. Could anyone give me a hint?

Try using learn.save instead of learn.export. The first creates a .pth file, which is basically just the weights, and learn.load can load those weights back into an existing architecture. learn.export, on the other hand, not only saves the weights but pickles the whole Learner (creating a .pkl file), including the DataLoaders, callbacks, etc.; that is what you want for inference.
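A rough sketch of the per-session loop, assuming the same dls/Learner setup as in your code (the file names 'stage1', 'stage2' and 'export.pkl' are just illustrative; learn.save and learn.load resolve paths relative to learn.path/models by default):

# --- session 1: train one epoch, then save only the weights ---
learn = cnn_learner(dls, resnet50, metrics=error_rate)
learn.fine_tune(1)
learn.save('stage1')   # writes <learn.path>/models/stage1.pth (weights + optimizer state)

# --- session 2, after reconnecting: rebuild the same Learner, load the weights, keep going ---
dls = ImageDataLoaders.from_folder(path, train='train', valid='val', bs=8)
learn = cnn_learner(dls, resnet50, metrics=error_rate)
learn.load('stage1')   # note: no .pth extension here
learn.fine_tune(1)
learn.save('stage2')

# --- only once training is finished: export the whole Learner for inference ---
learn.export(fname='export.pkl')
# later, for predictions: learn_inf = load_learner(path + '/export.pkl')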

You probably got that error because the architecture from the old project doesn’t match your current one; in fact, the unexpected keys in the traceback (i_h, rnn.*, h_o) look like they belong to an RNN rather than a ResNet, so that .pth seems to come from a different model entirely. PyTorch is very strict about this and will raise an error even if only one of the thousands of weights doesn’t match.
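If you want to check what an old .pth actually contains before calling learn.load, something like this works (plain PyTorch; the path is the one from your error message, and the 'model' key handling assumes the file was written by learn.save with the optimizer state included):

import torch

state = torch.load('/content/drive/MyDrive/model1.pth', map_location='cpu')
# learn.save typically stores {'model': ..., 'opt': ...}; a bare state_dict is also possible
sd = state['model'] if isinstance(state, dict) and 'model' in state else state
print(list(sd.keys())[:10])   # keys like rnn.* / h_o.* mean this is not a ResNet checkpoint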

Also, check your training for performance bottlenecks. 8 hours per epoch is very long: I’ve trained models on ImageNet, and an epoch took about 40 minutes for large models, so unless your dataset is much larger than ImageNet there is probably room to optimize.
In my experience, a common bottleneck is resizing. You seem to be training on x-rays, which are high resolution (something like 3000 x 4000 px). Consider resizing them once to a more convenient size, such as 512x512, saving the results to disk, and training on those. This usually speeds up my training by a factor of 10 or more.
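A one-off preprocessing pass with plain PIL is enough for that; a minimal sketch, where the folder names and the *.png pattern are just placeholders for your own layout:

from pathlib import Path
from PIL import Image

src = Path('/content/drive/MyDrive/Data1/train')        # original high-res images
dst = Path('/content/drive/MyDrive/Data1_512/train')    # resized copies go here

for img_path in src.rglob('*.png'):                      # adjust the pattern to your file type
    out_path = dst / img_path.relative_to(src)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    img = Image.open(img_path).convert('RGB')
    img.thumbnail((512, 512))                            # keeps aspect ratio, longest side <= 512 px
    img.save(out_path)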


Hi joao, hope all is well!



After you have carried out BresNet’s recommendations, you may find the above links useful; there are also some posts on this site and on Medium that I have found useful when training models on Colab that take a long time.

Cheers mrfabulous1 :smiley: :smiley:


Thanks, man! You were right about the time: it dropped significantly. Thanks for the help!
