Problem moving a trained network between computers

pryb · July 29, 2018, 7:34pm

Hi,

I’m trying to move a model from one machine to another and I’m stuck.

I have a Paperspace VM where I train image classification models (following lesson 1).

I want to use a trained model on my local laptop, which has no GPU.

I transferred the model file (.h5) to my laptop, where I want to classify the images.

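The code I use to load it is:

from fastai.conv_learner import *   # fastai 0.7 imports, as in the lesson 1 notebook

PATH = "/home/pryb/data/brick/"
sz = 224
arch = resnet34

# Build the data object from the folders under PATH, then the learner on top of it
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=False)
learn.load('224_brick_last')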

When I call learn.load, I get:

Traceback (most recent call last):
  File "/home/pryb/anaconda3/envs/fastai-cpu2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 514, in load_state_dict
    own_state[name].copy_(param)
RuntimeError: inconsistent tensor size, expected tensor [5 x 512] and src [10 x 512] to have the same number of elements, but got 2560 and 5120 elements respectively at /opt/conda/conda-bld/pytorch_1523244252089/work/torch/lib/TH/generic/THTensorCopy.c:86

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "server.py", line 79, in <module>
    (learn, val_tfms) = initNeuralNetwork()
  File "server.py", line 27, in initNeuralNetwork
    learn.load('224_brick_last')
  File "/home/pryb/fastai/fastai/learner.py", line 107, in load
    load_model(self.model, self.get_model_path(name))
  File "/home/pryb/fastai/fastai/torch_imports.py", line 40, in load_model
    m.load_state_dict(sd)
  File "/home/pryb/anaconda3/envs/fastai-cpu2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 519, in load_state_dict
    .format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named 16.weight, whose dimensions in the model are torch.Size([5, 512]) and whose dimensions in the checkpoint are torch.Size([10, 512]).
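For readers hitting the same error: the second RuntimeError is PyTorch's load_state_dict refusing to copy a saved parameter into a layer of a different shape. nn.Linear stores its weight as (out_features, in_features), so for the final layer the first dimension is the number of output classes. Below is a hypothetical, pure-Python imitation of that shape comparison (find_shape_mismatches is an illustrative helper, not PyTorch's actual code), fed with the shapes from the traceback. With real objects, the two dicts could be built as {k: tuple(v.shape) for k, v in sd.items()}, once from learn.model.state_dict() and once from torch.load(path, map_location='cpu').

```python
# Illustrative sketch only: a pure-Python imitation of the shape check that
# torch.nn.Module.load_state_dict performs. It is NOT PyTorch's implementation.

def find_shape_mismatches(model_shapes, checkpoint_shapes):
    """Return (name, model_shape, checkpoint_shape) for every parameter
    present in both dicts whose shapes differ."""
    mismatches = []
    for name, model_shape in model_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches.append((name, tuple(model_shape), tuple(ckpt_shape)))
    return mismatches


# Shapes copied from the traceback: the final Linear layer "16.weight"
# has one row per output class (out_features x in_features).
local_model = {"16.weight": (5, 512)}
saved_checkpoint = {"16.weight": (10, 512)}

for name, m_shape, c_shape in find_shape_mismatches(local_model, saved_checkpoint):
    print(f"{name}: model {m_shape} vs checkpoint {c_shape} "
          f"-> model has {m_shape[0]} outputs, checkpoint has {c_shape[0]}")
```

Read this way, the traceback says the model built on the laptop has 5 outputs while the saved weights were trained for 10.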

I have the latest version from GitHub on both machines, and the same code works fine on the original (Paperspace) machine.

I searched the forum for this exception, but only found advice to set precompute=False, which didn’t help. What am I doing wrong?

Pavel

stephenjohnson · July 29, 2018, 9:49pm

Just a wild guess on my part, but the network topology is built by ConvLearner.pretrained from the data object (and therefore from the folders under PATH), not during learn.load, which only copies the saved weights into it. Do you have the same dataset on Paperspace as locally, that is, the same number of classes (subfolders)? The error says the final layer (16.weight) is [5 x 512] in the model built on your laptop but [10 x 512] in the saved checkpoint, so it looks like the Paperspace dataset has 10 classes and your local one only 5. As I said, it’s a wild guess.
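One way to check this guess before calling learn.load is to count the class subfolders on each machine. A minimal sketch, assuming the lesson 1 layout (one subfolder per class under PATH/train) and that hidden folders such as .ipynb_checkpoints should be ignored. class_folders and check_class_count are hypothetical helpers, and fastai's own folder scanning may differ in details. The expected count (10 here) is read off the checkpoint's 16.weight shape in the error message.

```python
import os


def class_folders(data_path, train_name="train"):
    """Sorted names of the non-hidden subfolders of <data_path>/<train_name>.
    Assumption: each subfolder is one class, as in the lesson 1 layout."""
    train_dir = os.path.join(data_path, train_name)
    return sorted(
        entry for entry in os.listdir(train_dir)
        if not entry.startswith(".")                      # skip .ipynb_checkpoints etc.
        and os.path.isdir(os.path.join(train_dir, entry))  # skip loose files
    )


def check_class_count(data_path, checkpoint_classes, train_name="train"):
    """Raise ValueError if the local dataset has a different number of classes
    than the checkpoint was trained with; otherwise return the class names."""
    classes = class_folders(data_path, train_name)
    if len(classes) != checkpoint_classes:
        raise ValueError(
            f"dataset at {data_path!r} has {len(classes)} classes {classes}, "
            f"but the checkpoint expects {checkpoint_classes}"
        )
    return classes
```

Calling check_class_count("/home/pryb/data/brick/", 10) on each machine before building data would flag a stale local copy with the wrong number of class folders, instead of failing later inside learn.load.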

pryb · July 30, 2018, 5:09am

Good guess, that was exactly the problem. I hadn’t realized I had an older version of the dataset locally.

Thanks a lot.

stephenjohnson · July 30, 2018, 5:33am

That’s great!