Training a model from scratch: CIFAR 10

If you pass the cycle_save_name parameter when calling fit, e.g. learn.fit(lr, 3, cycle_len=1, cycle_save_name='model'), it will save the model after each cycle, with the cycle number appended to the model filename, e.g. model_cyc_1, so you can load it back the same way!
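
A minimal sketch of the round trip, assuming the fastai API described above (the exact cycle index in the saved filename may vary, e.g. cycles may be numbered from 0):

learn.fit(1e-2, 3, cycle_len=1, cycle_save_name='model')
# each cycle's weights are saved under the models/ directory, e.g. model_cyc_1.h5
learn.load('model_cyc_1')  # restore the weights saved after that cycle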


cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:87

import sys
sys.path.append('../../../fastai/')

import pandas as pd
import numpy as np
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

from fastai.models.cifar10.senet import SENet18
from fastai.dataset import *
from fastai.conv_learner import *

%matplotlib inline

train = pd.read_json('../data/data/processed/train.json')

# each band is stored as a flattened list; reshape to a 75x75 image
band1 = train.band_1.apply(lambda x: np.array(x).reshape(75, 75))
band2 = train.band_2.apply(lambda x: np.array(x).reshape(75, 75))

# stack the two bands into an array of shape (n_samples, 2, 75, 75)
traindata = np.array(list(zip(band1, band2)))

# permute once so the train and validation indices don't overlap
# (calling np.random.permutation twice would give two different
# orderings and leak training samples into the validation set)
np.random.seed(42)
idx = np.random.permutation(len(train))
train_idx, val_idx = idx[:1100], idx[1100:]

traintensor = FloatTensor(traindata[train_idx]).cuda()
validtensor = FloatTensor(traindata[val_idx]).cuda()

traintarget = LongTensor(list(train.is_iceberg[train_idx])).cuda()
validtarget = LongTensor(list(train.is_iceberg[val_idx])).cuda()

train_ds = TensorDataset(traintensor, traintarget)
valid_ds = TensorDataset(validtensor, validtarget)

bs = 16
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=bs, shuffle=True)

data = ModelData(path='../data/data/processed/', trn_dl=train_dl, val_dl=valid_dl)

# attributes ConvLearner expects but a hand-built ModelData doesn't set
data.c = 2                   # number of classes
data.is_multi = False        # single-label classification
data.trn_ds.is_reg = False   # classification, not regression

bm = BasicModel(SENet18().cuda(), 'senet_cifar10')
learn = ConvLearner(data, bm)
learn.load('sen_32x32_8')    # <-- fails with the error below

RuntimeError                              Traceback (most recent call last)
<ipython-input> in <module>()
     53 learn = ConvLearner(data, bm)
     54
---> 55 learn.load('sen_32x32_8')

~/fastai/fastai/learner.py in load(self, name)
     61     def get_model_path(self, name): return os.path.join(self.models_path, name) + '.h5'
     62     def save(self, name): save_model(self.model, self.get_model_path(name))
---> 63     def load(self, name): load_model(self.model, self.get_model_path(name))
     64
     65     def set_data(self, data): self.data_ = data

~/fastai/fastai/torch_imports.py in load_model(m, p)
     20 def children(m): return m if isinstance(m, (list, tuple)) else list(m.children())
     21 def save_model(m, p): torch.save(m.state_dict(), p)
---> 22 def load_model(m, p): m.load_state_dict(torch.load(p))
     23
     24 def load_pre(pre, f, fn):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
    229         f = open(f, 'rb')
    230     try:
--> 231         return _load(f, map_location, pickle_module)
    232     finally:
    233         if new_fd:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/serialization.py in _load(f, map_location, pickle_module)
    377     unpickler = pickle_module.Unpickler(f)
    378     unpickler.persistent_load = persistent_load
--> 379     result = unpickler.load()
    380
    381     deserialized_storage_keys = pickle_module.load(f)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/serialization.py in persistent_load(saved_id)
    348             if root_key not in deserialized_objects:
    349                 deserialized_objects[root_key] = restore_location(
--> 350                     data_type(size), location)
    351             storage = deserialized_objects[root_key]
    352             if view_metadata is not None:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/serialization.py in default_restore_location(storage, location)
     83 def default_restore_location(storage, location):
     84     for _, _, fn in _package_registry:
---> 85         result = fn(storage, location)
     86         if result is not None:
     87             return result

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/serialization.py in _cuda_deserialize(obj, location)
     65     if location.startswith('cuda'):
     66         device_id = max(int(location[5:]), 0)
---> 67         return obj.cuda(device_id)
     68

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _cuda(self, device, async)
     56     if device is None:
     57         device = -1
---> 58     with torch.cuda.device(device):
     59         if self.is_sparse:
     60             new_type = getattr(torch.cuda.sparse, self.__class__.__name__)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py in __enter__(self)
    126         self.prev_idx = torch._C._cuda_getDevice()
    127         if self.prev_idx != self.idx:
--> 128             torch._C._cuda_setDevice(self.idx)
    129
    130     def __exit__(self, *args):

RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:87

I have no trouble loading the resnext32x32_8 weights on AWS, but SENet gives this error. Could it be because of which GPU device the model was trained on?

Whenever I try to import SENet it throws an error… Why?

@kcturgutlu and @ecdrid I was getting this same error loading the SENet18 weight file, so that was one of the reasons I re-trained CIFAR10 from scratch. I can make my weight file available for both of you to download to see if it works.

However, Jeremy also just made another version of SENet available (SENet34), and that particular weight file does work for me, so you could also just try that one directly instead of the SENet18 version. Here is where Jeremy shared that weight file:

Note that if you use the SENet34 version, you'll need to update all of the import statements to match (just replace 18 with 34 everywhere), as in the sketch below.
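
A minimal sketch of the swap, assuming fastai.models.cifar10.senet exposes a SENet34 constructor analogous to SENet18:

from fastai.models.cifar10.senet import SENet34  # was: SENet18

bm = BasicModel(SENet34().cuda(), 'senet_cifar10')  # was: SENet18()
learn = ConvLearner(data, bm)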


That is going to be very nice to know going forward. I was just going to change everything to one epoch and save, but this is way better. Thanks for pointing that out.


Thanks, I am able to load SENet34 now, but I'm getting this error: TypeError: batch must contain numbers, dicts or lists; found <class 'torch.FloatTensor'>. Probably because I am using a custom PyTorch DataLoader.

[screenshot of training results]

Can someone explain this weird behaviour? Is the model badly overfitting?

To add more details: I simply re-ran the previous day's code (it had an accuracy of 0.93), but I don't know what's wrong with it now.

What could account for this randomness?

We can’t provide useful help without knowing the details of what you’ve done. Some tips here: http://wiki.fast.ai/index.php/How_to_ask_for_Help . Sharing a gist of your notebook is often a good idea.

Hmmm… My guess is that PyTorch stores the device ID of the GPU in the file. So maybe I trained on device 1 on my system, but you loaded it with device 0 on your system.

If anyone is interested in helping fix this, I think you need to:

  • Try saving a model with senet18 yourself (doesn’t need to be trained at all)
  • Use torch.load to load that model, and to load the saved model I provided
  • That will return an OrderedDict (IIRC). Check the difference between the 2 sets of keys. If my guess is right, the first number will be 0 on one and 1 on the other, or something like that
  • If that’s the case, try to rename the keys in my weights to match the same form as the one you saved
  • Save that new weight file, and see if it works!

If that works, then a great help would be to try to wrap all that into a function e.g. change_file_gpuid(weights_file_in, weights_file_out, new_gpuid) which replaces the GPU ID in a weights file with some ID you choose. Then any of us could use that to fix up a weights file trained on a different ID to what we need…
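
A minimal sketch of that helper, assuming the weight file holds a plain state_dict, and using torch.load's map_location hook to remap the storages rather than renaming keys (the function name comes from the post above; this is one possible approach, not a definitive fix):

import torch

def change_file_gpuid(weights_file_in, weights_file_out, new_gpuid):
    # load every storage onto the chosen GPU, ignoring the device ID
    # that was recorded in the file at save time
    state = torch.load(weights_file_in,
                       map_location=lambda storage, loc: storage.cuda(new_gpuid))
    # re-save so the file now records the new device ID
    torch.save(state, weights_file_out)

# e.g. remap the shared SENet18 weights to GPU 0 (hypothetical filenames):
change_file_gpuid('sen_32x32_8.h5', 'sen_32x32_8_gpu0.h5', 0)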


Here’s the notebook (using nbviewer).

@jeremy, I managed to load the weights from the file after a manual hack on this function :grinning:

def load_model(m, p): m.load_state_dict(torch.load(p, map_location={'cuda:1': 'cuda:0'}))

The device recorded in the file was indeed 1 (cuda:1).

The next step would be to wrap this into a new function to change the GPU ID.


That is very helpful!

@jeremy, @jakcycsl - you might find this PyTorch post very useful:

Docs: http://pytorch.org/docs/master/torch.html?highlight=load#torch.load
Examples section for torch.load:

>>> torch.load('tensors.pt')
# Load all tensors onto the CPU
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage)
# Load all tensors onto GPU 1
>>> torch.load('tensors.pt', map_location=lambda storage, loc: storage.cuda(1))
# Map tensors from GPU 1 to GPU 0
>>> torch.load('tensors.pt', map_location={'cuda:1':'cuda:0'})

Maybe you might want to always do

torch.load('tensors.pt', map_location=lambda storage, loc: storage)

and then move it to the GPU in your own environment?


Yeah I’m thinking I should make that change in fastai. I can’t think of any downside offhand, can you?

Yeah, it makes sense to keep it as the default. There might be a slight overhead from loading to the CPU and then moving to the GPU, but that's a one-time cost in the load process and might be happening internally anyway.

We just have to test out a few scenarios:

  1. The model is already on the GPU and we load the weights with this method: does it cause an error, or do the weights automatically move to the GPU the model is on?
  2. The model is on the CPU; load with this method, then move to the GPU (see the sketch after this list). This I expect to have no issues.
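
A minimal sketch of scenario 2, with nn.Linear standing in for the real network and weights.h5 as a placeholder filename:

import torch
from torch import nn

model = nn.Linear(10, 2)  # stand-in for the real network

# load every storage onto the CPU, regardless of the device it was saved from
state = torch.load('weights.h5', map_location=lambda storage, loc: storage)
model.load_state_dict(state)
model.cuda()  # then move the model, weights included, to the local GPU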

Just tested with the latest commit: loading the weights directly to the CPU works perfectly fine.

Is it odd to get 98% validation accuracy using these weights for CIFAR-10?

Best I’ve got is a bit under 96%. I think 98% would be a new SoTA. So yes :slight_smile:

@mmr If you use the pre-trained weights from CIFAR10 to train again on CIFAR10, then you most likely have a validation set that was already trained on (your model has already seen it), so the accuracy is not a reliable metric, unless you used the exact same validation set the pre-trained weights were trained with.


Well, I am running the same code you uploaded, and my results are a little bit different.
