CUDA runtime error 59 when recreating Lesson 1 with my own data [SOLVED]

maxim.pechyonkin · May 21, 2018, 9:12am

I decided to build an image classifier like the one in Lesson 1, using my own data captured with a camera.
I have only 2 classes, and I have recreated the folder structure:

data
  train
    class1
    class2
  valid
    class1
    class2

Then I follow the steps from Lesson 1 without changing anything except the PATH variable, but I keep getting cuda runtime error (59) when I run this code:

arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), bs=8)
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)

The error originates deep inside the library, and I don’t understand its internals well enough to debug it. If anyone has encountered a similar problem, please help.
By the way, I got the same error with the Kaggle Whale Competition data, and I don’t know what the problem is there either.

Here is the error log:

----------------------------------------------------------------------
RuntimeError                         Traceback (most recent call last)
<ipython-input-15-3033121109a3> in <module>()
      3 # pdb.set_trace()
      4 learn = ConvLearner.pretrained(arch, data, precompute=True)
----> 5 learn.fit(0.01, 2)

~/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    212         self.sched = None
    213         layer_opt = self.get_layer_opt(lrs, wds)
--> 214         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    215 
    216     def warm_up(self, lr, wds=None):

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    159         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    160         return fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 161             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    162 
    163     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, stepper, **kwargs)
    104             i += 1
    105 
--> 106         vals = validate(stepper, data.val_dl, metrics)
    107         if epoch == 0: print(layout.format(*names))
    108         print_stats(epoch, [debias_loss] + vals)

~/fastai/courses/dl1/fastai/model.py in validate(stepper, dl, metrics)
    125     for (*x,y) in iter(dl):
    126         preds,l = stepper.evaluate(VV(x), VV(y))
--> 127         loss.append(to_np(l))
    128         res.append([f(preds.data,y) for f in metrics])
    129     return [np.mean(loss)] + list(np.mean(np.stack(res),0))

~/fastai/courses/dl1/fastai/core.py in to_np(v)
     39     if isinstance(v, (list,tuple)): return [to_np(o) for o in v]
     40     if isinstance(v, Variable): v=v.data
---> 41     return v.cpu().numpy()
     42 
     43 USE_GPU=True

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in cpu(self)
     43     def cpu(self):
     44         r"""Returns a CPU copy of this tensor if it's not already on the CPU"""
---> 45         return self.type(getattr(torch, self.__class__.__name__))
     46 
     47     def double(self):

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py in type(self, *args, **kwargs)
    394     def type(self, *args, **kwargs):
    395         with device(self.get_device()):
--> 396             return super(_CudaBase, self).type(*args, **kwargs)
    397 
    398     __new__ = _lazy_new

~/miniconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _type(self, new_type, async)
     36     if new_type.is_sparse:
     37         raise RuntimeError("Cannot cast dense tensor to sparse tensor")
---> 38     return new_type(self.size()).copy_(self, async)
     39 
     40 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:70
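
Note on reading this traceback: CUDA kernels run asynchronously, so a device-side assert is often reported at a later call (here, the copy to CPU in to_np) rather than at the operation that actually failed. A minimal sketch of one common way to get a more precise traceback, assuming the environment variable is set before torch or fastai is imported in a fresh kernel:

```python
import os

# Assumption: this cell runs first, before torch/fastai are imported,
# so the setting is already in place when CUDA is initialized.
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python
# traceback points at the call that triggered the assert.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

This slows training down, so it is probably best used only while debugging. Running the same step on the CPU is another option that usually gives a plain Python error message.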

Edit: my folders and files are as follows:

[image: folders]

Edit2: weirdly, I discovered the following information about the training and validation data set labels. Can this be the problem?

> data.trn_ds.y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

> data.val_ds.y

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2])

Shouldn’t the labels be the same for both data sets?
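
In the output above, the training labels take the values 0 and 1 while the validation labels take 1 and 2, so the validation set contains a label outside the range a 2-class model expects. An out-of-range target index is a common cause of a device-side assert. A rough sketch of a check for this, using plain Python lists as stand-ins for data.trn_ds.y and data.val_ds.y (the helper name label_report is hypothetical, not part of fastai):

```python
from collections import Counter

def label_report(labels, n_classes):
    """Count each label and list any label outside 0..n_classes-1."""
    counts = Counter(int(y) for y in labels)
    out_of_range = sorted(y for y in counts if y < 0 or y >= n_classes)
    return counts, out_of_range

# Toy stand-ins for data.trn_ds.y and data.val_ds.y (assumed values)
trn_y = [0, 0, 0, 1, 1, 1]
val_y = [1, 1, 2, 2]

# Number of classes the model is built for, taken from the training labels
n_classes = len(set(trn_y))

counts, bad = label_report(val_y, n_classes)
print("validation label counts:", dict(counts))
print("labels outside the expected range:", bad)
```

With the real arrays, a non-empty second list would suggest that the validation folder defines classes the training folder does not.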

maxim.pechyonkin · May 21, 2018, 9:45am

I have found the problem – there was an additional unwanted folder inside the valid folder. I removed it and everything works alright.

P.S. I wonder if there is a similar problem with the Whales Kaggle competition, because I had a similar cuda runtime error (59) when doing that.
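
For anyone hitting the same thing: an extra folder inside valid can be spotted before training by comparing the subfolders of train and valid. A small sketch using only the standard library; the function names and the folder name "extra" are made up for illustration, and the demo builds a throwaway directory tree rather than using real data:

```python
import os
import tempfile

def class_folders(path):
    """Return the sorted names of the immediate subfolders of path."""
    return sorted(
        d for d in os.listdir(path)
        if os.path.isdir(os.path.join(path, d))
    )

def compare_splits(root, train="train", valid="valid"):
    """Return (folders only in train, folders only in valid)."""
    trn = set(class_folders(os.path.join(root, train)))
    val = set(class_folders(os.path.join(root, valid)))
    return sorted(trn - val), sorted(val - trn)

# Demo on a temporary tree with one unwanted folder inside valid
with tempfile.TemporaryDirectory() as root:
    layout = {
        "train": ["class1", "class2"],
        "valid": ["class1", "class2", "extra"],  # "extra" is hypothetical
    }
    for split, names in layout.items():
        for name in names:
            os.makedirs(os.path.join(root, split, name))
    only_train, only_valid = compare_splits(root)

print("only in train:", only_train)
print("only in valid:", only_valid)
```

On real data, compare_splits(PATH) should return two empty lists when both splits contain exactly the same class folders.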

brtknr · May 21, 2018, 9:46am

Yes, the labels should be the same. Check there are no spaces in the label names.

Re CUDA: which graphics card do you have, and which version of CUDA is installed? It is usually helpful to post relevant information like that… Also try running the notebook on a Kaggle kernel with GPU enabled and see if you still get the same issue.

maxim.pechyonkin · May 21, 2018, 9:48am

The folder structure was incorrect. I don’t think GPU version is relevant here, because fixing the folders helped.