CUDA runtime error(59) in kaggle whale competition


(alex.huang) #1

Hi there,

I ran into a CUDA problem in the Kaggle Humpback Whale Identification Challenge. I wrote the code following the steps of lessons 1 & 2. At the end of one epoch, it throws a CUDA error. I have tried setting metrics=None and tuning bs and sz, but neither worked.

For anyone bold enough to have read this far, any ideas on what I may have pooched?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-21752d933788> in <module>()
----> 1 learn.fit(lr, 1)

~/kaggle/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    213         self.sched = None
    214         layer_opt = self.get_layer_opt(lrs, wds)
--> 215         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    216 
    217     def warm_up(self, lr, wds=None):

~/kaggle/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    160         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    161         return fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 162             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    163 
    164     def get_layer_groups(self): return self.models.get_layer_groups()

~/kaggle/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, stepper, **kwargs)
    104             i += 1
    105 
--> 106         vals = validate(stepper, data.val_dl, metrics)
    107         if epoch == 0: print(layout.format(*names))
    108         print_stats(epoch, [debias_loss] + vals)

~/kaggle/fastai/model.py in validate(stepper, dl, metrics)
    125     for (*x,y) in iter(dl):
    126         preds,l = stepper.evaluate(VV(x), VV(y))
--> 127         loss.append(to_np(l))
    128         res.append([f(preds.data,y) for f in metrics])
    129     return [np.mean(loss)] + list(np.mean(np.stack(res),0))

~/kaggle/fastai/core.py in to_np(v)
     38     if isinstance(v, (list,tuple)): return [to_np(o) for o in v]
     39     if isinstance(v, Variable): v=v.data
---> 40     return v.cpu().numpy()
     41 
     42 USE_GPU=True

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in cpu(self)
     43     def cpu(self):
     44         r"""Returns a CPU copy of this tensor if it's not already on the CPU"""
---> 45         return self.type(getattr(torch, self.__class__.__name__))
     46 
     47     def double(self):

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py in type(self, *args, **kwargs)
    394     def type(self, *args, **kwargs):
    395         with device(self.get_device()):
--> 396             return super(_CudaBase, self).type(*args, **kwargs)
    397 
    398     __new__ = _lazy_new

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _type(self, new_type, async)
     36     if new_type.is_sparse:
     37         raise RuntimeError("Cannot cast dense tensor to sparse tensor")
---> 38     return new_type(self.size()).copy_(self, async)
     39 
     40 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:70

My code is below:

arch = resnet34
bs = 64
PATH = 'data/whale'

def mvp5(preds, targs):
    preds = np.exp(preds)
    min5 = np.sort(preds)[:, :5]
    return np.mean(min5)

metrics = [mvp5]

def get_data(sz):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.05)
    return ImageClassifierData.from_csv(PATH, 'train', f'{PATH}/train.csv', bs=bs, val_idxs=val_idxs,
                                        tfms=tfms, test_name='test')

sz = 224
data = get_data(sz)

len(data.val_ds.fnames), len(data.val_ds.y)
(1970, 1970)

learn = ConvLearner.pretrained(arch, data, metrics=metrics)
learn.lr_find()
learn.sched.plot()
lr = 0.4
learn.fit(lr, 1)    # ERROR

Kaggle's Whale Competition
(alex.huang) #2

I forgot to mention: I ran the code on an AWS p2 instance and have updated the fastai library to the latest version. In addition, before the error happened, an exception was thrown, but I didn’t stop the training:

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py", line 144, in run
    for instance in self.tqdm_cls._instances:
  File "/home/ubuntu/src/anaconda3/envs/fastai/lib/python3.6/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration

#3

I’m also having this problem, on the lesson 8 notebook though.


#4

Did you find a solution to this problem already?


#5

I also got the cuda runtime error (59) in my Humpback Whales notebook.

Here is the code:

# Path to data
PATH = "./data/whales/"

# Image size and batch size
sz = 224
bs = 64

# Network architecture
arch = resnet34

# Transforms for image augmentation
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.3)

# Create data object; note the file identifier already has the '.jpg' suffix
data = ImageClassifierData.from_csv(PATH, csv_fname='./data/whales/train.csv',
                                    folder='train', test_name='test', suffix='',
                                    bs=bs, tfms=tfms_from_model(arch, sz))

# Precompute weights
learn = ConvLearner.pretrained(arch, data, precompute=True)

# Fit the model
learn.fit(lrs=.01, n_cycle=1)

The progress bar widget shows that it gets halfway through the epoch before failing on the last step:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:20

If I run it again, it immediately throws the same error before the progress bar shows.

Full error message included below:

RuntimeError                              Traceback (most recent call last)
in <module>()
      1 # Fit the model
      2 # Learning rate 0.01, 3 epoch
----> 3 learn.fit(lrs=.1, n_cycle=1)

~/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    285         self.sched = None
    286         layer_opt = self.get_layer_opt(lrs, wds)
--> 287         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    288 
    289     def warm_up(self, lr, wds=None):

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    232             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    233             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234             swa_eval_freq=swa_eval_freq, **kwargs)
    235 
    236     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
    148 
    149         if not all_val:
--> 150             vals = validate(model_stepper, cur_data.val_dl, metrics)
    151             stop=False
    152             for cb in callbacks: stop = stop or cb.on_epoch_end(vals)

~/fastai/courses/dl1/fastai/model.py in validate(stepper, dl, metrics)
    211         if isinstance(x,list): batch_cnts.append(len(x[0]))
    212         else: batch_cnts.append(len(x))
--> 213         loss.append(to_np(l))
    214         res.append([f(preds.data,y.data) for f in metrics])
    215     return [np.average(loss, 0, weights=batch_cnts)] + list(np.average(np.stack(res), 0, weights=batch_cnts))

~/fastai/courses/dl1/fastai/core.py in to_np(v)
     42     if isinstance(v, Variable): v=v.data
     43     if isinstance(v, torch.cuda.HalfTensor): v=v.float()
---> 44     return v.cpu().numpy()
     45 
     46 IS_TORCH_04 = LooseVersion(torch.__version__) >= LooseVersion('0.4')

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/tensor.py in cpu(self)
     43     def cpu(self):
     44         r"""Returns a CPU copy of this tensor if it's not already on the CPU"""
---> 45         return self.type(getattr(torch, self.__class__.__name__))
     46 
     47     def double(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py in type(self, *args, **kwargs)
    394     def type(self, *args, **kwargs):
    395         with device(self.get_device()):
--> 396             return super(_CudaBase, self).type(*args, **kwargs)
    397 
    398     __new__ = _lazy_new

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _type(self, new_type, async)
     36     if new_type.is_sparse:
     37         raise RuntimeError("Cannot cast dense tensor to sparse tensor")
---> 38     return new_type(self.size()).copy_(self, async)
     39 
     40 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:70

Cheers,
Joe


(Ryder) #6

Same error here too! Darn, I was excited to try this out on my first Kaggle entry. Whale competition as well.

-------------------------------------------------------------------------
RuntimeError                            Traceback (most recent call last)
<timed eval> in <module>()

~/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    207         self.sched = None
    208         layer_opt = self.get_layer_opt(lrs, wds)
--> 209         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    210 
    211     def warm_up(self, lr, wds=None):

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, use_clr, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, **kwargs)
    154         n_epoch = sum_geom(cycle_len if cycle_len else 1, cycle_mult, n_cycle)
    155         return fit(model, data, n_epoch, layer_opt.opt, self.crit,
--> 156             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, **kwargs)
    157 
    158     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, epochs, opt, crit, metrics, callbacks, **kwargs)
    104             i += 1
    105 
--> 106         vals = validate(stepper, data.val_dl, metrics)
    107         if epoch == 0: print(layout.format(*names))
    108         print_stats(epoch, [debias_loss] + vals)

~/fastai/courses/dl1/fastai/model.py in validate(stepper, dl, metrics)
    125     for (*x,y) in iter(dl):
    126         preds,l = stepper.evaluate(VV(x), VV(y))
--> 127         loss.append(to_np(l))
    128         res.append([f(preds.data,y) for f in metrics])
    129     return [np.mean(loss)] + list(np.mean(np.stack(res),0))

~/fastai/courses/dl1/fastai/core.py in to_np(v)
     35     if isinstance(v, (list,tuple)): return [to_np(o) for o in v]
     36     if isinstance(v, Variable): v=v.data
---> 37     return v.cpu().numpy()
     38 
     39 USE_GPU=True

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THC/generic/THCTensorCopy.c:70

(Ryder) #7

Here’s the error with CUDA_LAUNCH_BLOCKING=1 jupyter notebook

-------------------------------------------------------------------------
RuntimeError                            Traceback (most recent call last)
<timed eval> in <module>()

~/fastai/courses/dl1/fastai/learner.py in fit(self, lrs, n_cycle, wds, **kwargs)
    285         self.sched = None
    286         layer_opt = self.get_layer_opt(lrs, wds)
--> 287         return self.fit_gen(self.model, self.data, layer_opt, n_cycle, **kwargs)
    288 
    289     def warm_up(self, lr, wds=None):

~/fastai/courses/dl1/fastai/learner.py in fit_gen(self, model, data, layer_opt, n_cycle, cycle_len, cycle_mult, cycle_save_name, best_save_name, use_clr, use_clr_beta, metrics, callbacks, use_wd_sched, norm_wds, wds_sched_mult, use_swa, swa_start, swa_eval_freq, **kwargs)
    232             metrics=metrics, callbacks=callbacks, reg_fn=self.reg_fn, clip=self.clip, fp16=self.fp16,
    233             swa_model=self.swa_model if use_swa else None, swa_start=swa_start,
--> 234             swa_eval_freq=swa_eval_freq, **kwargs)
    235 
    236     def get_layer_groups(self): return self.models.get_layer_groups()

~/fastai/courses/dl1/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
    157 
    158         if not all_val:
--> 159             vals = validate(model_stepper, cur_data.val_dl, metrics)
    160             stop=False
    161             for cb in callbacks: stop = stop or cb.on_epoch_end(vals)

~/fastai/courses/dl1/fastai/model.py in validate(stepper, dl, metrics)
    215     with no_grad_context():
    216         for (*x,y) in iter(dl):
--> 217             preds, l = stepper.evaluate(VV(x), VV(y))
    218             if isinstance(x,list): batch_cnts.append(len(x[0]))
    219             else: batch_cnts.append(len(x))

~/fastai/courses/dl1/fastai/model.py in evaluate(self, xs, y)
     77         preds = self.m(*xs)
     78         if isinstance(preds,tuple): preds=preds[0]
---> 79         return preds, self.crit(preds, y)
     80 
     81 def set_train_mode(m):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce)
   1330                          .format(input.size(0), target.size(0)))
   1331     if dim == 2:
-> 1332         return torch._C._nn.nll_loss(input, target, weight, size_average, ignore_index, reduce)
   1333     elif dim == 4:
   1334         return torch._C._nn.nll_loss2d(input, target, weight, size_average, ignore_index, reduce)

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1524586445097/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:116
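
For context, an assert raised from ClassNLLCriterion.cu usually indicates that at least one target label lies outside the range [0, n_classes) that the loss expects. The snippet below is a minimal sketch under that assumption, not fastai code: `find_bad_labels` is a hypothetical helper and `toy_labels` is made-up data. It also shows setting CUDA_LAUNCH_BLOCKING from Python, which only takes effect if it happens before CUDA is initialised.

```python
import os

# Must be set before the first CUDA call (simplest: before importing torch),
# so that the traceback points at the kernel that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def find_bad_labels(labels, n_classes):
    """Return (position, label) pairs whose label is outside [0, n_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < n_classes]


# Toy data (hypothetical): the model has 4 outputs, but one target is 4.
toy_labels = [0, 3, 1, 4, 2]
bad = find_bad_labels(toy_labels, n_classes=4)
print(bad)
```

Running the same kind of check on the CPU side, over the integer labels of the validation set, may be quicker than decoding the CUDA assert.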

(Maxim Pechyonkin) #8

This error is very weird. I also got it. And so far I haven’t been able to find a solution. Can anyone try to use Lesson 1 notebook for Whale Competition?


(Maxim Pechyonkin) #9

A possible reason can be mismatched labels for training, test and validation data sets, as discussed in this thread.
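
One way to test that idea is to list the labels that occur in the validation rows but never in the training rows. This is only a sketch: `unseen_val_labels` is a hypothetical helper, the CSV content is toy data, and the `Image`/`Id` column names are assumed to match train.csv.

```python
import csv
import io

# Toy stand-in for train.csv (hypothetical rows, assumed Image/Id columns).
toy_csv = """Image,Id
a.jpg,w_1
b.jpg,w_1
c.jpg,w_2
d.jpg,w_3
"""


def unseen_val_labels(rows, val_idxs):
    """Return labels that appear in the validation rows but not in training."""
    val_idxs = set(val_idxs)
    train = {r["Id"] for i, r in enumerate(rows) if i not in val_idxs}
    val = {r["Id"] for i, r in enumerate(rows) if i in val_idxs}
    return sorted(val - train)


rows = list(csv.DictReader(io.StringIO(toy_csv)))
# Row 3 (w_3) is the only image of its class, so holding it out leaves
# the training set without that label.
print(unseen_val_labels(rows, val_idxs=[1, 3]))
```

An empty result would suggest the mismatch lies somewhere else.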


#10

Hi guys!

I’m also having this issue. I’ve found that if you remove the validation part it works! (just set val_idxs = [0] in the ImageClassifierData.from_csv args)

However, my accuracy in the training set is 0.0 after 3 epochs. :face_with_raised_eyebrow:

Any ideas?


#11

Hey guys,

I encountered the same problem and I think I found a workaround.

Basically, we get this error because in the train.csv file there are whale Ids equal to “new_whale”, which do not correspond to any actual class. What I did was remove all the rows that have “new_whale” in the “Id” column.

import pandas as pd

label_csv = f'{PATH}train.csv'
train_df = pd.read_csv(label_csv)
# Keep only the rows whose Id is an actual whale, then save a filtered copy
df_without_new_whale = train_df.loc[train_df['Id'] != 'new_whale']
df_without_new_whale.to_csv(f'{PATH}train_without_new_whale.csv', index=False)

But then I only have an accuracy of 6% :joy:


#12

I’m not very good at Python so I may be wrong in my interpretation of your solution, but how is ‘new_whale’ not its own class? It lies in the csv file in the same column as the rest of the whale ids, so why is it treated differently?


#13

You are right, but ‘new_whale’ represents 8% of the dataset, so you can flag every whale with ‘new_whale’.

‘new_whale’ is not a class like the others; it’s more of a default classification. Check out the top 3 kernels on Kaggle for more information.

My best score without too much data augmentation is 11%
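
To see how much of a label column a single Id such as ‘new_whale’ takes up, a small count is enough. This is a sketch with toy data: `label_shares` is a hypothetical helper and the Ids below are invented, so the share it prints is not the real dataset figure.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its fraction of the rows, most frequent first."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.most_common()}


# Toy Ids (hypothetical), standing in for the Id column of train.csv.
toy_ids = ["new_whale", "w_1", "w_2", "new_whale", "w_1",
           "w_3", "w_4", "w_5", "w_6", "w_7"]
shares = label_shares(toy_ids)
print(shares["new_whale"])
```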


(Tony Hung) #14

I’m running into the same issue for the Humpback Whale Identification Challenge as well


(Morgan McGuire) #15

Posted here, I got around the problem by converting the image labels (Ids) to integers, solution courtesy of @jamesrees, see here:
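
The linked post is not reproduced here, so the following is only a sketch of what converting the Ids to integers could look like, and not necessarily the solution referred to above. `encode_labels` is a hypothetical helper and the Ids are toy data.

```python
def encode_labels(ids):
    """Map string Ids to integers 0..n_classes-1 in sorted order.

    Returns the encoded list and the class list, so that an integer
    prediction can be turned back into an Id via classes[idx].
    """
    classes = sorted(set(ids))
    class_to_idx = {c: i for i, c in enumerate(classes)}
    return [class_to_idx[c] for c in ids], classes


# Toy Ids (hypothetical), standing in for the Id column of train.csv.
toy_ids = ["w_2", "new_whale", "w_1", "w_2"]
encoded, classes = encode_labels(toy_ids)
print(encoded, classes)
```

Keeping `classes` around matters for the submission step, where predictions have to be mapped back to the original Ids.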


(Tony Hung) #16

What worked for me was to not set val_idxs when calling ImageClassifierData.from_csv

f_model = resnet34
sz = 224
bs = 64


tfms = tfms_from_model(f_model, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', PATH+'/train.csv', tfms=tfms, bs=bs, test_name='test')

This at least gets me around that CUDA error…


#17

Has anyone figured out what the real source of the problem was?

If it’s with the labels being classified as objects, then why do we not have that problem with the dog breeds dataset?

Or is it, as someone mentioned earlier, because the validation set contains classes that are not in the training set, due to having small classes?
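
If small classes are indeed the cause, one way to check would be to build the validation indices so that every class keeps at least one image in training, and see whether the error goes away. This is a sketch of that idea, not something confirmed in this thread: `safe_val_idxs` is a hypothetical helper and the labels are toy data.

```python
import random
from collections import defaultdict


def safe_val_idxs(labels, val_frac=0.2, seed=0):
    """Pick validation indices so every class keeps >= 1 training image."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    val = []
    for idxs in by_class.values():
        if len(idxs) < 2:
            continue  # single-image classes stay entirely in training
        rng.shuffle(idxs)
        # Hold out roughly val_frac of the class, but never all of it.
        n_val = min(len(idxs) - 1, max(1, round(len(idxs) * val_frac)))
        val.extend(idxs[:n_val])
    return sorted(val)


# Toy labels (hypothetical): class "b" has a single image.
toy_labels = ["a", "a", "a", "b", "c", "c"]
val_idxs = safe_val_idxs(toy_labels)
print(val_idxs)
```

The resulting list could then be passed as val_idxs to ImageClassifierData.from_csv in place of a purely random split.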