CUDA Run error


(Gaurav Desai) #1

Hi,
I am getting the error below when I run learn.fit():

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCTensorCopy.c:65

The data is from the Kaggle Humpback Whale competition.

I am following the dogbreed and planet notebooks for this data.

Also, in the dogbreed classification, how do we specify the number of classes to the model?


(Gaurav Desai) #2

Figured out how the classes are passed in ImageClassifierData.from_csv.

But I am still not able to figure out the CUDA error.

Is it related to the default metric used in learn.fit()?

How do I pass a different accuracy metric?


(Sharwon Pius) #3

Check your GPU memory usage with the nvidia-smi command.
You can run this in the SSH terminal. If that’s occupied running your Jupyter notebook, try ‘tmux’, or start a terminal from the notebook itself (from the ‘Home’ tab).
If the GPU’s memory is completely used up, then reducing the batch size or image size will help.

Btw, the full error isn’t visible. If you can show a code snippet, it will be easier to debug.
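
One thing that may help when collecting that snippet: device-side asserts are reported asynchronously, so the Python line shown in the traceback is not necessarily the one that actually failed. A minimal sketch, assuming it is run in a freshly restarted kernel before torch (or fastai) is imported. CUDA_LAUNCH_BLOCKING is a standard CUDA environment variable; the commented-out import is only there to show the ordering.

```python
import os

# Force CUDA kernels to run synchronously so the traceback points at the
# call that really triggered the assert. This has to be set before the
# first CUDA call, so the safest place is before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set, then re-run the cells
```

Training will be slower with this set, so it is probably only worth keeping while debugging.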


(WG) #4

What are you running this on?

The latest version of PyTorch no longer supports certain older graphics cards, or it could be that you just need to restart your kernel or machine.

If you want to know what is going on with your GPU, run: watch -n 1 nvidia-smi

This will automatically poll your GPU and tell you what is going on every second.
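
If you would rather check from inside the notebook, here is a rough sketch that calls nvidia-smi once and parses the result. It assumes nvidia-smi is on the PATH; the --query-gpu and --format options are standard nvidia-smi flags, while the function names parse_gpu_memory and gpu_memory are made up for this example.

```python
import subprocess

# Ask nvidia-smi for used/total memory only, as plain comma-separated numbers (MiB).
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(text):
    """Turn lines like '1024, 11441' into [(used_mib, total_mib), ...]."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def gpu_memory():
    """Run nvidia-smi once and return one (used, total) pair per GPU, in MiB."""
    out = subprocess.check_output(QUERY).decode()
    return parse_gpu_memory(out)
```

Calling gpu_memory() before and after creating the learner should show whether the card is actually filling up.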


(Gaurav Desai) #5

I am running this on an AWS EC2 fastai instance.

The actual error at the end of the whole message is

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCTensorCopy.c:65

I will try out your suggestions.

Thanks for the response.


#6

Hi All - I’m also getting this error and have tried reducing the batch size & image size with no luck. I’m using the “whale-categorization-playground” dataset from Kaggle and get this error when I run the line below with bs=26, sz=224, arch=resnext101_64.

Thank you!

learn = ConvLearner.pretrained(arch, data, precompute=True)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-62-3bf7b4e76984> in <module>()
      1 # run first pass
----> 2 learn = ConvLearner.pretrained(arch, data, precompute=True)

~/fastai/courses/dl1/fastai/conv_learner.py in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, custom_head, precompute, **kwargs)
    106     def pretrained(cls, f, data, ps=None, xtra_fc=None, xtra_cut=0, custom_head=None, precompute=False, **kwargs):
    107         models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg,
--> 108             ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut, custom_head=custom_head)
    109         return cls(data, models, precompute, **kwargs)
    110 

~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, f, c, is_multi, is_reg, ps, xtra_fc, xtra_cut, custom_head)
     47         else: fc_layers = self.get_fc_layers()
     48         self.n_fc = len(fc_layers)
---> 49         self.fc_model = to_gpu(nn.Sequential(*fc_layers))
     50         if not custom_head: apply_init(self.fc_model, kaiming_normal)
     51         self.model = to_gpu(nn.Sequential(*(layers+fc_layers)))

~/fastai/courses/dl1/fastai/core.py in to_gpu(x, *args, **kwargs)
     43 USE_GPU=True
     44 def to_gpu(x, *args, **kwargs):
---> 45     return x.cuda(*args, **kwargs) if torch.cuda.is_available() and USE_GPU else x
     46 
     47 def noop(*args, **kwargs): return

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
    214             Module: self
    215         """
--> 216         return self._apply(lambda t: t.cuda(device))
    217 
    218     def cpu(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    144     def _apply(self, fn):
    145         for module in self.children():
--> 146             module._apply(fn)
    147 
    148         for param in self._parameters.values():

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    150                 # Variables stored in modules are graph leaves, and we don't
    151                 # want to create copy nodes, so we have to unpack the data.
--> 152                 param.data = fn(param.data)
    153                 if param._grad is not None:
    154                     param._grad.data = fn(param._grad.data)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
    214             Module: self
    215         """
--> 216         return self._apply(lambda t: t.cuda(device))
    217 
    218     def cpu(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _cuda(self, device, async)
     67         else:
     68             new_type = getattr(torch.cuda, self.__class__.__name__)
---> 69             return new_type(self.size()).copy_(self, async)
     70 
     71 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:20

(Alec Rubin) #7

I’m having the exact same issue, on the same dataset. Have you had any luck figuring this out yet?


#8

Nope, no luck so far @alecrubin (somehow it’s good to hear that someone else is experiencing the same thing, so thanks for posting!) - does anyone have any ideas for how we can resolve this?


#9

I’m getting the same error on a different dataset, where I’m using columnar data. I’m running it on a Paperspace fastai instance. The weird part is that sometimes I get a CUDA mapping error. Here is my original post.


(Alec Rubin) #10

Are you using the ImageClassifierData.from_csv method too? I think I’m going to try reading the CSV file, moving the images to labeled directories, and then using the ImageClassifierData.from_paths method to load in the data. Not the most elegant solution, but I’ll let you know how it goes.


(Feras) #11

This is broken for me too in the latest fastai.


(Alec Rubin) #12

I tried it on the whales dataset: parsing the CSV manually, moving the images into labeled directories, and using from_paths on the data loader, and everything is working. Must be an issue with the from_csv method.
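
For anyone who wants to try the same workaround, here is a rough sketch of the CSV-to-directories step. It is not the exact script I used: the function name csv_to_label_dirs is made up, the column names "Image" and "Id" are an assumption about the whale train.csv header, and it copies files rather than moving them so the originals stay in place.

```python
import csv
import os
import shutil

def csv_to_label_dirs(csv_path, src_dir, dst_dir, file_col="Image", label_col="Id"):
    """Copy each image listed in the CSV into dst_dir/<label>/ and return the count."""
    copied = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # One sub-directory per label.
            label_dir = os.path.join(dst_dir, row[label_col])
            os.makedirs(label_dir, exist_ok=True)
            shutil.copy(os.path.join(src_dir, row[file_col]), label_dir)
            copied += 1
    return copied
```

After that, dst_dir can be used as the training folder that from_paths reads from.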


#13

Which would be more efficient: reducing bs first, or sz?


(Sharwon Pius) #14

I would recommend changing the batch size ‘bs’ first, as it wouldn’t affect the performance (accuracy) of your model, but changing the image size ‘sz’ will. The right value of ‘bs’ depends on the capacity of your graphics card (how much data the GPU can hold at once, for example 12 GB) and also on your input data (each batch should have roughly the same distribution as the whole dataset, which can actually affect your accuracy). As for ‘sz’, it mostly depends on the architecture that you use.
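
To get a feel for how the two knobs scale, here is a back-of-the-envelope sketch. It only counts the float32 input batch itself; the activations inside the network take far more memory, so treat the absolute numbers as a lower bound. The function name input_batch_mib is made up for illustration.

```python
def input_batch_mib(bs, sz, channels=3, bytes_per_value=4):
    """Memory of one float32 input batch of shape (bs, channels, sz, sz), in MiB."""
    return bs * channels * sz * sz * bytes_per_value / 2**20

# The settings mentioned earlier in the thread.
base = input_batch_mib(26, 224)          # about 14.9 MiB
half_bs = input_batch_mib(13, 224)       # halving bs halves the memory
half_sz = input_batch_mib(26, 112)       # halving sz cuts it to a quarter
print(base, half_bs, half_sz)
```

So memory grows linearly with ‘bs’ and quadratically with ‘sz’, and convolutional activations scale in roughly the same way.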