CUDA Run error

Gaurav85 · March 4, 2018, 3:03pm

Hi,
I am getting the error below when I run learn.fit():

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCTensorCopy.c:65

The data is Kaggle Humpback Whale competition.

I am following the dogbreed as well as planet notebook for this data.

Also, in the dog breed classification notebook, how do we specify the number of classes to the model?

Gaurav85 · March 5, 2018, 12:36pm

Figured out how classes are passed in ImageClassifierData.from_csv.

But I am still not able to figure out the CUDA error.

Is it related to the default metric used in learn.fit()?

How do I pass a different accuracy metric?

SHAR1 · March 5, 2018, 3:59pm

Check your GPU memory usage with the nvidia-smi command.
You can run this in the SSH terminal. If that is occupied running your Jupyter notebook, try ‘tmux’, or start a terminal from your notebook itself (from the ‘Home’ tab).
If the GPU’s memory is used up completely, then reducing the batch size or image size will help.
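
For anyone who would rather check this from inside the notebook, here is a minimal sketch that shells out to nvidia-smi and parses its CSV output. The function names are my own, and the sample values are made up for illustration, not real measurements.

```python
import subprocess

# Query used/total memory per GPU as plain numbers (MiB), one line per GPU.
QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(output):
    """Parse lines like '1523, 11441' into (used_mib, total_mib) tuples."""
    usage = []
    for line in output.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage

def gpu_memory_usage():
    """Run nvidia-smi and return one (used_mib, total_mib) tuple per GPU."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_memory(output)

# Parsing a sample line (illustrative values only):
print(parse_gpu_memory("1523, 11441\n"))  # [(1523, 11441)]
```

On a machine with an NVIDIA driver installed, calling gpu_memory_usage() should return the live numbers.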

btw, the error isn’t visible. If you can show a code snippet, it will be easier to debug.
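
One thing that may help in getting a more useful trace: CUDA kernels run asynchronously, so a device-side assert is often reported at a later, unrelated call (such as a tensor copy). A common approach, sketched below, is to force synchronous launches through the CUDA_LAUNCH_BLOCKING environment variable. This assumes you can restart the kernel and run this cell before anything imports torch.

```python
import os

# Must be set before CUDA is initialised, so set it before importing torch
# (in a notebook: restart the kernel, then run this cell first).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch   # import only after the variable is set
# ...then re-run the failing cell, e.g. learn.fit(...)
```

With synchronous launches the traceback should point at the operation that actually triggered the assert, at the cost of slower execution.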

wgpubs · March 5, 2018, 6:02pm

What are you running this on?

The latest version of PyTorch no longer supports certain older graphics cards, or it could be that you just need to restart your kernel or machine.

If you want to know what is going on with your GPU, run: watch -n 1 nvidia-smi

This will automatically poll your GPU and tell you what is going on every second.

Gaurav85 · March 7, 2018, 2:32pm

I am running this on AWS EC2 fastai instance.

The actual error at the end of the full message is:

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503965122592/work/torch/lib/THC/generic/THCTensorCopy.c:65

I will try out your suggestions.

Thanks for the response.

freekeys · March 31, 2018, 8:20am

Hi All - I’m also getting this error and have tried reducing the batch size & image size with no luck. I’m using the “whale-categorization-playground” dataset from Kaggle and get this error when I run the line below with bs=26, sz=224, arch=resnext101_64.

Thank you!

learn = ConvLearner.pretrained(arch, data, precompute=True)

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-62-3bf7b4e76984> in <module>()
      1 # run first pass
----> 2 learn = ConvLearner.pretrained(arch, data, precompute=True)

~/fastai/courses/dl1/fastai/conv_learner.py in pretrained(cls, f, data, ps, xtra_fc, xtra_cut, custom_head, precompute, **kwargs)
    106     def pretrained(cls, f, data, ps=None, xtra_fc=None, xtra_cut=0, custom_head=None, precompute=False, **kwargs):
    107         models = ConvnetBuilder(f, data.c, data.is_multi, data.is_reg,
--> 108             ps=ps, xtra_fc=xtra_fc, xtra_cut=xtra_cut, custom_head=custom_head)
    109         return cls(data, models, precompute, **kwargs)
    110 

~/fastai/courses/dl1/fastai/conv_learner.py in __init__(self, f, c, is_multi, is_reg, ps, xtra_fc, xtra_cut, custom_head)
     47         else: fc_layers = self.get_fc_layers()
     48         self.n_fc = len(fc_layers)
---> 49         self.fc_model = to_gpu(nn.Sequential(*fc_layers))
     50         if not custom_head: apply_init(self.fc_model, kaiming_normal)
     51         self.model = to_gpu(nn.Sequential(*(layers+fc_layers)))

~/fastai/courses/dl1/fastai/core.py in to_gpu(x, *args, **kwargs)
     43 USE_GPU=True
     44 def to_gpu(x, *args, **kwargs):
---> 45     return x.cuda(*args, **kwargs) if torch.cuda.is_available() and USE_GPU else x
     46 
     47 def noop(*args, **kwargs): return

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
    214             Module: self
    215         """
--> 216         return self._apply(lambda t: t.cuda(device))
    217 
    218     def cpu(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    144     def _apply(self, fn):
    145         for module in self.children():
--> 146             module._apply(fn)
    147 
    148         for param in self._parameters.values():

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    150                 # Variables stored in modules are graph leaves, and we don't
    151                 # want to create copy nodes, so we have to unpack the data.
--> 152                 param.data = fn(param.data)
    153                 if param._grad is not None:
    154                     param._grad.data = fn(param._grad.data)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
    214             Module: self
    215         """
--> 216         return self._apply(lambda t: t.cuda(device))
    217 
    218     def cpu(self):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/_utils.py in _cuda(self, device, async)
     67         else:
     68             new_type = getattr(torch.cuda, self.__class__.__name__)
---> 69             return new_type(self.size()).copy_(self, async)
     70 
     71 

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCTensorCopy.c:20

alecrubin · April 2, 2018, 8:08am

I’m having the exact same issue, on the same dataset. Have you had any luck figuring this out yet?

freekeys · April 2, 2018, 8:39am

Nope, no luck so far @alecrubin (somehow it’s good to hear that someone else is experiencing the same thing, so thanks for posting!) - does anyone have any ideas for how we can resolve this?

shaun1 · April 2, 2018, 10:50am

I’m getting the same error on a different dataset, where I’m using Columnar data. I’m running it on a paperspace fastai instance. The weird part is sometimes I get a CUDA mapping error. Here is my original post.

alecrubin · April 2, 2018, 4:37pm

Are you using the ImageClassifierData.from_csv method too? I think I’m going to try reading the CSV file, moving the images to labeled directories, and then using the ImageClassifierData.from_paths method to load in the data. Not the most elegant solution, but I’ll let you know how it goes.

feribg · April 4, 2018, 6:50am

This is broken for me too in the latest fastai

alecrubin · April 5, 2018, 6:22am

I tried it on the whales dataset parsing the csv manually, moving images into directories and using from_paths on the data loader, and everything is working. Must be an issue with the from_csv method.
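
A rough sketch of that workaround, for anyone who wants to try it. The column names "Image" and "Id" and the function name are assumptions for illustration; adjust them to match your CSV. It copies rather than moves the files so the original folder stays intact.

```python
import csv
import shutil
from pathlib import Path

def csv_to_label_dirs(csv_path, src_dir, dst_dir,
                      image_col="Image", label_col="Id"):
    """Copy each image listed in the CSV into dst_dir/<label>/.

    image_col and label_col are assumed column names, not taken from fastai.
    Returns the number of files copied.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    copied = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            target = dst_dir / row[label_col]
            target.mkdir(parents=True, exist_ok=True)
            # copy2 keeps file metadata; the source image is left in place
            shutil.copy2(src_dir / row[image_col], target / row[image_col])
            copied += 1
    return copied
```

The resulting folder, with one subdirectory per label, is the kind of layout that from_paths reads.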

sayko · April 9, 2018, 5:56am

What would be more efficient: reducing bs first, or sz?

SHAR1 · April 9, 2018, 9:23am

I would recommend changing the batch size ‘bs’ first, as it wouldn’t affect the performance (accuracy) of your model, but changing the size ‘sz’ will. The right value of ‘bs’ depends on the capacity of your graphics card (how much data the GPU can hold at once, for example 12 GB) and also on your input data (each batch should have roughly the same distribution as the whole dataset, and this can actually affect your accuracy). As for ‘sz’, it mostly depends on the architecture that you use.
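
To get a feel for how the two knobs scale, here is a back-of-the-envelope sketch. It only counts the float32 input batch (bs x channels x sz x sz values), which is an assumption made for simplicity; the model's activations and weights take far more memory, but the input term shows the pattern: linear in bs, quadratic in sz.

```python
def input_batch_bytes(bs, sz, channels=3, bytes_per_value=4):
    """Bytes needed for one batch of square float32 images (input tensor only)."""
    return bs * channels * sz * sz * bytes_per_value

base = input_batch_bytes(26, 224)
print(input_batch_bytes(13, 224) / base)  # 0.5  (halving bs halves the memory)
print(input_batch_bytes(26, 112) / base)  # 0.25 (halving sz quarters it)
```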

devon.kaberna · May 1, 2018, 2:31am

When I set val_idxs = 0, I am able to call learn.fit() with no error now. Also, there is another thread on this topic.

heyrict · October 16, 2018, 10:29am

I have the same error with a different traceback. After several hours of trial and error, I solved my problem by mapping the label column to np.int64.
Hope this saves some time for anyone else with the same problem.

shahnoza · November 20, 2018, 10:25am

Hi @heyrict, I am new here. Could you please elaborate on mapping the label column to np.int64? Does it mean you changed a line of code? Which one?

Thank you!
Shahnoza

heyrict · November 20, 2018, 11:10am

Suppose you have a dataframe like this:

|X|y|
|I am a sentence|0|
|I am another sentence|1|
|…|…|
|I am yet another sentence|0|

Since you use a large dataset, the file reader may not correctly parse y as a column of 64-bit integers (np.int64). Let’s say it is parsed as np.float32 instead.

Suppose the dataframe is df, if you call df["y"], you will see an output like this:

0    0.0
1    1.0
2    0.0
... ...
n    0.0
Name: y, dtype: float32

If you use a function that takes a dataframe as a parameter, map the label column (here column y) to np.int64 and pass it on.

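import pandas as pd, numpy as np
df = pd.read_csv("foo.csv")
# cast the label column to 64-bit integers before handing the dataframe over
df["y"] = df["y"].map(np.int64)
# some_function is a placeholder for whatever learner constructor you use
learn = some_function(train=df, **kwargs)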

If you use a function that works from a folder structure, map the label column and save it before you run the learning procedure.

df = pd.read_csv("foo.csv")
df["y"] = df["y"].map(np.int64)
df.to_csv("bar.csv", index=False)
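
One caveat with the cast: a label column is often read as float because it contains missing values, and casting those to an integer type fails. Below is a slightly more defensive sketch of the same idea. The helper name to_int_labels is hypothetical, and it assumes the labels are meant to be whole numbers.

```python
import numpy as np
import pandas as pd

def to_int_labels(labels: pd.Series) -> pd.Series:
    """Cast a label column to np.int64, refusing values the cast would alter."""
    if labels.isna().any():
        raise ValueError("label column contains missing values")
    as_float = labels.astype(np.float64)
    if not (as_float == np.floor(as_float)).all():
        raise ValueError("label column contains non-integer values")
    return as_float.astype(np.int64)

# A small frame whose labels were parsed as float32, as in the example above
df = pd.DataFrame({"X": ["a", "b", "c"],
                   "y": np.array([0, 1, 0], dtype=np.float32)})
df["y"] = to_int_labels(df["y"])
print(df["y"].dtype)  # int64
```

Raising early like this points at the offending column before anything reaches the GPU.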

shahnoza · November 20, 2018, 2:45pm

@heyrict Thank you for the detailed answer!