Hi, everyone!
I'm running into a strange illegal memory access error. It happens randomly, without any regular pattern.
The code is really simple: it is PointNet for point cloud segmentation, and I don't think there is anything wrong with it.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
class InstanceSeg(nn.Module):
    def __init__(self, num_points=1024):
        super(InstanceSeg, self).__init__()
        self.num_points = num_points
        self.conv1 = nn.Conv1d(9, 64, 1)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.conv3 = nn.Conv1d(64, 64, 1)
        self.conv4 = nn.Conv1d(64, 128, 1)
        self.conv5 = nn.Conv1d(128, 1024, 1)
        self.conv6 = nn.Conv1d(1088, 512, 1)
        self.conv7 = nn.Conv1d(512, 256, 1)
        self.conv8 = nn.Conv1d(256, 128, 1)
        self.conv9 = nn.Conv1d(128, 128, 1)
        self.conv10 = nn.Conv1d(128, 2, 1)
        self.max_pool = nn.MaxPool1d(num_points)

    def forward(self, x):
        batch_size = x.size()[0]  # (x has shape (batch_size, 9, num_points))

        out = F.relu(self.conv1(x))    # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv2(out))  # (shape: (batch_size, 64, num_points))
        point_features = out

        out = F.relu(self.conv3(out))  # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv4(out))  # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv5(out))  # (shape: (batch_size, 1024, num_points))
        global_feature = self.max_pool(out)  # (shape: (batch_size, 1024, 1))

        global_feature_repeated = global_feature.repeat(1, 1, self.num_points)  # (shape: (batch_size, 1024, num_points))
        out = torch.cat([global_feature_repeated, point_features], 1)  # (shape: (batch_size, 1024+64=1088, num_points))

        out = F.relu(self.conv6(out))  # (shape: (batch_size, 512, num_points))
        out = F.relu(self.conv7(out))  # (shape: (batch_size, 256, num_points))
        out = F.relu(self.conv8(out))  # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv9(out))  # (shape: (batch_size, 128, num_points))
        out = self.conv10(out)         # (shape: (batch_size, 2, num_points))

        out = out.transpose(2, 1).contiguous()       # (shape: (batch_size, num_points, 2))
        out = F.log_softmax(out.view(-1, 2), dim=1)  # (shape: (batch_size*num_points, 2))
        out = out.view(batch_size, self.num_points, 2)  # (shape: (batch_size, num_points, 2))
        return out


Num = 0
network = InstanceSeg()
network.cuda()

while(1):
    input0 = torch.randn(32, 3, 1024).cuda()
    input1 = torch.randn(32, 3, 1024).cuda()
    input2 = torch.randn(32, 3, 1024).cuda()
    input = torch.cat((input0, input1, input2), 1)
    out = network(input)
    Num = Num + 1
    print(Num)
```
After a random number of steps, the error is raised. The error report is:
```
Traceback (most recent call last):
File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 58, in <module>
input0 = torch.randn(32, 3, 1024).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
```
When I added `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'` at the top of the script, the error report changed to this:
```
Traceback (most recent call last):
File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 64, in <module>
out = network(input)
File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 35, in forward
out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
```
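For completeness, this is how I set the flag (it has to be in place before the first CUDA call, so I put it at the very top of the script, before importing torch; running the script as `CUDA_LAUNCH_BLOCKING=1 python frustum_pointnet.py` should be equivalent):
```python
# Force synchronous CUDA kernel launches so the traceback points at the
# operation that actually failed instead of a later, unrelated line.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the environment variable is set
```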
I know that incorrect indexing operations and incorrect use of loss functions can lead to illegal memory access errors, but there is no such operation in this script.
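To be concrete, the kind of bug I mean is something like the deliberately broken snippet below (an out-of-range index fed to a GPU op), which typically surfaces as a device-side assert or illegal memory access at some later, unrelated line; nothing like that exists in my script:
```python
# Deliberately buggy illustration (NOT part of my script): an out-of-range
# class index passed to a loss on the GPU triggers an asynchronous CUDA
# error that is only reported later, at a seemingly unrelated line.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2).cuda()
targets = torch.tensor([0, 1, 1, 5]).cuda()  # 5 is invalid for 2 classes
loss = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```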
I am quite sure this error is not caused by running out of memory: only about 2 GB of GPU memory is used, and the GPU has 12 GB in total.
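For reference, this is roughly how I check the usage from inside the script (the numbers come from PyTorch's caching allocator, so `nvidia-smi` may report somewhat more):
```python
import torch

# Current and peak GPU memory held by tensors, in MB.
print(torch.cuda.memory_allocated() / 1024**2)
print(torch.cuda.max_memory_allocated() / 1024**2)
```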
This is my environment information:
```
OS: Ubuntu 16.04 LTS 64-bit
Install command: conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
GPU: Titan Xp
Driver version: 410.93
Python version: 3.6
CUDA version: cuda_9.0.176_384.81_linux
cuDNN version: cudnn-9.0-linux-x64-v7.4.2.24
PyTorch version: pytorch-1.0.1-py3.6_cuda9.0.176_cudnn7.4.2_2
```
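In case it helps, these values can also be read from inside Python, which is a quick way to double-check which CUDA/cuDNN build PyTorch actually uses:
```python
import torch

print(torch.__version__)                # PyTorch build
print(torch.version.cuda)               # CUDA version PyTorch was compiled with
print(torch.backends.cudnn.version())   # cuDNN version in use
print(torch.cuda.get_device_name(0))    # GPU model
```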
I have been stuck on this for a long time.
In fact, it is not only this project: many other projects hit similar errors on my machine.
I don't think there is anything wrong with the code, since it runs correctly for some steps. Maybe the problem is in my environment, but I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestions.