Hi, everyone!
I'm running into a strange illegal memory access error. It happens randomly, without any regular pattern.
The code is really simple: it is PointNet for point cloud segmentation, and I don't think there is anything wrong with it.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import os
class InstanceSeg(nn.Module):
    def __init__(self, num_points=1024):
        super(InstanceSeg, self).__init__()
        self.num_points = num_points
        self.conv1 = nn.Conv1d(9, 64, 1)
        self.conv2 = nn.Conv1d(64, 64, 1)
        self.conv3 = nn.Conv1d(64, 64, 1)
        self.conv4 = nn.Conv1d(64, 128, 1)
        self.conv5 = nn.Conv1d(128, 1024, 1)
        self.conv6 = nn.Conv1d(1088, 512, 1)
        self.conv7 = nn.Conv1d(512, 256, 1)
        self.conv8 = nn.Conv1d(256, 128, 1)
        self.conv9 = nn.Conv1d(128, 128, 1)
        self.conv10 = nn.Conv1d(128, 2, 1)
        self.max_pool = nn.MaxPool1d(num_points)

    def forward(self, x):
        batch_size = x.size()[0]  # (x has shape (batch_size, 9, num_points))

        out = F.relu(self.conv1(x))    # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv2(out))  # (shape: (batch_size, 64, num_points))
        point_features = out

        out = F.relu(self.conv3(out))  # (shape: (batch_size, 64, num_points))
        out = F.relu(self.conv4(out))  # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv5(out))  # (shape: (batch_size, 1024, num_points))
        global_feature = self.max_pool(out)  # (shape: (batch_size, 1024, 1))

        global_feature_repeated = global_feature.repeat(1, 1, self.num_points)  # (shape: (batch_size, 1024, num_points))
        out = torch.cat([global_feature_repeated, point_features], 1)  # (shape: (batch_size, 1024+64=1088, num_points))

        out = F.relu(self.conv6(out))  # (shape: (batch_size, 512, num_points))
        out = F.relu(self.conv7(out))  # (shape: (batch_size, 256, num_points))
        out = F.relu(self.conv8(out))  # (shape: (batch_size, 128, num_points))
        out = F.relu(self.conv9(out))  # (shape: (batch_size, 128, num_points))
        out = self.conv10(out)         # (shape: (batch_size, 2, num_points))

        out = out.transpose(2, 1).contiguous()       # (shape: (batch_size, num_points, 2))
        out = F.log_softmax(out.view(-1, 2), dim=1)  # (shape: (batch_size*num_points, 2))
        out = out.view(batch_size, self.num_points, 2)  # (shape: (batch_size, num_points, 2))
        return out


Num = 0
network = InstanceSeg()
network.cuda()

while(1):
    input0 = torch.randn(32, 3, 1024).cuda()
    input1 = torch.randn(32, 3, 1024).cuda()
    input2 = torch.randn(32, 3, 1024).cuda()
    input = torch.cat((input0, input1, input2), 1)
    out = network(input)
    Num = Num + 1
    print(Num)
```
After a random number of steps, the error is raised. The error report is:
```
Traceback (most recent call last):
File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 58, in <module>
input0 = torch.randn(32, 3, 1024).cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
```
When I added `os.environ['CUDA_LAUNCH_BLOCKING'] = '1'` at the top of the script, the error report changed to this:
```
Traceback (most recent call last):
File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 64, in <module>
out = network(input)
File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangye/Frustum-PointNet_Test/frustum_pointnet.py", line 35, in forward
out = F.relu(self.conv5(out)) # (shape: (batch_size, 1024, num_points))
File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
result = self.forward(*input, **kwargs)
File "/home/wangye/anaconda3/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 187, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
```
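For completeness, this is how I set the flag (it has to be in place before the first CUDA call, so I put it at the very top of the script, before importing torch; running the script as `CUDA_LAUNCH_BLOCKING=1 python frustum_pointnet.py` should be equivalent):
```python
# Force synchronous CUDA kernel launches so the traceback points at the
# operation that actually failed instead of a later, unrelated line.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the environment variable is set
```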
I know that incorrect indexing operations and incorrect use of loss functions can lead to illegal memory access errors, but there is no such operation in this script.
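To be concrete, the kind of bug I mean is something like the deliberately broken snippet below (an out-of-range index fed to a GPU op), which typically surfaces as a device-side assert or illegal memory access at some later, unrelated line; nothing like that exists in my script:
```python
# Deliberately buggy illustration (NOT part of my script): an out-of-range
# class index passed to a loss on the GPU triggers an asynchronous CUDA
# error that is only reported later, at a seemingly unrelated line.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 2).cuda()
targets = torch.tensor([0, 1, 1, 5]).cuda()  # 5 is invalid for 2 classes
loss = F.nll_loss(F.log_softmax(logits, dim=1), targets)
```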
I am quite sure this error is not caused by running out of memory: only about 2 GB of GPU memory is used, and the GPU has 12 GB in total.
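For reference, this is roughly how I check the usage from inside the script (the numbers come from PyTorch's caching allocator, so `nvidia-smi` may report somewhat more):
```python
import torch

# Current and peak GPU memory held by tensors, in MB.
print(torch.cuda.memory_allocated() / 1024**2)
print(torch.cuda.max_memory_allocated() / 1024**2)
```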
This is my environment information:
```
OS: Ubuntu 16.04 LTS 64-bit
Install command: conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
GPU: Titan Xp
Driver version: 410.93
Python version: 3.6
CUDA version: cuda_9.0.176_384.81_linux
cuDNN version: cudnn-9.0-linux-x64-v7.4.2.24
PyTorch version: pytorch-1.0.1-py3.6_cuda9.0.176_cudnn7.4.2_2
```
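In case it helps, these values can also be read from inside Python, which is a quick way to double-check which CUDA/cuDNN build PyTorch actually uses:
```python
import torch

print(torch.__version__)                # PyTorch build
print(torch.version.cuda)               # CUDA version PyTorch was compiled with
print(torch.backends.cudnn.version())   # cuDNN version in use
print(torch.cuda.get_device_name(0))    # GPU model
```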
I have been stuck on this for a long time.
In fact, it is not only this project: many other projects hit similar errors on my machine.
I don't think there is anything wrong with the code, since it runs correctly for some steps. Maybe the problem is in my environment, but I am not sure.
Does anyone have any idea about this situation? If more detailed information is needed, please let me know.
Thanks for any suggestions.