Windows: RuntimeError: cuda runtime error (801) & RuntimeError: Expected object of scalar type Long but got scalar type

I am using fastai v2 on a Windows system and testing with the pets notebook.

Current version:

Cuda: True
GPU: GeForce GTX 1060
Python version: 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Pytorch version: 1.3.0

I got the following error after running this code:

pets = DataBlock(types=(PILImage, Category),
    get_items=get_image_files,
    splitter=RandomSplitter(),
    #get_y=RegexLabeller(pat = r'/([^/]+)_\d+.jpg$'))
    get_y=RegexLabeller(pat = r'\\([^\\]+)_\d+.jpg$')) #For Windows

dbunch = pets.databunch(untar_data(URLs.PETS)/"images", item_tfms=RandomResizedCrop(460, min_scale=0.75), bs=32,
    batch_tfms=[*aug_transforms(size=224, max_warp=0), Normalize(*imagenet_stats)])

learn.fit_one_cycle(4)

results in the following error:

RuntimeError: cuda runtime error (801) : operation not supported at C:\w\1\s\tmp_conda_3.7_183424\conda\conda-bld\pytorch_1570818936694\work\torch/csrc/generic/StorageSharing.cpp:245

According to the PyTorch Windows notes (https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations), multiprocessing on CUDA tensors is not supported on Windows. The notes offer two alternatives, one being to set num_workers to 0, which I did:

dbunch = pets.databunch(untar_data(URLs.PETS)/"images", item_tfms=RandomResizedCrop(460, min_scale=0.75), bs=32,
    batch_tfms=[*aug_transforms(size=224, max_warp=0), Normalize(*imagenet_stats)], num_workers=0)

This then resulted in a different error when running fit_one_cycle:

RuntimeError: Expected object of scalar type Long but got scalar type Int for argument #2 'target' in call to _thnn_nll_loss_forward

The error was being generated here:

~\Anaconda3\envs\fastai_v2_1.3\lib\site-packages\torch\nn\functional.py in nll_loss(input, target, weight, size_average, ignore_index, reduce, reduction)
1837 if dim == 2:
-> 1838 ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
1839 elif dim == 4:
1840 ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

so I added a line:

if dim == 2:
    target = target.long() #new input
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)

and now the notebook works fine.

However, is there a better way to fix this?
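
A side note on the Windows-specific regex earlier in this post: matching on the file name alone avoids depending on the path separator. A minimal sketch using only the standard library; `label_from_path` is a hypothetical helper, not a fastai function, and the example paths are made up:

```python
import re
from pathlib import PureWindowsPath

# Label = everything before the trailing "_<number>.jpg" in the file name.
LABEL_PAT = re.compile(r'^(.+)_\d+\.jpg$')

def label_from_path(path):
    """Hypothetical helper: extract the class label from a pets-style file path."""
    # PureWindowsPath treats both "\" and "/" as separators, so .name works
    # for Windows-style and POSIX-style path strings alike.
    name = PureWindowsPath(str(path)).name
    match = LABEL_PAT.match(name)
    if match is None:
        raise ValueError(f"no label found in {name!r}")
    return match.group(1)

# Made-up example paths, one per separator style.
win_label = label_from_path(r"C:\data\images\great_pyrenees_173.jpg")
posix_label = label_from_path("/data/images/Abyssinian_1.jpg")
```

Because only the file name is matched, the same pattern should work unchanged on Windows and Linux.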

This is because we removed the automatic conversion to long ints in the data preprocessing pipeline, since we wondered why it was there (now we know :slight_smile: ). There’ll be a fix today.
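
Until a fixed version is installed, the cast can also live in user code rather than in an edited copy of torch/nn/functional.py. A minimal sketch in plain PyTorch; `cross_entropy_long` is a hypothetical name, and the random logits and int32 targets are stand-ins for a real batch:

```python
import torch
import torch.nn.functional as F

def cross_entropy_long(pred, targ):
    """Cross-entropy that casts integer targets to int64 before the loss call."""
    # nll_loss (used inside cross_entropy) expects int64 ("Long") class indices.
    return F.cross_entropy(pred, targ.long())

# Stand-in batch: 4 samples, 3 classes, targets stored as int32.
logits = torch.randn(4, 3)
targ_int = torch.tensor([0, 2, 1, 1], dtype=torch.int32)

loss = cross_entropy_long(logits, targ_int)
```

In fastai, a function like this could in principle be supplied as the learner's loss function; the thread does not show how `learn` was created, so that step is left out here.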

Awesome thanks!

@amritv Does the latest version work for you now? I am getting a similar error on Windows after completion of the first epoch in the pets notebook:

RuntimeError: Expected object of scalar type Int but got scalar type Long for argument #2 'other'

coming from

fastai_dev\fastai2\torch_core.py in _f(self, *args, **kwargs)
155 def _f(self, *args, **kwargs):
156 cls = self.__class__
--> 157 res = getattr(super(TensorBase, self), fn)(*args, **kwargs)
158 return cls(res) if isinstance(res,Tensor) else res
159 return _f

Hey @cudawarped, just tried it out and it worked after a

git pull

and

conda env update

However, I notice that your error was generated elsewhere.

Hey @amritv, thanks a million for checking.

conda env update

has removed the errors and it's training without issue.

@sgugger, the issue still exists in the recent release on Windows. I use torch 1.2 since 1.3 with CUDA is not available on their site. The pets example is not working:

d:\conda3\lib\site-packages\fastai2\learner.py in accumulate(self, learn)
431 def accumulate(self, learn):
432 bs = find_bs(learn.yb)
--> 433 self.total += to_detach(self.func(learn.pred, *learn.yb))*bs
434 self.count += bs
435 @property

d:\conda3\lib\site-packages\fastai2\metrics.py in error_rate(inp, targ, axis)
79 def error_rate(inp, targ, axis=-1):
80 "1 - accuracy"
---> 81 return 1 - accuracy(inp, targ, axis=axis)
82
83 # Cell

d:\conda3\lib\site-packages\fastai2\metrics.py in accuracy(inp, targ, axis)
74 "Compute accuracy with targ when pred is bs * n_classes"
75 pred,targ = flatten_check(inp.argmax(dim=axis), targ)
---> 76 return (pred == targ).float().mean()
77
78 # Cell

d:\conda3\lib\site-packages\fastai2\torch_core.py in _f(self, *args, **kwargs)
270 def _f(self, *args, **kwargs):
271 cls = self.__class__
--> 272 res = getattr(super(TensorBase, self), fn)(*args, **kwargs)
273 return retain_type(res, self)
274 return _f

RuntimeError: Expected object of scalar type Int but got scalar type Long for argument #2 'other'
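
The comparison `pred == targ` in `accuracy` is where the two integer dtypes meet. One possible user-side workaround, sketched in plain PyTorch under the assumption that casting both sides to int64 is acceptable; `accuracy_long` is a hypothetical replacement metric, not part of fastai, and the scores and targets are made up:

```python
import torch

def accuracy_long(inp, targ, axis=-1):
    """Accuracy that casts predictions and targets to int64 before comparing."""
    pred = inp.argmax(dim=axis).long().reshape(-1)
    targ = targ.long().reshape(-1)
    return (pred == targ).float().mean()

# Made-up scores for 3 samples and 2 classes, with int32 targets.
scores = torch.tensor([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
targets = torch.tensor([1, 0, 0], dtype=torch.int32)

acc = accuracy_long(scores, targets)  # predictions are [1, 0, 1], so 2 of 3 match
```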

fastai v2 has not been tested on Windows, and Windows support is not a priority right now (first, let’s finish it and document it :wink: ). I don’t think this will be dealt with until March.

Linking another answer to this thread, which allowed me to fix this specific cuda runtime error (801) in the intro notebook of fastbook.
Setting num_workers=0 in ImageDataLoaders.from_name_func made it work for me.
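
For reference, `num_workers` is ultimately an argument of PyTorch's `DataLoader`; with 0, batches are loaded in the main process and no worker processes are started. A minimal sketch in plain PyTorch with a made-up in-memory dataset (this is not the fastai call from this post):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset: 8 samples with 3 features each, plus integer labels.
ds = TensorDataset(torch.randn(8, 3), torch.arange(8))

# num_workers=0 keeps data loading in the main process, so no tensors
# have to be passed between processes.
dl = DataLoader(ds, batch_size=4, num_workers=0)

batches = list(dl)
```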
