Issue with torch.cuda.set_device(3)


#1

I am trying to run the fastai/courses/dl2/wgan.ipynb example on Paperspace.
In the second code cell,
from fastai.conv_learner import *
from fastai.dataset import *
import gzip
torch.cuda.set_device(3)

I get this error:

RuntimeError Traceback (most recent call last)
in ()
2 from fastai.dataset import *
3 import gzip
----> 4 torch.cuda.set_device(3)

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/cuda/__init__.py in set_device(device)
260 """
261 if device >= 0:
--> 262 torch._C._cuda_setDevice(device)
263
264

RuntimeError: cuda runtime error (10) : invalid device ordinal at torch/csrc/cuda/Module.cpp:32

If I comment out the line torch.cuda.set_device(3), I get a different error later in the code.

Any help on this please?


(Ramesh Sampath) #2

Unless you have four GPUs in your machine, you should comment out that line. I comment it out all the time.

Paste the error you receive later in the code.
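
For reference, the "invalid device ordinal" error appears when the requested index is not below the number of visible GPUs. A minimal sketch of guarding the call, assuming a hypothetical helper `choose_device_index` (not part of fastai or PyTorch) that falls back to device 0:

```python
def choose_device_index(requested, available):
    """Return a usable CUDA device index, or None when no GPU is visible.

    Hypothetical helper for illustration; not part of fastai or PyTorch.
    """
    if available <= 0:
        return None
    # Fall back to device 0 when the requested ordinal does not exist.
    return requested if 0 <= requested < available else 0


try:
    import torch
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
except ImportError:
    n_gpus = 0  # PyTorch is not installed in this environment

idx = choose_device_index(3, n_gpus)
if idx is not None:
    torch.cuda.set_device(idx)
print(f"visible GPUs: {n_gpus}, using device index: {idx}")
```

On a single-GPU machine this selects index 0 instead of raising, which has the same effect as commenting the line out.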


#3

Ramesh, I am using the Paperspace machine from Part 1. How can I see how many GPUs there are? The machine description only says this: GPU: 8 GB

The error later is this:
RuntimeError: Expected object of type torch.FloatTensor but found type torch.cuda.FloatTensor for argument #2 'weight'


#4

It’s a single GPU machine for training.

In any case, the later error still occurs…


(Ramesh Sampath) #5

This may mean that you have the dataset on CUDA (the GPU) but this operation was expecting it to be on the CPU. If you can provide more details on which cell in the notebook you are running, it might help. Also, the Part 2 videos are not yet released, so you might want to wait until they are before delving into the notebooks.
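
A type mismatch like this generally means that a layer's weights and the tensor passed to it are on different devices. A minimal sketch of keeping both on the same device, assuming a recent PyTorch that provides `torch.device` (newer than the version in the traceback above) and using a stand-in `nn.Conv2d` layer rather than the notebook's own models:

```python
import torch
import torch.nn as nn

# Use the GPU when one is visible, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in layer for illustration; the notebook builds DCGAN models instead.
layer = nn.Conv2d(3, 8, kernel_size=3).to(device)

# The input must be moved to the same device as the layer's weights.
x = torch.randn(1, 3, 16, 16).to(device)

out = layer(x)
print(tuple(out.shape), out.device.type)
```

If only one of `layer` or `x` is moved to the GPU, the forward pass raises a device or type mismatch error similar to the one quoted above.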


#6

I understand, I might have to wait.

A general question: is it unreasonable to expect this Part 2 example to run on Paperspace? Are the issues I am having related to that?


(Ramesh Sampath) #7

Sure. Part 2 should work on Paperspace. When you watch the videos you will see Jeremy talk about these things, and it will be much clearer.


#8

I’ve only used Paperspace for all of the second part of the course and it has worked perfectly. A few things might require a P6000, but by lowering your batch size you should always be able to run the notebooks.
Like Ramesh said, however, the videos will probably help a lot!


#9

Thanks sgugger! I guess that you chose increased storage (>50GB) to run this particular example with the ‘lsun’ dataset. Again, my intention is to run ‘fastai/courses/dl2/wgan.ipynb’, but the default storage size of the fastai machine is too small for it.


#10

The 20% version on Kaggle is only 5GB. :wink:


#11

I am now able to run the lsun dataset with Paperspace.

In the original wgan code, I changed these:

a. #torch.cuda.set_device(3)
b. #netG = DCGAN_G(sz, nz, 3, 64, 1).cuda()
#netD = DCGAN_D(sz, 3, 64, 1).cuda()

netG = DCGAN_G(sz, nz, 3, 64, 1)
netD = DCGAN_D(sz, 3, 64, 1)

The first error I now get comes later, at this cell:
netD.load_state_dict(torch.load(TMP_PATH/'netD_2.h5'))
netG.load_state_dict(torch.load(TMP_PATH/'netG_2.h5'))


FileNotFoundError Traceback (most recent call last)
in ()
----> 1 netD.load_state_dict(torch.load(TMP_PATH/'netD_2.h5'))
2 netG.load_state_dict(torch.load(TMP_PATH/'netG_2.h5'))

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/serialization.py in load(f, map_location, pickle_module)
299 (sys.version_info[0] == 3 and isinstance(f, pathlib.Path)):
300 new_fd = True
--> 301 f = open(f, 'rb')
302 try:
303 return _load(f, map_location, pickle_module)

FileNotFoundError: [Errno 2] No such file or directory: 'data/lsun/tmp/netD_2.h5'

Any help on this, please?


#12

Those lines load a saved model. Jeremy had trained for a bit and saved files named netD_2.h5 and netG_2.h5, but since you’re starting from scratch, those files don’t exist yet and the lines won’t work.


#13

@sgugger Can you please guide me on how to proceed at this point?


#14

Like I said, you should skip those lines and only use them once you’ve saved your own models.
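
One way to make that cell safe to run either way is to load weights only when the file exists. A minimal sketch, assuming a hypothetical helper `load_if_saved` and a stand-in `nn.Linear` model in place of the notebook's netD and netG:

```python
import tempfile
from pathlib import Path

import torch
import torch.nn as nn


def load_if_saved(model, path):
    """Load weights from `path` if the file exists; return whether it did.

    Hypothetical helper for illustration; not part of fastai or PyTorch.
    """
    path = Path(path)
    if not path.exists():
        return False  # nothing saved yet, keep the freshly initialised weights
    model.load_state_dict(torch.load(path, map_location="cpu"))
    return True


# Stand-in for netD; the notebook uses DCGAN_D instead.
net = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "netD_2.h5"
    loaded_before = load_if_saved(net, ckpt)  # no file yet, so this is skipped
    torch.save(net.state_dict(), ckpt)        # save your own model after training
    loaded_after = load_if_saved(net, ckpt)   # the file exists now, so it loads

print(loaded_before, loaded_after)
```

In the notebook, the same pattern would apply to TMP_PATH/'netD_2.h5' and TMP_PATH/'netG_2.h5' once you have saved those files yourself.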