Problems with torch 1.3.0 and cuda on GCP

Hi all!

I’m setting up a new GCP instance to use with fastai v2, created using the instructions at https://course.fast.ai/start_gcp.html . From the base environment, torch.cuda.is_available() returns True. However, from the fastai_dev environment (i.e. the one specified by environment.yml), I get False. I edited environment.yml to specify pytorch==1.2.0 (see the snippet below) and now cuda is available (yay!). Partly I’m posting this in case others run into the same issue, but I’m also wondering:

  • is there a better solution? (I did try updating various cuda things but ended up with an even more broken configuration, and tbh I don’t have a good understanding of how those pieces interact)
  • will using pytorch==1.2.0 be problematic later on?
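For reference, the edit that worked for me was just pinning the version in the dependencies list (a sketch; the rest of the stock environment.yml is unchanged):

dependencies:
  - pytorch==1.2.0

and then checking from within the activated environment:

python -c "import torch; print(torch.cuda.is_available())"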

Thanks for reading!

I had some issues with 1.3.0 where conda would install the CPU version of PyTorch instead. I think it’s related to cudatoolkit versions: conda chooses the CPU build rather than finding matching versions of pytorch and cudatoolkit (and perhaps torchvision too).
You can specify a build tag in the package spec to force a cuda version, e.g. pytorch=1.3=*cuda10.1*. (Since the cuda libraries are installed inside conda, this shouldn’t strictly need to match the system version, but matching it is not a bad idea, as you can otherwise get issues if things get mixed.) You could also just do pytorch=1.3=*cuda* to allow any cuda version, but then it may (though probably shouldn’t) end up with cuda 9.2. I tested this with conda install -c pytorch "pytorch=1.3=*cuda10.1*", but it should also work in environment.yml (you need the quotes on the command line but shouldn’t in the .yml, though you might need spaces between the components, like pytorch =1.3 =*cuda10.1*).
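In environment.yml form that would look something like this (untested sketch; as mentioned, the wildcard matching may behave differently there than on the command line):

dependencies:
  - pytorch=1.3=*cuda10.1*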


Thanks for the recommendation! I gave this a try (in environment.yml) and it did not work immediately, but I’ll experiment a bit more with it.

Ah, maybe the wildcard matching on build tags won’t work in an environment.yml. You should at least be able to use exact tags; conda uses them in generated environment.ymls, so this should work:

dependencies:
  - pytorch=1.3.0=py3.7_cuda10.1.243_cudnn7.6.3_0

You can do a conda search to see tags.
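For example, something like this should list the available builds for 1.3.0 from the pytorch channel:

conda search -c pytorch "pytorch=1.3.0"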

Thinking a bit more:
There should probably be an explicit requirement on cudatoolkit in the fastai_dev environment.yml. As it stands, there’s nothing in the package specifications to tell conda to prefer the CUDA version of PyTorch over the CPU version. I don’t know if you can specify an environment.yml and also specify packages to install on the command line, but if that’s possible, specifying the cudatoolkit package there might also work.
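If pinning cudatoolkit does help, it would just be an extra line in the dependencies (a sketch, assuming cuda 10.1 as above):

dependencies:
  - cudatoolkit=10.1
  - pytorch=1.3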
I think you should actually be able to install the CUDA version on a CPU-only system, so it could be added to the default environment.yml without breaking anything. All the runtime checks should still see there’s no GPU and not use CUDA; it’ll just pull in some unneeded stuff. That’s probably easier than maintaining multiple environment.ymls, at least while in development. In the end, a more efficient CPU-only environment.yml for inference deployment might be nice.
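For that eventual CPU-only file, I believe the pytorch channel’s cpuonly marker package is how CPU builds are selected for 1.3, so something like this might do (untested sketch):

dependencies:
  - cpuonly
  - pytorch=1.3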