Fastai v1 install issues thread

Managed to get it working! It turns out there were two slightly different nvidia-396 modules attached to the kernel via depmod, which showed up when I ran dkms status.
Deleting both of them, reinstalling nvidia-396, rebooting, and then repeating the usual fastai installation did the trick.
Proof:

>>> from fastai import *
>>> torch.cuda.is_available()
True
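
A slightly fuller sanity check from the shell, if useful (these are standard torch attributes, nothing fastai-specific):

$ python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"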

Thanks for your help.

For those who may run into similar problems: I found this post from 2013 on purging NVIDIA drivers to be quite helpful.

EDIT: To be more specific about what I did…

  1. I did all of this via ssh into my machine. (I have no idea what happens when you remove an NVIDIA driver if you’re physically on the console.)

  2. I ran $ sudo apt list --installed | grep nvidia, which showed only nvidia-396 and no other drivers; that did not help. I also ran $ sudo apt-get purge nvidia-*, but this didn’t seem to help either.

  3. I had previously installed CUDA 9.2 by downloading and running the linux “run script” from the NVIDIA website. So now, re-running this script (as root) with --uninstall eventually gave the message “There are no NVIDIA drivers on your system.” This still was not a complete solution.

  4. $ dkms status showed that there were two nvidia-396 entries and matching versions of something called `bbswitch`:

    $ dkms status
    bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
    bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
    nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
    nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed

So I removed the older ones:

$ sudo dkms remove bbswitch/0.8 -k 4.4.0-135-generic
$ sudo dkms remove nvidia-396/396.54 -k 4.4.0-135-generic
  5. Now the following few steps may have been overkill, but… just to be sure, I removed nvidia-396 completely:
    $ sudo apt-get remove nvidia-396
    and then rebooted.
  6. At this point there was no driver: nvidia-smi did not work, because it reported that there was no driver.
  7. Now I re-installed nvidia-396 with $ sudo apt-get install nvidia-396, and even though it gave messages about CUDA 9.0 and 9.1 and nothing about 9.2, I persisted.
  8. Rebooted. Now nvidia-smi worked.
  9. Installed fastai into a fresh conda environment as directed (roughly as sketched below). Everything works. :slight_smile:
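
For reference, a rough sketch of what that last step looked like; the cuda92 build is an assumption to match the 396 driver, so adjust it to your own setup:

$ conda create -n fastai python=3.6                 # fresh environment
$ conda activate fastai
$ conda install -c pytorch pytorch-nightly cuda92   # pytorch nightly built against CUDA 9.2
$ conda install -c fastai fastai                    # fastai v1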

awesome!

But can you be more specific about how you diagnosed it, with the commands and everything, so that others can follow your steps? Remember, most people will have no idea what you did to solve it. How come you had two nvidia modules in your kernel? What were the names in the depmod output, etc.? I hope you understand what I’m after: “Here are the steps to identify and solve this problem: …”

The article you linked to is only helpful for fixing the issue; what we need are clear steps for identifying the problem.

Thank you.

One crucial step I took from the article you linked, which I will add to the troubleshooting docs, is:

Purge, not just remove, all installed NVIDIA packages:

`$ sudo apt-get purge nvidia-*`

So this is already very useful.


Ok, I’ll edit my earlier post and add more information.


Thank you for taking the time to write out the steps, @drscotthawley! That’s excellent.

Any nvidia experts here? When I run this on my machine I too get doubles:

$ dkms status
bbswitch, 0.8, 4.15.0-34-generic, x86_64: installed
bbswitch, 0.8, 4.15.0-36-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-34-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-36-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-34-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-36-generic, x86_64: installed

Yet I don’t have this problem with pytorch. Perhaps this particular doubling is normal and not a harbinger of this issue? Anybody with nvidia knowledge who can confirm?

Is it possible that this was really the key?

sudo apt-get purge nvidia-*

I think it’s the one that will also remove dkms modules, right?

I guess it would be hard for you to tell now since the original setup is no more.

I ran into this issue as well and solved it by downgrading notebook:

conda install notebook==5.6.0

There seems to be an issue with the latest 5.7.0; see https://github.com/jupyter/notebook/issues/3946
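
If you are on a pip-based setup rather than conda, the equivalent pin (my assumption, not tested here) should be:

$ pip install "notebook==5.6.0"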


Thanks for the pointer and the explanation. However, this method assumes you want to install the full version of anaconda. I installed miniconda instead of the full anaconda, as the full version is too heavy and many of its libraries aren’t used by fastai.

For people who want to continue using miniconda and install only the libraries necessary for the class,
conda update conda
followed by
conda update -c fastai fastai
worked for me. I had created my conda env with Python 3.6, and conda only updated to 4.5.11 rather than 5.3.0 (I suppose that one is based on Python 3.7), but it was still able to update fastai to the latest stable version, 1.0.11.

You’re correct, @shoof. I made the assumption that you were on anaconda.

The dependencies were adjusted last night, so the next fastai release will fix it for you automatically.

Thank you for your feedback.

What is it showing now, after you purged and reinstalled the nvidia drivers and rebooted, @drscotthawley?

I have added a few steps in this forum post which might be useful:
https://forums.fast.ai/t/platform-gcp/27375/58?u=nikhil.ikhar

I have integrated your input here, @drscotthawley:
https://docs-dev.fast.ai/troubleshoot.html#correctly-configured-nvidia-drivers
Please let me know if there is anything else that would be useful to add.

As of this morning, without having rebooted (to my knowledge; sometimes my workplace has random power outages at night), I’m seeing new duplicate versions of both bbswitch and nvidia-396:

$ dkms status

bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-138-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-138-generic, x86_64: installed

…and yet the fastai & pytorch libraries are still working fine with no CUDA errors. So it seems that having those duplicates was not the key problem; that part of the troubleshooting instructions could probably be removed.

Perhaps it was the sudo apt-get purge nvidia-* that actually was responsible for the fix. I apologize for not cataloguing my actions more systematically as I was doing them; I was just trying to get something working.

The general strategy for users is to remove everything NVIDIA- and CUDA-related (perhaps even downloading the various NVIDIA & CUDA installation scripts and running them with the --uninstall option) until utilities such as nvidia-smi no longer work. (This may require rebooting one or more times.) At that point, it is safe to install the nvidia-xxx driver again, reboot, and install pytorch and fastai v1.
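
A rough sketch of that sequence on an Ubuntu machine with an apt-managed driver (manually installed drivers additionally need their own runfile executed with --uninstall, as described above):

$ sudo apt-get purge nvidia-*        # purge, not just remove, all NVIDIA packages
$ dkms status                        # check for leftover nvidia kernel modules
$ nvidia-smi                         # should now fail, complaining there is no driver; if not, keep cleaning
$ sudo apt-get install nvidia-396    # reinstall the driver
$ sudo reboot
$ nvidia-smi                         # should work again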

Hello! I am having trouble trying to run the Practical Deep Learning Class development environment setup in Paperspace. I was going to use this post to help me, but I received an error message when I clicked on “main” below Paperspace in the thread. The error message: “Sorry, you don’t have access to that topic!” Can you please give me access / open this up to the public? Thank you!

That’s good to know. I wonder if someone here knows why there are double entries with a slight change. Perhaps those are just the last 2 kernel builds?
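
A quick way to check that hypothesis (a sketch, assuming a stock Ubuntu kernel setup):

$ uname -r                        # the kernel you are running right now
$ dpkg --list | grep linux-image  # the kernels currently installed
$ dkms status                     # DKMS builds each module once per installed kernel, which would explain the doubles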

Perhaps it was the sudo apt-get purge nvidia-* that actually was responsible for the fix. I apologize for not cataloguing my actions more systematically as I was doing them; I was just trying to get something working.

Most likely. I will leave the dkms section for now and adjust it later once I read more success/failure reports, since the article you originally posted mentioned removing dkms modules manually. I don’t think sudo apt-get purge nvidia-* will be able to do that for things it didn’t create, and many people install NVIDIA drivers manually; it was the recommended way until recently, at least.

The general strategy for users is to remove everything NVIDIA- and CUDA-related (perhaps even downloading the various NVIDIA & CUDA installation scripts and running them with the --uninstall option) until utilities such as nvidia-smi no longer work. (This may require rebooting one or more times.) At that point, it is safe to install the nvidia-xxx driver again, reboot, and install pytorch and fastai v1.

That one is quite murky too. If you reboot before installing new modules, I think Ubuntu, at least, may switch to the nouveau driver automatically, and then you have to clean that up too. So perhaps cleaning up and reinstalling without a reboot is a safer approach. Let’s see.

Thank you for your feedback and suggestions so far, @drscotthawley.

My apologies, I linked to a section that is open only to class attendees. It has been fixed now. If you find good posts that are specific to Paperspace, please link them from the first post; anybody can edit it. Thank you. Just make sure to pay attention that they discuss fastai 1.0; most pre-fall-2018 posts talk about fastai 0.7, which is not what you want. If you can’t find any, please start a new thread instead.


Mates, it is not strictly an “install issue”, so I posted in another section, but my fastai v1 (successful) installation refuses to work properly.

This is the thread: Python refuses abiding by SIGKILL

Maybe the maintainers could take a look.

Hi @stas ,

In the installation instruction you write:

However, note that you will most likely need a 396.xx+ driver for pytorch built with cuda92. For older drivers you will probably need to install pytorch with cuda90 or even earlier.

Is the 396.xx driver a hard requirement for PyTorch 1.0 and/or fastai-v1 (used for the latest course)?

My machine currently has an NVIDIA-SMI 367.106 driver with CUDA 8.0. Can I install pytorch-nightly with the latest fastai using my old 367 driver?

I didn’t want to mess up my current working driver setup, so I paused updating the driver until I got some guidance. I have historically wasted days getting the drivers to install correctly and creating separate working environments for fastai and keras.

Any guidance on whether the 396 driver is a hard requirement is appreciated.

I did a bit of googling, and these seem to be the requirements for the three CUDA versions pytorch currently supports in binary form:

CUDA 8.0 requires NVIDIA 361+
CUDA 9.0 requires NVIDIA 384+
CUDA 9.2 requires NVIDIA 396+

So yes, you should be able to use it. Just change the install instruction to:

conda install -c pytorch pytorch-nightly cuda80

or:

pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu80/torch_nightly.html
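
And to double-check afterwards that the driver/toolkit combination is picked up (a sketch, not part of the official install instructions):

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader     # should print 367.xx or newer
$ python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"   # expect 8.0 and True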

update: I documented the requirements here.

update: I found the authoritative source with all of it, including cuda 10!

| CUDA Toolkit | Linux x86_64 Driver Version | Windows x86_64 Driver Version |
|---|---|---|
| CUDA 10.0.130 | >= 410.48 | >= 411.31 |
| CUDA 9.2 (9.2.148 Update 1) | >= 396.37 | >= 398.26 |
| CUDA 9.2 (9.2.88) | >= 396.26 | >= 397.44 |
| CUDA 9.1 (9.1.85) | >= 390.46 | >= 391.29 |
| CUDA 9.0 (9.0.76) | >= 384.81 | >= 385.54 |
| CUDA 8.0 (8.0.61 GA2) | >= 375.26 | >= 376.51 |
| CUDA 8.0 (8.0.44) | >= 367.48 | >= 369.30 |
| CUDA 7.5 (7.5.16) | >= 352.31 | >= 353.66 |
| CUDA 7.0 (7.0.28) | >= 346.46 | >= 347.62 |

Perfect, thanks!

Thank you for the installation troubleshooting documentation. This is the best doc I have come across for understanding GPU driver/CUDA/cuDNN/PyTorch/fastai installation requirements. 10x better than NVIDIA’s own documentation :slight_smile:


Hello, has anyone succeeded in installing and using v1 in Colab? I tried to install the dependencies using

!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html
!pip install fastai

At first I thought this would work because I could import fastai. But when I tried to do some training, this error appeared:

RuntimeError: DataLoader worker (pid 240) is killed by signal: Bus error.

This was weird, because when I looked up this error it seemed to be caused by limited shared memory. But when I checked the GPU memory it was barely used (<1 GB out of 12 GB). Is there something I need to do first in order to use it in Colab?

This has been asked and answered many times, here for instance. The short answer is that Colab doesn’t support pytorch v1 yet, so it doesn’t support fastai v1.