I think you need to update your drivers.
I would also suggest trying a reboot; that can sometimes fix things by itself.
What @wdhorton said. And as I said in the other thread, unfortunately we don't know how to help you, as it's a pytorch issue. So I have been collecting notes from people who overcame that issue, but the pytorch forums are really the best place to ask if what we collected so far wasn't helpful.
Really, you should be able to run `import torch` and only then try fastai. I hope this makes sense.
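A minimal sketch of that check, assuming you run it inside the same conda environment where fastai is installed. The helper name `check_torch` is made up for illustration; it only uses `importlib` from the standard library plus `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib
import importlib.util


def check_torch():
    """Report whether torch can be imported and whether it sees a CUDA device."""
    report = {"torch_installed": False, "torch_version": None, "cuda_available": None}
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in this environment, nothing more to check
        return report
    torch = importlib.import_module("torch")
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_torch())
```

If `cuda_available` comes back `False` here, the problem is below fastai (pytorch or the NVIDIA driver), so there is no point debugging fastai yet.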
I have encountered this issue myself and indeed reinstalling the nvidia drivers (not cuda drivers) and rebooting solved the problem.
Remember, pytorch-1 doesn't care what cuda drivers you have installed on your system. It fetches its own, so you can do anything you want system-wide.
And as you debug this please take notes and then share with everybody how you resolved this issue. Thank you.
Thanks for the suggestions @wdhorton .
Drivers are already the "396.xx+ driver" as required in the Troubleshooting guide.
Rebooted, but error persists.
This is not enough. You need to check that you don't have other nvidia drivers installed (perhaps from a manual install). That's why the recommendation is to wipe your system clean first. Check that no nvidia driver is left (not just through the package manager, but using locate or whatever else you use to search your filesystem). Make sure nothing is left. Then install a single nvidia driver, reboot, and check again.
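As a rough sketch of the "search your filesystem" step, something like the following could list leftover files. The default root `/lib/modules` and the simple "name contains nvidia" match are assumptions for illustration, not an official procedure; `locate nvidia` does a similar job across the whole system.

```python
import os


def find_nvidia_files(root="/lib/modules"):
    """Walk `root` and return paths of files whose name contains 'nvidia'."""
    hits = []
    # os.walk silently yields nothing if `root` does not exist
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "nvidia" in name.lower():
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    for path in find_nvidia_files():
        print(path)
```

After a clean wipe the list should be empty; after installing a single driver, it should show files for that one driver only.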
So this is a PyTorch nightly-build issue, to be specific.
Will FastAI work with a previous "main" release version of PyTorch, or does it require the nightly build?
Because as I said, I had no problem with PyTorch (0.4.0), and I could just go back to using that if Fast.ai will "talk to it" appropriately.
pytorch nightly-build is pre-1.0 - it's required by fastai-1.x
the other approach is that you can build pytorch from source - then it'll use your system's cuda. See pytorch.org for details; it's pretty straightforward. https://github.com/pytorch/pytorch#from-source
Hi,
I am just starting the first lesson and I have a problem. The install seemed to go ok in that it can import fastai. But then it tries to import the images using this path
/home/pete/.fastai/data/dogscats
With that path it cannot find a "valid" directory.
I assume that this is a simple problem due to my unfamiliarity with the software. Can anyone help?
Thanks
Just had a weird fallout on my conda env switching/activating after updating to a recent conda version. It suggested that I need to remove:
export PATH="/home/stas/anaconda3/envs/pytorch-dev/bin:$PATH"
from .bashrc
since it was only used prior to conda 4.4, which I did, and then it fell apart completely not letting me do:
conda activate <other-env>
complaining with:
CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with
$ echo ". /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh" >> ~/.bashrc
or, for all users, enable conda with
$ sudo ln -s /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh /etc/profile.d/conda.sh
The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH. To do so, run
$ conda activate
in your terminal, or to put the base environment on PATH permanently, run
$ echo "conda activate" >> ~/.bashrc
Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file. You should manually remove the line that looks like
export PATH="/home/stas/anaconda3/envs/pytorch-dev/bin:$PATH"
^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^
Which was just plain wrong, since after I added what it recommended things got even worse. After some research I found that the correct change was to add this:
# get conda env switching to work
. /home/stas/anaconda3/etc/profile.d/conda.sh
to .bashrc instead. Of course, yours would be adjusted for your /home/user.
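To check a .bashrc for the two things discussed above (a leftover pre-4.4 `export PATH=...` line, and which `conda.sh` is being sourced), here is a small sketch. The function name `audit_bashrc` and the regular expressions are my own assumptions; treating a `conda.sh` under `/envs/` as suspicious is based only on the experience described in this post.

```python
import os
import re

# Pre-conda-4.4 style line, e.g. export PATH="/home/user/anaconda3/.../bin:$PATH"
STALE_PATH = re.compile(r"^\s*export\s+PATH=.*conda[^:]*/bin", re.IGNORECASE)
# A line that sources conda's shell hook, e.g. ". /home/user/anaconda3/etc/profile.d/conda.sh"
SOURCES_CONDA = re.compile(r"^\s*(\.|source)\s+\S*/etc/profile\.d/conda\.sh\s*$")


def audit_bashrc(text):
    """Return the conda-related lines found in the given .bashrc contents."""
    lines = text.splitlines()
    stale = [ln.strip() for ln in lines if STALE_PATH.match(ln)]
    sourced = [ln.strip() for ln in lines if SOURCES_CONDA.match(ln)]
    # conda.sh taken from inside an env rather than the base install
    suspicious = [ln for ln in sourced if "/envs/" in ln]
    return {
        "stale_path_lines": stale,
        "conda_sh_lines": sourced,
        "suspicious_env_paths": suspicious,
    }


if __name__ == "__main__":
    bashrc = os.path.expanduser("~/.bashrc")
    if os.path.exists(bashrc):
        with open(bashrc, errors="replace") as fh:
            print(audit_bashrc(fh.read()))
```

In the situation above, the goal would be an empty `stale_path_lines`, exactly one entry in `conda_sh_lines`, and nothing in `suspicious_env_paths`.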
Managed to get it working! Turns out there were two slightly different modules for nvidia-396 attached to the kernel via depmod that showed up when I ran dkms status.
Deleting both of them and then reinstalling nvidia-396, and then rebooting, (and then repeating the usual fastai installation) did the trick.
Proof:
>>> from fastai import *
>>> torch.cuda.is_available()
True
Thanks for your help.
For those who may run into similar problems: I found this post from 2013 on purging NVIDIA drivers to be quite helpful.
EDIT: To be more specific on what I did...
- I did all of this via ssh into my machine. (I have no idea what happens when you remove an NVIDIA driver if you're physically on the console.)
- I ran
  $ sudo apt list --installed | grep nvidia
  This was only showing nvidia-396 and no other drivers. This did not help. Also I ran
  $ sudo apt-get purge nvidia-*
  but this didn't seem to help.
- I had previously installed CUDA 9.2 by downloading and running the linux "run script" from the NVIDIA website. So now, re-running this script (as root) with --uninstall eventually gave the message "There are no NVIDIA drivers on your system." This still was not a complete solution.
- $ dkms status
  showed that there were two nvidia-396's and matching versions of something called bbswitch:
  $ dkms status
  bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
  bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
  nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
  nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
  So I removed the older ones
  $ sudo dkms remove bbswitch/0.8 -k 4.4.0-135-generic
  $ sudo dkms remove nvidia-396/396.54 -k 4.4.0-135-generic
- Now the following few steps may have been overkill, but... Just to be sure, I removed nvidia-396 completely:
  $ sudo apt-get remove nvidia-396
  and then rebooted.
- At this point, there was no driver. nvidia-smi did not work, because it said there was no driver.
- Now I re-installed nvidia-396:
  $ sudo apt-get install nvidia-396
  and even though it gave messages about cuda 9.0 and 9.1 and nothing about 9.2, I persisted.
- Rebooted. Now nvidia-smi worked.
- Installed fast.ai into a fresh conda environment as directed. Everything works.
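For anyone who wants to spot this kind of duplicate quickly, here is a sketch that groups `dkms status` output by module. It assumes the line format shown above (`module, version, kernel, arch: installed`); other dkms versions may print a different layout, and the function name `kernels_per_module` is made up for illustration. Note that later posts in this thread suggest one entry per installed kernel can be perfectly normal.

```python
from collections import defaultdict


def kernels_per_module(dkms_output):
    """Map (module, version) to the list of kernels it is installed for."""
    table = defaultdict(list)
    for line in dkms_output.splitlines():
        line = line.strip()
        if not line or line.startswith("$"):
            continue  # skip blanks and a pasted shell prompt line
        fields, _, state = line.partition(":")
        parts = [p.strip() for p in fields.split(",")]
        if len(parts) < 3 or state.strip() != "installed":
            continue  # not in the assumed "module, version, kernel, arch: installed" form
        module, version, kernel = parts[0], parts[1], parts[2]
        table[(module, version)].append(kernel)
    return dict(table)


sample = """$ dkms status
bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
"""

for (module, version), kernels in kernels_per_module(sample).items():
    if len(kernels) > 1:
        print(module, version, "is installed for several kernels:", kernels)
```

To run it on a real machine, pass in the captured text of `dkms status` instead of `sample`.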
awesome!
but can you be more specific about how you diagnosed it, with commands and everything, so that others can follow your steps? Remember, most people will have no idea what you did to solve it. How come you had two nvidia modules in your kernel? What were the names in the depmod output, etc.? I hope you understand what I'm after: "Here are the steps to identify and solve this problem: ..."
The article you linked to is only helpful for fixing the issue; what we need are clear steps for identifying the problem.
Thank you.
One crucial step I took from the article you linked, to add to the troubleshooting docs, is:
Purge, not just remove, all installed NVIDIA packages:
$ sudo apt-get purge nvidia-*
So this is already very useful.
Ok, I'll edit my earlier post and add more information.
Thank you for taking the time to write out the steps, @drscotthawley! That's excellent.
Any nvidia experts here? When I run this on my machine I too get doubles:
$ dkms status
bbswitch, 0.8, 4.15.0-34-generic, x86_64: installed
bbswitch, 0.8, 4.15.0-36-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-34-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-36-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-34-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-36-generic, x86_64: installed
yet, I don't have this problem with pytorch. Perhaps this particular double is normal and not a harbinger of this issue? Anybody with nvidia knowledge to support?
Is it possible that this was really the key?
sudo apt-get purge nvidia-*
I think it's the one that will also remove dkms modules, right?
I guess it would be hard for you to tell now since the original setup is no more.
I ran into this issue as well, and solved it by downgrading notebook:
conda install notebook==5.6.0
There seems to be an issue with the latest 5.7.0; see https://github.com/jupyter/notebook/issues/3946
Thanks for the pointer and the explanation. However, this method assumes you want to install the full version of anaconda. I installed miniconda instead of the full anaconda as the full version is too heavy and many libraries aren't used in fastai.
For people who want to continue using miniconda and install only the libraries necessary for the class,
conda update conda
then
conda update -c fastai fastai
worked for me. I installed my conda env using Python 3.6 and it only updated to 4.5.11 instead of 5.3.0 (I suppose it's based on Python 3.7), but it was able to update fastai to the latest stable version of 1.0.11
you're correct, @shoof. I made an assumption you were on anaconda.
The dependencies were adjusted last night, so the next fastai release will fix it for you automatically.
Thank you for your feedback.
What is it showing now, after you purged, reinstalled the nvidia drivers and rebooted, @drscotthawley?
I have added a few steps in this forum which might be useful.
I have integrated your input here, @drscotthawley:
https://docs-dev.fast.ai/troubleshoot.html#correctly-configured-nvidia-drivers
Please, let me know if there is anything else that would be useful to add.
As of this morning, without having rebooted (to my knowledge; sometimes my workplace has random power outages at night), I'm seeing new duplicate versions of both bbswitch and nvidia-396
$ dkms status
bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-138-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-138-generic, x86_64: installed
...and yet the fastai & pytorch libraries are still working fine with no CUDA errors. So, it seems that having those duplicates was not the key problem. That part of the troubleshooting instructions could probably be removed.
Perhaps it was the sudo apt-get purge nvidia-* that actually was responsible for the fix. I apologize for not cataloguing my actions more systematically as I was doing them; I was just trying to get something working.
The general strategy for users is to remove everything NVIDIA- and CUDA-related (including perhaps even downloading the various NVIDIA & CUDA installation scripts and running them with the --uninstall option) until utilities such as nvidia-smi no longer work. (This may include rebooting one or more times.) At that point, it is safe to install the nvidia-xxx driver again, reboot, and install pytorch and fastai v1.
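A small sketch for the "does nvidia-smi still work" check used in this strategy. The helper name `nvidia_smi_works` is hypothetical; it only relies on the standard library and on `nvidia-smi` exiting with a non-zero status when it cannot talk to a driver, which is an assumption on my part.

```python
import shutil
import subprocess


def nvidia_smi_works(timeout=30):
    """Return True if nvidia-smi is on PATH and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # utility not installed (or not on PATH)
    try:
        result = subprocess.run(
            [exe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("nvidia-smi works:", nvidia_smi_works())
```

During the cleanup phase you would expect this to report `False`; after reinstalling the single driver and rebooting, it should report `True` before you move on to installing pytorch and fastai.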