Fastai v1 install issues thread


(William Horton) #124

I think you need to update your drivers


(William Horton) #125

I would also suggest trying a reboot; that can sometimes fix things by itself.


(Stas Bekman) #126

What @wdhorton said. And as I said in the other thread, unfortunately we don’t know how to help you, as it’s a pytorch issue. I have been collecting notes from people who overcame that issue, but the pytorch forums are really the best place to ask if what we have collected so far wasn’t helpful.

Really, you should be able to run import torch successfully first, and only then try fastai. I hope this makes sense.

I have encountered this issue myself and indeed reinstalling the nvidia drivers (not cuda drivers) and rebooting solved the problem.

Remember, pytorch-1 doesn’t care what cuda drivers you have installed on your system. It fetches its own, so you can do anything you want system-wide.
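To make the "import torch first" check concrete, here is a minimal sketch in Python. The helper name `check_torch` is made up for illustration; the only pytorch call it relies on is `torch.cuda.is_available()`, which also appears later in this thread.

```python
import importlib


def check_torch() -> dict:
    """Report whether torch imports and whether it can see a CUDA device."""
    report = {"torch_imported": False, "torch_version": None, "cuda_available": False}
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        # torch itself is missing or broken: sort that out before trying fastai
        return report
    report["torch_imported"] = True
    report["torch_version"] = getattr(torch, "__version__", None)
    # is_available() is False when pytorch cannot find a usable driver/GPU
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    print(check_torch())
```

If `torch_imported` is true but `cuda_available` is false, the problem is most likely below pytorch (driver level) rather than in fastai.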

And as you debug this please take notes and then share with everybody how you resolved this issue. Thank you.


(Scott H Hawley) #127

Thanks for the suggestions @wdhorton .
Drivers are already the “396.xx+ driver” as required in the Troubleshooting guide.
Rebooted, but error persists.


(Stas Bekman) #128

This is not enough. You need to check that you don’t have other nvidia drivers installed (perhaps from a manual install). That’s why the recommendation is to wipe your system clean first. Check that no nvidia driver is left (not just through the package manager, but using locate or whatever else you use to search your filesystem). Make sure nothing is left. Then install a single nvidia driver, reboot, and check again.
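As a rough illustration of the "search your filesystem" step, here is a small Python sketch. It is not an official procedure: the function name `find_driver_files`, the default pattern `nvidia*.ko*`, and the `/lib/modules` search root are all assumptions about where leftover kernel modules might live on a typical Linux box.

```python
import fnmatch
import os
from typing import Iterable, List


def find_driver_files(roots: Iterable[str], pattern: str = "nvidia*.ko*") -> List[str]:
    """Walk each root and collect files whose names match the pattern."""
    matches = []
    for root in roots:
        # os.walk silently skips directories it cannot read
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if fnmatch.fnmatch(name, pattern):
                    matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # /lib/modules is an assumed location, adjust for your distribution
    for path in find_driver_files(["/lib/modules"]):
        print(path)
```

An empty result after purging suggests nothing matching that pattern is left under the roots you searched; it does not prove that nothing is left elsewhere.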


(Scott H Hawley) #129

So this is a PyTorch nightly-build issue, to be specific.
Will FastAI work with a previous “main” release version of PyTorch, or does it require the nightly build?

Because as I said, I had no problem with PyTorch (0.4.0), and I could just go back to using that if Fast.ai will ‘talk to it’ appropriately.


(Stas Bekman) #130

The pytorch nightly build is pre-1.0, and it’s required by fastai-1.x.

The other approach is to build pytorch from source; then it’ll use your system’s cuda. See pytorch.org for details, it’s pretty straightforward: https://github.com/pytorch/pytorch#from-source


(p wills) #131

Hi,
I am just starting the first lesson and I have a problem. The install seemed to go ok in that it can import fastai. But then it tries to import the images using this path
/home/pete/.fastai/data/dogscats
With that path it cannot find a “valid” directory.

I assume that this is a simple problem due to my unfamiliarity with the software. Can anyone help?
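One way to narrow this kind of error down is to list which expected folders are actually missing under the data path. The sketch below is only a diagnostic: `missing_subdirs` is a hypothetical helper, `valid` comes from the error described above, and `train` is an assumption about the layout the lesson expects.

```python
from pathlib import Path
from typing import Iterable, List


def missing_subdirs(data_path: str, expected: Iterable[str] = ("train", "valid")) -> List[str]:
    """Return the expected sub-directory names that do not exist under data_path."""
    base = Path(data_path).expanduser()
    return [name for name in expected if not (base / name).is_dir()]


if __name__ == "__main__":
    # path taken from the post above
    print(missing_subdirs("/home/pete/.fastai/data/dogscats"))
```

If the list comes back with every name in it, the dataset was probably never downloaded or unpacked to that location.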

Thanks


(Stas Bekman) #132

Just had a weird fallout on my conda env switching/activating after updating to a recent conda version. It suggested that I need to remove:

export PATH="/home/stas/anaconda3/envs/pytorch-dev/bin:$PATH"

from .bashrc since it was only used prior to conda 4.4, which I did, and then it fell apart completely not letting me do:

conda activate <other-env>

complaining with:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with

    $ echo ". /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh" >> ~/.bashrc

or, for all users, enable conda with

    $ sudo ln -s /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh /etc/profile.d/conda.sh

The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH.  To do so, run

    $ conda activate

in your terminal, or to put the base environment on PATH permanently, run

    $ echo "conda activate" >> ~/.bashrc

Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file.  You should manually remove the line that looks like

    export PATH="/home/stas/anaconda3/envs/pytorch-dev/bin:$PATH"

^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

Which was just plain wrong, since after I added what it recommended things got even worse. After some research I found that the correct change was to add this:

# get the conda env switching to work
. /home/stas/anaconda3/etc/profile.d/conda.sh

to .bashrc instead. Of course yours would be adjusted for your /home/user.
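For anyone wanting to check their own .bashrc for the same mistake, here is a small Python sketch. It flags lines that source conda.sh from inside an envs/ directory, which is the variant that made things worse above. The helper name `env_scoped_conda_sh` and the regular expression are illustrative assumptions, and quoted paths are not handled.

```python
import re
from typing import List

# Matches ". /some/path/etc/profile.d/conda.sh" or "source /some/path/.../conda.sh"
_SOURCE_LINE = re.compile(r"^\s*(?:\.|source)\s+(\S*etc/profile\.d/conda\.sh)\s*$")


def env_scoped_conda_sh(bashrc_text: str) -> List[str]:
    """Return conda.sh paths that are sourced from inside an 'envs/' directory."""
    suspicious = []
    for line in bashrc_text.splitlines():
        match = _SOURCE_LINE.match(line)
        if match and "/envs/" in match.group(1):
            suspicious.append(match.group(1))
    return suspicious


EXAMPLE = """\
. /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh
. /home/stas/anaconda3/etc/profile.d/conda.sh
"""

for path in env_scoped_conda_sh(EXAMPLE):
    print("sourced from an env rather than the base install:", path)
```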


(Scott H Hawley) #133

Managed to get it working! Turns out there were two slightly different modules for nvidia-396 attached to the kernel via depmod that showed up when I ran dkms status.
Deleting both of them and then reinstalling nvidia-396, and then rebooting, (and then repeating the usual fastai installation) did the trick.
Proof:

>>> from fastai import *
>>> torch.cuda.is_available()
True

Thanks for your help.

For those who may run into similar problems: I found this post from 2013 on purging NVIDIA drivers to be quite helpful.

EDIT: To be more specific on what I did…

  1. I did all of this via ssh into my machine. (I have no idea what happens when you remove an NVIDIA driver if you’re physically on the console.)

  2. I ran $ sudo apt list --installed | grep nvidia. This showed only nvidia-396 and no other drivers, so it did not help. I also ran $ sudo apt-get purge nvidia-* but this didn’t seem to help either.

  3. I had previously installed CUDA 9.2 by downloading and running the linux “run script” from the NVIDIA website. So now, re-running this script (as root) with --uninstall eventually gave the message “There are no NVIDIA drivers on your system.” This still was not a complete solution.

  4. $ dkms status showed that there were two nvidia-396’s and matching versions of something called bbswitch:

    $ dkms status
    bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
    bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
    nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
    nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed

So I removed the older ones:

    $ sudo dkms remove bbswitch/0.8 -k 4.4.0-135-generic
    $ sudo dkms remove nvidia-396/396.54 -k 4.4.0-135-generic

  5. Now the following few steps may have been overkill, but… Just to be sure, I removed nvidia-396 completely:

    $ sudo apt-get remove nvidia-396

    and then rebooted.

  6. At this point, there was no driver. nvidia-smi did not work, because it said there was no driver.

  7. Now I re-installed nvidia-396: $ sudo apt-get install nvidia-396 and even though it gave messages about cuda 9.0 and 9.1 and nothing about 9.2, I persisted.

  8. Rebooted. Now nvidia-smi worked.

  9. Installed fast.ai into a fresh conda environment as directed. Everything works. :slight_smile:
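For reference, the "same module listed for several kernels" pattern in the dkms status output can be spotted with a few lines of Python. This is just a sketch: `kernels_per_module` is a made-up helper, and it assumes every line has the "name, version, kernel, arch: status" shape shown above. It only reports the duplicates; it says nothing about whether they are harmful.

```python
from collections import defaultdict
from typing import Dict, List


def kernels_per_module(dkms_output: str) -> Dict[str, List[str]]:
    """Map each DKMS module name to the kernel versions it is installed for."""
    kernels = defaultdict(list)
    for line in dkms_output.splitlines():
        # expected shape: "name, version, kernel, arch: status"
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 4:
            continue  # skip blank lines and anything in another format
        kernels[fields[0]].append(fields[2])
    return dict(kernels)


SAMPLE = """\
bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
"""

for module, versions in kernels_per_module(SAMPLE).items():
    if len(versions) > 1:
        print(module, "is built for several kernels:", ", ".join(versions))
```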

(Stas Bekman) #134

awesome!

But can you be more specific about how you diagnosed it, with commands and everything, so that others can follow your steps? Remember, most people will have no idea what you did to solve it. How come you had two nvidia modules in your kernel? What were the names in the depmod output, etc.? I hope you understand what I’m after: Here are the steps to identify and solve this problem: …

The article you linked to is only helpful for fixing the issue; what we need is clear steps for identifying the problem.

Thank you.

One crucial step from the article you linked that I have added to the troubleshooting docs is:

Purge, not just remove, all installed NVIDIA packages:

    $ sudo apt-get purge nvidia-*

So this is already very useful.


(Scott H Hawley) #135

Ok, I’ll edit my earlier post and add more information.


(Stas Bekman) #136

Thank you for taking the time to write out the steps, @drscotthawley! that’s excellent.

Any nvidia experts here? When I run this on my machine I too get doubles:

$ dkms status
bbswitch, 0.8, 4.15.0-34-generic, x86_64: installed
bbswitch, 0.8, 4.15.0-36-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-34-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-36-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-34-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-36-generic, x86_64: installed

Yet I don’t have this problem with pytorch. Perhaps this particular double is normal and not a harbinger of this issue? Anybody with nvidia knowledge to support?

Is it possible that this was really the key?

sudo apt-get purge nvidia-*

I think it’s the one that will also remove dkms modules, right?

I guess it would be hard for you to tell now since the original setup is no more.


(Cesar) #137

I ran into this issue as well, solved it by downgrading notebook:

conda install notebook==5.6.0

There seems to be an issue with the latest 5.7.0; see https://github.com/jupyter/notebook/issues/3946
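To check whether you are on the affected release before downgrading, something like the following Python sketch could help. The helper names are made up for illustration; `importlib.metadata` needs Python 3.8 or newer, and 5.7.0 is the only release this post names as affected.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '5.7.0' into (5, 7, 0); a non-numeric part ends the parse."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_notebook_version() -> Optional[str]:
    """Return the installed notebook version, or None if it is not installed."""
    try:
        return metadata.version("notebook")
    except metadata.PackageNotFoundError:
        return None


def is_reported_broken(version: str) -> bool:
    # 5.7.0 is the only release named in the post above
    return parse_version(version) == (5, 7, 0)


if __name__ == "__main__":
    current = installed_notebook_version()
    if current is not None and is_reported_broken(current):
        print("notebook", current, "is the release reported as problematic")
```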


(Xu Fei) #138

Thanks for the pointer and the explanation. However, this method assumes you want to install the full version of anaconda. I installed miniconda instead of the full anaconda as the full version is too heavy and many libraries aren’t used in fastai.

For people who want to continue using miniconda and install only the libraries necessary for the class,
conda update conda
then
conda update -c fastai fastai
worked for me. I created my conda env with Python 3.6 and it only updated to 4.5.11 instead of 5.3.0 (I suppose that one is based on Python 3.7), but it was able to update fastai to the latest stable version, 1.0.11.


(Stas Bekman) #139

you’re correct, @shoof. I made an assumption you were on anaconda.

The dependencies were adjusted last night, so the next fastai release will fix it for you automatically.

Thank you for your feedback.


(Stas Bekman) #140

What is it showing now, after you purged and reinstalled the nvidia drivers and rebooted, @drscotthawley?


(Nikhil) #141

I have added a few steps in this forum post, which might be useful:
https://forums.fast.ai/t/platform-gcp/27375/58?u=nikhil.ikhar


(Stas Bekman) #142

I have integrated your input here, @drscotthawley:
https://docs-dev.fast.ai/troubleshoot.html#correctly-configured-nvidia-drivers
Please, let me know if there is anything else that would be useful to add.


(Scott H Hawley) #143

As of this morning, without having rebooted (to my knowledge – sometimes my workplace has random power outages at night) I’m seeing new duplicate versions of both bbswitch and nvidia-396

$ dkms status

bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-138-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-138-generic, x86_64: installed

…and yet the fastai & pytorch libraries are still working fine with no CUDA errors. So it seems that having those duplicates was not the key problem. That part of the troubleshooting instructions could probably be removed.

Perhaps it was the sudo apt-get purge nvidia-* that actually was responsible for the fix. I apologize for not cataloguing my actions more systematically as I was doing them; I was just trying to get something working.

The general strategy for users is to remove everything NVIDIA- and CUDA-related (perhaps even downloading the various NVIDIA & CUDA installation scripts and running them with the --uninstall option) until utilities such as nvidia-smi no longer work. (This may include rebooting one or more times.) At that point, it is safe to install the nvidia-xxx driver again, reboot, and install pytorch and fastai v1.
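The "until nvidia-smi no longer works" test can be scripted. Below is a minimal Python sketch; `utility_works` is a hypothetical helper, and it assumes that nvidia-smi exits with a non-zero status when it cannot reach a driver.

```python
import shutil
import subprocess
from typing import Sequence


def utility_works(command: Sequence[str], timeout: float = 30.0) -> bool:
    """Return True only if the program is found and exits with status 0."""
    if shutil.which(command[0]) is None:
        return False
    try:
        result = subprocess.run(
            list(command),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if utility_works(["nvidia-smi"]):
        print("nvidia-smi still works: a driver is still present")
    else:
        print("nvidia-smi is missing or failing: no usable driver found")
```

Run it after each removal step; once it reports no usable driver, reinstall the single driver you want, reboot, and run it again to confirm.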