Fastai v1 install issues thread

Sorry if this is the wrong thread to reply on… Do you know when PyTorch 1.0 will be available for Windows, so that I don’t have to build it myself?

I spent a whole day installing VS2017 and following the steps on the PyTorch website, but I can’t get it to compile…
I went through trial and error installing cmake, msbuild, VS2017, v141, etc., but still can’t get it done.

They said at the conference that v1 would really be out around NIPS, though you can also ask that question on the pytorch forum :wink:

Just wanted to post that I built a new Ubuntu 18.04 box today with a 1080ti and was able to get it all installed fairly straightforwardly (only 1 full OS re-install).

I mainly followed this guide - http://blog.jeffhaluska.com/adventures-in-installing-pytorch-in-ubuntu-18-04/

Instead of using the Ubuntu GUI (first 2 steps) to install the nvidia drivers, I installed them with apt from the PPA repo:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-410

I did conda install for pytorch with this:

conda install pytorch-nightly cuda92 -c pytorch

How do I install the latest version of fastai?
Running conda update fastai -c fastai only installs version 1.0.6.

UPDATE: I managed to install the latest version after doing these steps:

  1. conda update conda -y
  2. conda install anaconda -y

If you encounter a permission error, do this:
sudo chown -R user /home/user/src
where user is your username and /home/user/src is the folder where anaconda is installed in Ubuntu.

I’m getting “ImportError: libcuda.so.1: cannot open shared object file: No such file or directory”

As per the Troubleshooting thread, my driver is
nvidia-396/xenial,now 396.54-0ubuntu0~gpu16.04.1 amd64 [installed]
and nvidia-smi is working.

$ find /usr/ | grep libcuda.so
/usr/share/man/man7/libcuda.so.7
/usr/local/cuda-9.2/lib64/stubs/libcuda.so
/usr/local/cuda-9.2/doc/man/man7/libcuda.so.7

…these are from the CUDA that I just installed tonight in trying to get FastAI working.
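
A side note on reading that find output: as far as I know, libcuda.so.1 is shipped with the NVIDIA driver rather than with the CUDA toolkit, and the toolkit's stubs/libcuda.so is only a link-time placeholder. The sketch below filters a list of paths like the one above down to candidates that could be the real driver library. The helper name real_libcuda is made up for illustration, and the filtering rule (drop man pages and stubs) is an assumption, not an official check.

```python
from pathlib import PurePosixPath


def real_libcuda(paths):
    """Keep only paths that look like a real driver library.

    Assumption: man pages and the toolkit's `stubs/` directory are not
    usable at runtime, so both are dropped.
    """
    keep = []
    for raw in paths:
        p = PurePosixPath(raw)
        if "stubs" in p.parts or "man" in p.parts:
            continue
        if p.name.startswith("libcuda.so"):
            keep.append(raw)
    return keep


# The three paths reported by `find /usr/ | grep libcuda.so` in the post.
found = [
    "/usr/share/man/man7/libcuda.so.7",
    "/usr/local/cuda-9.2/lib64/stubs/libcuda.so",
    "/usr/local/cuda-9.2/doc/man/man7/libcuda.so.7",
]
print(real_libcuda(found))  # empty list: no driver copy of libcuda was found
```

An empty result here would be consistent with the ImportError: nothing under /usr/ provides the library that torch is trying to load.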

On this Ubuntu 16.04 system, previously I was using CUDA 9.1 with CUDNN7.0 with PyTorch and it worked fine with my GPU. The things that are broken tonight are a result of trying to follow the Fast.ai installation instructions.

I created a fresh conda environment and ran the 3 conda install lines to install pytorch and the fastai packages, but when I run the test line, I get

$ python -c 'import fastai; fastai.show_install(1)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/__init__.py", line 1, in <module>
    from .basic_train import *
  File "/opt/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/basic_train.py", line 2, in <module>
    from .torch_core import *
  File "/opt/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/torch_core.py", line 2, in <module>
    from .imports.torch import *
  File "/opt/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/imports/__init__.py", line 2, in <module>
    from .torch import *
  File "/opt/anaconda/envs/fastai/lib/python3.6/site-packages/fastai/imports/torch.py", line 1, in <module>
    import torch, torch.nn.functional as F
  File "/opt/anaconda/envs/fastai/lib/python3.6/site-packages/torch/__init__.py", line 84, in <module>
    from torch._C import *
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

Would appreciate any help! Thanks.

PS- I do need CUDA installed ‘normally’ on my system, because I also have a project that uses Keras that I need to run on this machine. (…Oh geez, and I removed CUDA 9.1 in doing the Troubleshooting, but my Tensorflow was built from source using 9.1!)

There is an issue with the recent fastai conda packages if you have an outdated anaconda package, because the outdated package wants numpy < 1.15.

Do:

conda install anaconda

which should install anaconda 5.3.0 or higher, and then it should work.

I changed the meta.yaml deps to no longer ask for numpy>=1.15 and rolled them back to >=1.12, so the next release should take care of it.
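
To illustrate why the solver got stuck: the two requirements (numpy >= 1.15 from the fastai package, numpy < 1.15 from the old anaconda package) leave no version that satisfies both. Here is a minimal sketch using plain tuple comparison. The helpers parse and satisfiable are hypothetical names, and parse only handles simple dotted numbers, which is much less than conda's real version matching does.

```python
def parse(version):
    """Turn '1.14.5' into (1, 14, 5). Only handles plain dotted numbers."""
    return tuple(int(part) for part in version.split("."))


def satisfiable(lower_inclusive, upper_exclusive):
    """True if some version v can satisfy lower <= v < upper."""
    return parse(lower_inclusive) < parse(upper_exclusive)


# fastai asked for numpy >= 1.15 while the old anaconda package wanted < 1.15
print(satisfiable("1.15", "1.15"))  # False: the solver has nothing to pick
# after the rollback to numpy >= 1.12 the two ranges overlap again
print(satisfiable("1.12", "1.15"))  # True
```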

I think you need to update your drivers

I would also suggest trying a reboot; that can sometimes fix things by itself.

What @wdhorton said. And as I said in the other thread, unfortunately we don’t know how to help you, as it’s a pytorch issue. I have been collecting notes from people who overcame that issue, but the pytorch forums are really the best place to ask if what we have collected so far wasn’t helpful.

Really you should be able to run import torch and only then try fastai. I hope this makes sense.
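
A minimal sketch of that order of checks, in case it helps someone narrow down where the failure comes from. The try_import helper is a made-up name, not part of fastai or torch; it simply imports each module in turn and stops at the first one that fails.

```python
import importlib


def try_import(name):
    """Import `name` and report success or the error message."""
    try:
        importlib.import_module(name)
    except Exception as exc:  # ImportError, OSError from missing shared libs, ...
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


# Check torch on its own first; only look at fastai if torch imports cleanly.
for module in ("torch", "fastai"):
    ok, detail = try_import(module)
    print(f"{module}: {detail}")
    if not ok:
        break
```

If the first line printed already shows an error for torch, the problem sits below fastai and the pytorch forums are the place to take it.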

I have encountered this issue myself and indeed reinstalling the nvidia drivers (not cuda drivers) and rebooting solved the problem.

Remember, pytorch-1 doesn’t care what cuda drivers you have installed on your system. It fetches its own, so you can do anything you want system-wide.

And as you debug this please take notes and then share with everybody how you resolved this issue. Thank you.

Thanks for the suggestions @wdhorton .
Drivers are already the “396.xx+ driver” as required in the Troubleshooting guide.
Rebooted, but error persists.
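
For anyone who wants to script the "396.xx+" check, here is a rough sketch. It assumes nvidia-smi is on PATH and uses its --query-gpu=driver_version option; the function names and the threshold default are my own choices for illustration. Note that a passing version check only rules out one cause, as this thread shows.

```python
import shutil
import subprocess


def driver_at_least(version, minimum_major=396):
    """Compare the major part of a driver version such as '396.54'."""
    return int(version.split(".")[0]) >= minimum_major


def query_driver_version():
    """Ask nvidia-smi for the driver version; None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if result.returncode == 0 and lines else None


version = query_driver_version()
if version is None:
    print("could not read a driver version from nvidia-smi")
else:
    print(version, "ok" if driver_at_least(version) else "too old")
```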

This is not enough. You need to check that you don’t have other nvidia drivers installed (perhaps from a manual install). That’s why the recommendation is to wipe your system clean first. Check that no nvidia driver is left (not just through the package manager, but using locate or whatever else you use to search your filesystem). Make sure nothing is left, then install a single nvidia driver, reboot, and check again.

So this is a PyTorch nightly-build issue, to be specific.
Will FastAI work with a previous “main” release version of PyTorch, or does it require the nightly build?

Because as I said, I had no problem with PyTorch (0.4.0), and I could just go back to using that if Fast.ai will ‘talk to it’ appropriately.

pytorch nightly-build is pre-1.0 - it’s required by fastai-1.x

The other approach is to build pytorch from source; then it’ll use your system’s cuda. See pytorch.org for details, it’s pretty straightforward: https://github.com/pytorch/pytorch#from-source

Hi,
I am just starting the first lesson and I have a problem. The install seemed to go ok in that it can import fastai. But then it tries to import the images using this path
/home/pete/.fastai/data/dogscats
With that path it cannot find a “valid” directory.

I assume that this is a simple problem due to my unfamiliarity with the software. Can anyone help?

Thanks
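
One quick way to see what is actually on disk is to list which expected subfolders are missing under the dataset path. This is only a sketch: the "valid" folder name comes from the error described above, while "train" is my assumption about the expected layout, and missing_subdirs is a made-up helper.

```python
from pathlib import Path


def missing_subdirs(root, expected=("train", "valid")):
    """Return the expected subfolder names that do not exist under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]


# Same location as in the post, built from the current user's home directory.
data_dir = Path.home() / ".fastai" / "data" / "dogscats"
print(data_dir, "is missing:", missing_subdirs(data_dir))
```

If both names come back as missing, the dataset was probably never downloaded or unpacked to that location.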

Just had a weird fallout on my conda env switching/activating after updating to a recent conda version. It suggested that I need to remove:

export PATH="/home/stas/anaconda3/envs/pytorch-dev/bin:$PATH"

from .bashrc since it was only used prior to conda 4.4, which I did, and then it fell apart completely not letting me do:

conda activate <other-env>

complaining with:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with

    $ echo ". /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh" >> ~/.bashrc

or, for all users, enable conda with

    $ sudo ln -s /home/stas/anaconda3/envs/pytorch-dev/etc/profile.d/conda.sh /etc/profile.d/conda.sh

The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH.  To do so, run

    $ conda activate

in your terminal, or to put the base environment on PATH permanently, run

    $ echo "conda activate" >> ~/.bashrc

Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file.  You should manually remove the line that looks like

    export PATH="/home/stas/anaconda3/envs/pytorch-dev/bin:$PATH"

^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

Which was just plain wrong, since after I added what it recommended things got even worse. After some research I found that the correct change was to add this:

# get conda env switching to work
. /home/stas/anaconda3/etc/profile.d/conda.sh

to .bashrc instead. Of course yours would be adjusted for your /home/user.
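
The difference between the suggested path and the working one is that conda.sh lives under the base install, not under the environment. A small sketch of that path arithmetic follows. It assumes environments sit under <base>/envs/<name>, as in the paths quoted above, and conda_sh_for is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def conda_sh_for(prefix):
    """Guess where conda.sh lives, given a base or environment prefix.

    Assumption: environments sit under `<base>/envs/<name>`.
    """
    p = PurePosixPath(prefix)
    if p.parent.name == "envs":
        p = p.parent.parent  # step up from <base>/envs/<name> to <base>
    return str(p / "etc" / "profile.d" / "conda.sh")


print(conda_sh_for("/home/stas/anaconda3/envs/pytorch-dev"))
# /home/stas/anaconda3/etc/profile.d/conda.sh
```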

Managed to get it working! Turns out there were two slightly different modules for nvidia-396 attached to the kernel via depmod that showed up when I ran dkms status.
Deleting both of them and then reinstalling nvidia-396, and then rebooting, (and then repeating the usual fastai installation) did the trick.
Proof:

>>> from fastai import *
>>> torch.cuda.is_available()
True

Thanks for your help.

For those who may run into similar problems: I found this post from 2013 on purging NVIDIA drivers to be quite helpful.

EDIT: To be more specific on what I did…

  1. I did all of this via ssh into my machine. (I have no idea what happens when you remove an NVIDIA driver if you’re physically on the console.)

  2. I ran $ sudo apt list --installed | grep nvidia, which showed only nvidia-396 and no other drivers, so this did not help. I also ran $ sudo apt-get purge nvidia-* but this didn’t seem to help either.

  3. I had previously installed CUDA 9.2 by downloading and running the linux “run script” from the NVIDIA website. So now, re-running this script (as root) with --uninstall eventually gave the message “There are no NVIDIA drivers on your system.” This still was not a complete solution.

  4. $ dkms status showed that there were two nvidia-396’s and matching versions of something called bbswitch:

    $ dkms status
    bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
    bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
    nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
    nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed

So I removed the older ones:

    $ sudo dkms remove bbswitch/0.8 -k 4.4.0-135-generic
    $ sudo dkms remove nvidia-396/396.54 -k 4.4.0-135-generic

  5. Now the following few steps may have been overkill, but… Just to be sure, I removed nvidia-396 completely: $ sudo apt-get remove nvidia-396 and then rebooted.
  6. At this point, there was no driver. nvidia-smi did not work, because it said there was no driver.
  7. Now I re-installed nvidia-396: $ sudo apt-get install nvidia-396 and even though it gave messages about cuda 9.0 and 9.1 and nothing about 9.2, I persisted.
  8. Rebooted. Now nvidia-smi worked.
  9. Installed fast.ai into a fresh conda environment as directed. Everything works. :slight_smile:
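
For anyone comparing their own dkms status output, here is a small sketch that groups the lines by module and lists the kernels each one is built for. It assumes the comma-separated layout quoted above (other dkms versions may print a different format), and kernels_per_module is a made-up helper name. Seeing one entry per installed kernel may well be normal, so treat the result as a starting point rather than a diagnosis.

```python
from collections import defaultdict


def kernels_per_module(dkms_output):
    """Map 'module/version' to the kernels it is built for."""
    built = defaultdict(list)
    for line in dkms_output.strip().splitlines():
        fields, _, _state = line.partition(":")
        parts = [f.strip() for f in fields.split(",")]
        if len(parts) < 3:
            continue  # skip lines that do not match the format shown above
        module, version, kernel = parts[:3]
        built[f"{module}/{version}"].append(kernel)
    return dict(built)


# Sample copied from the dkms status output in the post.
sample = """\
bbswitch, 0.8, 4.4.0-135-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-135-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
"""

for name, kernels in kernels_per_module(sample).items():
    print(name, "->", ", ".join(kernels))
```

The module/version keys are written in the same form that the dkms remove commands above take as their argument.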

awesome!

But can you be more specific about how you diagnosed it, with commands and everything, so that others can follow your steps? Remember, most people will have no idea what you did to solve it. How come you had two nvidia modules in your kernel? What were the names in the depmod output, etc.? I hope you understand what I’m after: “Here are the steps to identify and solve this problem: …”

The article you linked to is only helpful for fixing the issue; what we need is clear steps for identifying the problem.

Thank you.

One crucial step from the article you linked that I have added to the troubleshooting docs is:

Purge, not just remove, all installed NVIDIA packages:

$ sudo apt-get purge nvidia-*

So this is already very useful.

Ok, I’ll edit my earlier post and add more information.

Thank you for taking the time to write out the steps, @drscotthawley! That’s excellent.

Any nvidia experts here? When I run this on my machine I too get doubles:

$ dkms status
bbswitch, 0.8, 4.15.0-34-generic, x86_64: installed
bbswitch, 0.8, 4.15.0-36-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-34-generic, x86_64: installed
nvidia-396, 396.44, 4.15.0-36-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-34-generic, x86_64: installed
virtualbox, 5.2.10, 4.15.0-36-generic, x86_64: installed

Yet I don’t have this problem with pytorch. Perhaps this particular double is normal and not a harbinger of this issue? Can anybody with nvidia knowledge confirm?

Is it possible that this was really the key?

sudo apt-get purge nvidia-*

I think it’s the one that will also remove dkms modules, right?

I guess it would be hard for you to tell now since the original setup is no more.

I ran into this issue as well, solved it by downgrading notebook:

conda install notebook==5.6.0

There seems to be an issue with the latest 5.7.0; see https://github.com/jupyter/notebook/issues/3946
