Fastai v1 install issues thread

techjoey · October 23, 2018, 7:27pm

Hello! I am having trouble running the Practical Deep Learning Class Development Environment setup in Paperspace. I was going to use this post to help me, but I received an error message when I clicked on “main” below Paperspace in the thread. The error message: “Sorry, you don’t have access to that topic!” Can you please give me access or open this up to the public? Thank you!

stas · October 23, 2018, 9:56pm

drscotthawley:

As of this morning, without having rebooted (to my knowledge – sometimes my workplace has random power outages at night) I’m seeing new duplicate versions of both bbswitch and nvidia-396
$ dkms status

bbswitch, 0.8, 4.4.0-137-generic, x86_64: installed
bbswitch, 0.8, 4.4.0-138-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-137-generic, x86_64: installed
nvidia-396, 396.54, 4.4.0-138-generic, x86_64: installed
…and yet the fastai & pytorch libraries still working fine with no CUDA errors. So, it seems that having those duplicates was not the key problem. That part of the troubleshooting instructions could probably be removed.

That’s good to know. I wonder whether someone here knows why there are double entries that differ only slightly. Perhaps those are just the last 2 kernel builds?

Perhaps it was the sudo apt-get purge nvidia-* that actually was responsible for the fix. I apologize for not cataloguing my actions more systematically as I was doing them; I was just trying to get something working.

Most likely. I will leave the dkms section for now and adjust it later once I read more success/failure reports, since the article you originally posted mentioned removing dkms modules manually. I don’t think sudo apt-get purge nvidia-* will be able to do that for things it didn’t create, and many people install NVIDIA drivers manually. It was the recommended way until recently, at least.

The general strategy for users is to remove everything NVIDIA- and CUDA-related (perhaps even downloading the various NVIDIA and CUDA installation scripts and running them with the --uninstall option) until utilities such as nvidia-smi no longer work. (This may include rebooting one or more times.) At that point, it is safe to install the nvidia-xxx driver again, reboot, and install pytorch and fastai v1.

That one is quite murky too. If you reboot before installing new modules, I think Ubuntu, at least, may switch to the nouveau driver automatically, and then you have to clean that up too. So perhaps cleaning up and reinstalling w/o a reboot is a safer approach. Let’s see.

Thank you for your feedback and suggestions so far, @drscotthawley.

stas · October 23, 2018, 10:01pm

My apologies, I linked to the section that is open only to class attendees. It has been fixed now. If you find good posts that are specific to Paperspace, please link them from the first post; anybody can edit it. Thank you. Just make sure that they discuss fastai-1.0: most pre-fall-2018 posts talk about fastai-0.7, which is not what you want. If you can’t find any, please start a new thread instead.

balnazzar · October 23, 2018, 11:53pm

Mates, it is not strictly an “install issue”, so I posted in another section, but my (successful) fastai v1 installation refuses to work properly.

This is the thread: Python refuses abiding by SIGKILL

Maybe the maintainers could take a look.

abi · October 24, 2018, 1:38am

Hi @stas ,

In the installation instruction you write:

However, note that you will most likely need a 396.xx+ driver for pytorch built with cuda92. For older drivers you will probably need to install pytorch with cuda90 or even earlier.

Is the 396.xx driver a hard requirement for PyTorch 1.0 and/or fastai-v1 (used for the latest course)?

My machine currently has an NVIDIA-SMI 367.106 driver with CUDA 8.0. Can I install pytorch-nightly with the latest fastai on my old 367 driver?

Didn’t want to mess up my current working driver setup. So paused updating the driver until I got some guidance. I have wasted days historically getting the drivers to install correctly and create separate working environments for fastai and keras.

Any guidance on if 396 driver is a hard requirement is appreciated.

stas · October 24, 2018, 2:26am

I did a bit of googling, and these seem to be the driver requirements for the 3 cuda versions pytorch currently supports in binary form:

CUDA 8.0 requires NVIDIA 361+
CUDA 9.0 requires NVIDIA 384+
CUDA 9.2 requires NVIDIA 396+

So yes, you should be able to use it. Just change the install instruction to:

conda install -c pytorch pytorch-nightly cuda80

or:

pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu80/torch_nightly.html

update: I documented the requirements here.

update: I found the authoritative source with all of it, including cuda 10!

CUDA Toolkit	Linux x86_64 Driver Version	Windows x86_64 Driver Version
CUDA 10.0.130	>= 410.48	>= 411.31
CUDA 9.2 (9.2.148 Update 1)	>= 396.37	>= 398.26
CUDA 9.2 (9.2.88)	>= 396.26	>= 397.44
CUDA 9.1 (9.1.85)	>= 390.46	>= 391.29
CUDA 9.0 (9.0.76)	>= 384.81	>= 385.54
CUDA 8.0 (8.0.61 GA2)	>= 375.26	>= 376.51
CUDA 8.0 (8.0.44)	>= 367.48	>= 369.30
CUDA 7.5 (7.5.16)	>= 352.31	>= 353.66
CUDA 7.0 (7.0.28)	>= 346.46	>= 347.62
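
The table above can be turned into a small lookup. This is only a minimal sketch, not an official tool: it hard-codes the Linux column from the table, the function names are made up for illustration, and it assumes driver versions compare component by component as integers (so 367.106 is newer than 367.48).

```python
# Minimum Linux x86_64 driver per CUDA toolkit, copied from the table above.
# Ordered from the newest toolkit to the oldest.
MIN_LINUX_DRIVER = [
    ("CUDA 10.0.130", "410.48"),
    ("CUDA 9.2 (9.2.148 Update 1)", "396.37"),
    ("CUDA 9.2 (9.2.88)", "396.26"),
    ("CUDA 9.1 (9.1.85)", "390.46"),
    ("CUDA 9.0 (9.0.76)", "384.81"),
    ("CUDA 8.0 (8.0.61 GA2)", "375.26"),
    ("CUDA 8.0 (8.0.44)", "367.48"),
    ("CUDA 7.5 (7.5.16)", "352.31"),
    ("CUDA 7.0 (7.0.28)", "346.46"),
]


def version_key(version):
    """Turn '367.106' into (367, 106) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_cuda_for_driver(driver_version):
    """Return the newest toolkit row whose minimum driver is satisfied, or None."""
    installed = version_key(driver_version)
    for toolkit, minimum in MIN_LINUX_DRIVER:
        if installed >= version_key(minimum):
            return toolkit
    return None


print(newest_cuda_for_driver("367.106"))
print(newest_cuda_for_driver("396.54"))
```

For the 367.106 driver mentioned earlier this picks the CUDA 8.0 (8.0.44) row, which is consistent with using the cuda80 build.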

abi · October 24, 2018, 2:31am

Perfect, thanks!

Thank you for the installation trouble-shooting documentation. This is the best doc that I have come across for understanding GPU driver/CUDA/Cudnn/PyTorch/Fastai installation requirements. 10x better than nVidia’s own documentation

arisbw · October 24, 2018, 10:10am

Hello, has anyone succeeded in installing and using v1 in Colab? I tried to install the dependencies by using

!pip install torch_nightly -f https://download.pytorch.org/whl/nightly/cu90/torch_nightly.html
!pip install fastai

At first, I thought this would work because I could import fastai. But when I tried to do some training, this error appeared:

RuntimeError: DataLoader worker (pid 240) is killed by signal: Bus error.

This was weird because, when I looked up this error, it seemed to be caused by limited shared memory. But I immediately checked GPU memory and it was barely used (<1 GB of 12 GB). Is there something that I need to do first in order to use it in Colab?
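
One thing worth separating here: the shared memory that DataLoader worker processes use to pass batches around is host memory (on Linux usually the /dev/shm tmpfs), not GPU memory, so a nearly empty GPU does not rule it out. A minimal sketch for checking it, assuming a Linux host where /dev/shm is the relevant mount (the helper name `shm_free_gib` is made up for illustration):

```python
import os
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return the free space at `path` in GiB, or None if the path does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free / 2**30


free = shm_free_gib()
if free is None:
    print("/dev/shm not found on this system")
else:
    print(f"/dev/shm free: {free:.2f} GiB")
```

If the value turns out to be small, one workaround that is often suggested is running the PyTorch DataLoader with num_workers=0, so that no worker processes are involved, at the cost of slower data loading.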

sgugger · October 24, 2018, 2:04pm

This has been asked and answered many times, here for instance. The short answer is Colab doesn’t support pytorch v1 yet so it doesn’t support fastai v1.

stas · October 24, 2018, 9:19pm

Some users still have an issue with conda refusing to update/install beyond fastai-1.0.6, even after reverting numpy pin to what it was from day 0. If you’re one of them please help us diagnose the problem so that we could fix it.

To explain the problem: If you previously installed some conda package foo that for example requires spacy <=2.0.15, and now you’re trying to install fastai==1.0.12 which requires spacy==2.0.16 there is a conflict, and conda will search older fastai packages until it finds one where the dependencies don’t conflict. And fastai-1.0.6 happens to be that. So we need to find that package that interferes.
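
The fallback behaviour described above can be illustrated with a toy model. This is only a sketch: conda's real solver is far more involved, and the 1.0.6 entry below (no exact spacy pin) is a hypothetical stand-in for "an older release whose dependencies do not conflict". Only the 1.0.12 pin comes from this post.

```python
# fastai version -> exact spacy pin it requires (None means "no exact pin").
FASTAI_SPACY_PIN = {
    "1.0.12": "2.0.16",  # from the post above
    "1.0.6": None,       # hypothetical, for illustration only
}


def as_tuple(version):
    """Turn '2.0.16' into (2, 0, 16) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_installable(spacy_upper_bound):
    """Pick the newest fastai whose spacy pin fits under an existing `spacy <= bound`."""
    for fastai_version in sorted(FASTAI_SPACY_PIN, key=as_tuple, reverse=True):
        pin = FASTAI_SPACY_PIN[fastai_version]
        if pin is None or as_tuple(pin) <= as_tuple(spacy_upper_bound):
            return fastai_version
    return None


print(newest_installable("2.0.15"))  # some package foo holds spacy at <=2.0.15
print(newest_installable("2.0.16"))  # nothing holds spacy back
```

With the spacy <=2.0.15 constraint in place the toy resolver lands on 1.0.6; without it, it picks 1.0.12.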

The only conda dependencies that changed between 1.0.6 and 1.0.12 are:

- spacy ==2.0.16
- regex ==2018.8.29
- thinc ==6.12.0
- cymem ==2.0.2
- pyyaml

Could those who have this problem try to install these specific versions on your setup? i.e.:

conda install spacy==2.0.16
conda install regex==2018.8.29
conda install thinc==6.12.0
conda install cymem==2.0.2
conda install pyyaml

If any of these fails (i.e. conda refuses to install that package, or says it doesn’t have it), then rerun the same command after adding --debug to it. It will produce a lot of output. Please paste it on https://pastebin.com/ and send the link here. So for example:

conda install spacy==2.0.16 --debug >& out

and paste the contents of out to pastebin.
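
To avoid retyping, the commands above can be generated from the list of pins. A minimal sketch that only builds and prints the command lines and does not run conda itself; the helper name `conda_install_command` is made up for illustration:

```python
# Pins listed in the post above; pyyaml has no version pin.
PINS = [
    "spacy==2.0.16",
    "regex==2018.8.29",
    "thinc==6.12.0",
    "cymem==2.0.2",
    "pyyaml",
]


def conda_install_command(spec, debug=False):
    """Build the argument list for one `conda install` call."""
    command = ["conda", "install", spec]
    if debug:
        command.append("--debug")
    return command


# First pass: the plain commands. For any that fails, rerun with debug=True.
for spec in PINS:
    print(" ".join(conda_install_command(spec)))
```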

Thank you.

There is also a very simple solution. Create a fresh conda environment and don’t install anything there other than fastai and its requirements, and keep it that way. You will not have any such conflicts in the future.

pinging: @shoof, @jakcycsl since you reported this issue originally.

shoof · October 25, 2018, 12:57am

@stas I think providing the following info would help with troubleshooting:

conda version
python version used in conda
OS version

The problem I had was not being able to get fastai > 1.0.6 when using conda 4.3.0 on Python 3.6, Ubuntu 18.04. Initially I tried to pip install the git repo directly to get the latest 1.0.12dev, and it worked. I then followed your suggestion to update my conda to the latest version possible (4.5.11) for Python 3.6, and then I was able to update fastai to 1.0.11 (not 1.0.12). Neither 1.0.11 nor 1.0.12dev has had any issues so far, but I am using 1.0.11 now.
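
A quick way to compare a setup against the one that worked here is to parse the output of `conda --version`. A minimal sketch: treating 4.5.11 as a minimum is an assumption based only on this one report, not a documented requirement, and the function names are made up for illustration.

```python
import re


def parse_conda_version(text):
    """Extract (major, minor, patch) from text such as 'conda 4.5.11'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(group) for group in match.groups())


# The conda version reported as working above (an assumed threshold).
KNOWN_GOOD = (4, 5, 11)


def conda_is_recent_enough(version_output):
    """Compare the parsed version against the known-good one."""
    return parse_conda_version(version_output) >= KNOWN_GOOD


print(conda_is_recent_enough("conda 4.3.0"))
print(conda_is_recent_enough("conda 4.5.11"))
```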

stas · October 25, 2018, 1:24am

That’s very useful feedback, @shoof. Thank you.

So you’re saying that it was sufficient to update your conda to the latest and the problem was fixed.

And unlike anaconda, requiring a certain conda version should work for both miniconda and anaconda users, correct?

@jeremy, you earlier said you didn’t think we should pin min versions for pip and conda, but it looks like we may need to in this case, since most likely there is something in the older conda versions that sets pins that conflict with our dependencies. I certainly traced it to the anaconda package wrt lower numpy pin, which probably had in its bundle packages that pinned to a low version of numpy.

Unlike with other packages, I believe pinning install managers to higher versions is a good thing and does not force anything weird on users (>= of course, and not ==). Especially since, if we don’t, we will have to require users to use their virtual env exclusively for fastai in order to avoid pin conflicts with any other package out there which might decide to pin down something we need a higher version of. So to me, asking for the latest (or almost so) package manager for pypi and conda is a reasonable thing to do. That was also the reason why I bumped the pip requirement (but I have reverted that since).

But I’m also fine with not enforcing that, and just have the docs say - must update your pip/conda first before installing fastai, or do so if you have a problem - we already say that - except who reads the docs…

shoof · October 25, 2018, 1:40am

Yes.

My miniconda setup works. I suppose Anaconda should work too. If people use different Python versions, their conda versions may not update to the latest one (5.3.0 on Python 3.7 for now?) but so far Python 3.6 + miniconda 4.5.11 + Ubuntu 18.04 have no issues.

If for some reason people cannot update conda to the latest version, a quick and dirty solution is to pip install from the github repo directly:
pip install git+https://github.com/fastai/fastai/

This approach does not require conda update at all. I think it may not be sustainable throughout the course and may cause more issues with different setups though.

stas · October 25, 2018, 1:43am

Well, I realize my comment implied I was suggesting to always ask for the latest, but to clarify I only meant a newer version that we know works for all.

And yes, it works just fine on anaconda.

Thanks again.

jeremy · October 25, 2018, 2:35am

Not quite what I said (or at least not what I meant).

I think we shouldn’t ever pin things unless we explicitly know that a particular version is required to have fastai work correctly. I have a feeling however that pinning conda itself doesn’t work. A quick bit of googling suggests that may be the case.

stas · October 25, 2018, 3:15am

ok, so you’re making a critical distinction between “working correctly” and “automatically installing correctly”. So if the latter is not a requirement and can be documented instead then I understand your position better now. All is good then.

I will add a note not to pin or up the pins unless fastai can’t work correctly w/o it.

Thank you.

jeremy · October 25, 2018, 3:16am

I would certainly like that too - but unfortunately I don’t think you can automatically update conda using conda itself. (I would love to be wrong on that, since apparently we do need a more up to date version.)

shaun1 · October 25, 2018, 12:18pm

There seem to be differences between the install instructions on the docs page and those on the github readme page.

From the docs page:

conda install -c pytorch -c fastai fastai pytorch-nightly cuda92

From the github page:

conda install -c pytorch pytorch-nightly cuda92
conda install -c fastai torchvision-nightly
conda install -c fastai fastai

Is this intentional? Which should we follow? I’m currently going by the github page.

jeremy · October 25, 2018, 2:13pm

The github page is better. The docs page is a briefer version that saves space on the page, but makes debugging harder if something goes wrong.

tdoucet · October 27, 2018, 6:22am

I am having the same problem after updating the nVidia driver. The driver itself works, and nvidia-smi sees it, but fastai does not find the GPU (because pytorch does not find it). I do not have a solution yet, but I have a couple of observations which might be useful.

The part that is failing seems to be pytorch loading module _C, and the error message is, as you indicate, that libcuda.so.1 is not found.

Here is the output of ldd run on that particular module:

todd@build ~ $ ldd ./anaconda3/envs/foo/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so
	linux-vdso.so.1 =>  (0x00007ffeae4d7000)
	libshm.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libshm.so (0x00007f77c35b5000)
	libcudart-72ec04ea.so.9.2 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libcudart-72ec04ea.so.9.2 (0x00007f77c3348000)
	libnvToolsExt-3965bdd0.so.1 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libnvToolsExt-3965bdd0.so.1 (0x00007f77c313d000)
	libcaffe2.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libcaffe2.so (0x00007f77c0c67000)
	libc10.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libc10.so (0x00007f77c0a52000)
	libcaffe2_gpu.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so (0x00007f779482b000)
	libtorch.so.1 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libtorch.so.1 (0x00007f7793bd8000)
	libgcc_s.so.1 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/../../../libgcc_s.so.1 (0x00007f7793bc3000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f7793982000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f77935b8000)
	/lib64/ld-linux-x86-64.so.2 (0x0000561479493000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f77933af000)
	libstdc++.so.6 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/../../../../libstdc++.so.6 (0x00007f779326e000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7792f65000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f7792d60000)
	libmkl_intel_lp64.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/../../../../libmkl_intel_lp64.so (0x00007f779222f000)
	libmkl_gnu_thread.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/../../../../libmkl_gnu_thread.so (0x00007f77909f6000)
	libmkl_core.so => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/../../../../libmkl_core.so (0x00007f778c8bc000)
	libcuda.so.1 => not found
	libnvrtc-3fc78a74.so.9.2 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/libnvrtc-3fc78a74.so.9.2 (0x00007f778b2ad000)
	libgomp.so.1 => /home/todd/./anaconda3/envs/foo/lib/python3.6/site-packages/torch/lib/../../../../libgomp.so.1 (0x00007f778b286000)
	libcuda.so.1 => not found

As you can see, libcuda.so.1 is being referenced there, but it does not resolve successfully on my system (because it doesn’t exist after the re-install of the nVidia driver I guess). Note that there is also a reference to the cuda runtime 9.2, and that resolves successfully–it is part of the local pytorch installation.

I did the same ldd test on the official image on GCP, and in that one there is also a reference to libcuda.so.1, but that resolves because it exists on that image (in /usr/lib/x86_64-linux-gnu).

So, what I currently believe is that the pytorch bundled in the fastai library does indeed use its own internal cuda 9.2, but in addition it requires a libcuda.so.1, probably for some auxiliary purpose separate from its main use of cuda.
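
For anyone who wants to repeat this check: as far as I know, libcuda.so.1 is normally installed by the NVIDIA driver package rather than by the CUDA toolkit or by pytorch, which would fit what ldd shows. A minimal sketch that pulls the "not found" entries out of ldd output and asks the dynamic linker directly; the helper names are made up for illustration, and the sample lines are taken from the output above:

```python
import ctypes


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found', without duplicates."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found" and name not in missing:
            missing.append(name)
    return missing


def can_load(library_name):
    """Try to load a shared library through the dynamic linker."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


sample = """\
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f7792f65000)
    libcuda.so.1 => not found
    libcuda.so.1 => not found
"""

print(missing_libraries(sample))
print("libcuda.so.1 loadable:", can_load("libcuda.so.1"))
```

If `can_load("libcuda.so.1")` prints False while nvidia-smi still works, that would suggest the driver's user-space library is missing or not on the linker's search path.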

That’s where I’ll leave it for this evening.