Successful Ubuntu 18.04 setup with iGPU for xserver and nvidia GPU for CUDA work

Yes, if I do that or dpkg -l | grep cuda I get an empty list.
What gets installed with every nvidia-driver seems to be some kind of runtime plus the binaries /usr/bin/nvidia-cuda-mps-control and /usr/bin/nvidia-cuda-mps-server:

http://manpages.ubuntu.com/manpages/bionic/en/man1/nvidia-cuda-mps-control.1.html

Those seem to manage the cuda part of the gpu. Then when I use conda to set up an environment, every version of e.g. pytorch comes with different cuda and cudnn packages. I seem to have the cuda toolkits for 8.0 and 9.0 running in different conda envs.
Those all show up if you do locate cuda.

Now there still is a dependency between the installed driver / cuda runtime and the conda packages. I have just tried to install pytorch 0.4 with cuda 9.2 enabled, but that gives me False for torch.cuda.is_available(). That seems to be because the minimum driver version (and, I assume, the bundled cuda runtime) for cuda 9.2 is nvidia-396, so with the nvidia-390 that I have installed I can only use up to cuda 9.1.

I re-googled some stuff; this article is what I think made me even test this out. Before, I always just assumed I absolutely had to install the cuda and cudnn packages from nvidia (with all the hassle that comes with that).

from this:

The NVIDIA display drivers come with a CUDA runtime library. That’s so you can run CUDA accelerated programs without having CUDA installed on your system. That’s usually just what you want.

Didn’t know that before. But yes, that is just what I want :wink:
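By the way, a quick way to convince yourself that this driver-side runtime really is there, without any cuda toolkit installed, is to query the driver API directly via ctypes. This is just a minimal sketch (Linux only, and it assumes the driver's libcuda.so.1 is on the loader path; the two functions are from the public CUDA driver API):

import ctypes

# libcuda.so.1 is shipped by the nvidia display driver itself, not by the
# cuda toolkit, so this should work on a box that only has the driver installed
libcuda = ctypes.CDLL('libcuda.so.1')

version = ctypes.c_int()
libcuda.cuInit(0)                                  # initialize the driver API
libcuda.cuDriverGetVersion(ctypes.byref(version))  # e.g. 9010 for cuda 9.1

print('driver supports cuda up to %d.%d'
      % (version.value // 1000, (version.value % 100) // 10))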

This is very good research, Marc. Thank you for sharing that.

Do you get cudnn as well w/o installing it separately? I understand it’s essential for a much faster execution in certain situations. And usually it requires a manual installation.

Well, if someone could explain to me how this cudnn “layer” actually works…
So - I can’t really answer that, because I don’t know how to really test or verify that. I run the following code for checking:

import torch
print('environment: fastai')
print('pytorch version:', torch.__version__)
print('cuda available? ->', torch.cuda.is_available(), ' - Version:',  torch.version.cuda)
print('cudnn enabled?  ->', torch.backends.cudnn.enabled,'  - Version:',  torch.backends.cudnn.version())

Now, for the fastai (not v1) environment, set up after a fresh pull, it gives me this:

environment: fastai
pytorch version: 0.3.1.post2
cuda available? -> True  - Version: 9.0.176
cudnn enabled?  -> True  - Version: 7005

That would lead me to think cudnn works correctly. But if I run it in the other conda env, with the pytorch cuda 9.2 version that I know isn't working, I get this:

environment: pytorch92
pytorch version: 0.4.1
cuda available? -> False  - Version: 9.2.148
cudnn enabled?  -> True   - Version: 7104

This shows that "cudnn.enabled" doesn't really mean anything :wink:
So, do you have a suggestion for how to check that it really is working? Is there some test setup that could be run where cudnn clearly performs better than plain cuda, and that could be switched on and off to show the difference between a pure cuda run and a cudnn run of a model?
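One thing I could imagine as such a test (just a rough sketch, assuming a pytorch 0.4+ env where cuda actually is available) would be to time the same convolutions with torch.backends.cudnn.enabled switched on and off:

import time
import torch
import torch.nn as nn

def time_convs(use_cudnn, iters=50):
    # toggle the cudnn backend, then time forward+backward passes of a small conv stack
    torch.backends.cudnn.enabled = use_cudnn
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),
        nn.Conv2d(64, 64, 3, padding=1),
        nn.Conv2d(64, 64, 3, padding=1),
    ).cuda()
    x = torch.randn(8, 3, 224, 224).cuda()
    for _ in range(5):                      # warm-up runs
        model(x).sum().backward()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x).sum().backward()
    torch.cuda.synchronize()                # wait for the gpu before stopping the clock
    return time.time() - start

if torch.cuda.is_available():
    print('with cudnn   :', time_convs(True))
    print('without cudnn:', time_convs(False))

If cudnn is really being picked up, the first number should come out noticeably smaller; with it disabled, pytorch falls back to its own cuda kernels.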

This also shows that it is absolutely possible to have different versions of cuda running on the same machine (which a lot of SO answers claim is impossible). (Caveat: not if you are trying to develop cuda applications yourself, but it works as long as you only run "finished" libraries like pytorch in different versions…)

If you run locate libcudnn on your machine (at least on mine), it shows that this lib is present as a .so in all the conda envs with pytorch (in different versions across different setups/envs). So I would expect this to work without installing the cudnn stuff manually from the nvidia developer account pages…
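If you want to double-check which of those copies a running process actually uses, one trick (Linux only, and just a sketch assuming pytorch 0.4+ with a dynamically linked libcudnn rather than a statically linked one) is to run a single conv and then look at the shared objects mapped into the process:

import torch
import torch.nn as nn

# run one cudnn-backed op so the library is definitely loaded and used
nn.Conv2d(3, 8, 3).cuda()(torch.randn(1, 3, 32, 32).cuda())

# list the libcudnn shared objects currently mapped into this process
with open('/proc/self/maps') as f:
    paths = {line.split()[-1] for line in f if 'libcudnn' in line}
for p in sorted(paths):
    print(p)

If the conda-bundled copy is the one being used, the printed path should point into the active env's lib directory rather than somewhere under /usr/local.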

You’re right, it does appear that pytorch installs its own copy of cudnn via conda.

Running your test script in 2 environments indicates that the different versions co-exist quite nicely.

pytorch version: 0.4.1
cuda available? -> True  - Version: 9.2.148
cudnn enabled?  -> True   - Version: 7104

pytorch version: 0.3.1
cuda available? -> True  - Version: 8.0.61
cudnn enabled?  -> True   - Version: 7005

I'm not sure how to measure the advantage of having cudnn at the moment. But in your case it appears enabled in both environments, so there is nothing to worry about. I don't think it "doesn't mean anything", as you say. It's probably just an independent library that enhances libcuda, and thus may be enabled on its own. You won't be able to take advantage of it, though, w/o cuda enabled (if I understand it correctly).

Yeah, “doesn’t mean anything” was not really what I meant. But it seems not to make sense that its state is kind of “on” while the underlying technology (cuda) isn’t even available… It’s like your GPU shows “ready” while your computer is switched off. :wink:

Feeling adventurous this morning, I actually upgraded my ubuntu nvidia driver to version 396.51 (released yesterday) without major problems.
After doing so, my script above now shows "True" for cuda available in the 9.2 conda env. So that confirms the dependency between the installed driver and the cuda version - and without changing anything in the conda env, updating the driver is enough to make it work.


"NEW" nvidia-headless drivers available

While upgrading my nvidia driver from 390 to 396 in order to be able to use cuda 9.2, I noticed new driver packages that are available from the graphics-drivers ppa on ubuntu. Those are just what we are looking for in this thread, I think :wink:

mpr@x15 >> apt show nvidia-headless-396

Package: nvidia-headless-396
Version: 396.51-0ubuntu0~gpu18.04.1
Priority: optional
Section: libs
Source: nvidia-graphics-drivers-396
Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Installed-Size: 20,5 kB
Depends: nvidia-headless-no-dkms-396, nvidia-dkms-396
Download-Size: 8.492 B
APT-Sources: http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages
Description: NVIDIA headless metapackage
 This metapackage installs the NVIDIA driver and the libraries that enable
 parallel general purpose computation through CUDA and
 OpenCL.
 .
 Install this package if you do not need X11 or Wayland support, which is
 provided by the nvidia-driver-396 metapackage.

That last sentence of the description is what this thread is about, so maybe this is the new simplest way of getting this to work.
I have explicitly NOT tested this myself yet; this is just a heads-up. At least this way you have the option to avoid the xserver driver installation, which was previously only possible via runfiles and was the reason people suggested using those in this thread.


This is just awesome! Finally it’s here! Thank you for this great news, Marc.

Yes, that's a great find. Thanks @marcmuc. I'm curious whether that package also correctly initializes the driver and takes care of nvidia-persistenced. Whoever tries out installing this headless driver, would you mind running nvidia-smi to check which power state the card is in after booting?
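For example something along these lines would do (just a small wrapper around nvidia-smi's query mode; the pstate and power.draw field names are listed in nvidia-smi --help-query-gpu):

import subprocess

# ask nvidia-smi for the performance state and power draw of every gpu
out = subprocess.run(
    ['nvidia-smi',
     '--query-gpu=index,name,pstate,power.draw',
     '--format=csv,noheader'],
    stdout=subprocess.PIPE, universal_newlines=True, check=True)
print(out.stdout)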

I think my gpu is not being utilized properly. When I try to train with the following code from lesson 1 of fastai part 1, training is extremely slow:

arch = resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)

0%| | 1/360 [03:09<18:51:16, 189.07s/it]

Here is the gpu status while training

Please someone help me out with the current situation.

Problem solved:
cuda 9.1 was causing the problem. I uninstalled cuda 9.1 and pytorch from the anaconda environment, then installed cuda 9.0 first and pytorch after that. This solved my problem.

Are you guys messing with your machines directly, or doing this in a container?
Btw, the official nvidia cuda repository already has nvidia-headless-410.

Has anybody tried those yet? In particular, can you run nvidia-settings on those to configure things like coolbits without needing to temporarily set up a real xserver, run it to configure the card, and then shut it down?

Thanks.

This looks interesting: https://lambdalabs.com/lambda-stack-deep-learning-software

What is Lambda Stack?

Lambda Stack provides an easy way to install popular Machine Learning frameworks. With Lambda Stack, you can use apt / aptitude to install TensorFlow, Keras, PyTorch, Caffe, Caffe 2, Theano, CUDA, cuDNN, and NVIDIA GPU drivers. If a new version of any framework is released, Lambda Stack manages the upgrade. You’ll never run into issues with your NVIDIA drivers again.

Lambda Stack is both a system wide package, a Dockerfile, and a Docker image.

No need to compromise. Lambda Stack is not only a system wide installation of all of your favorite frameworks and drivers but also a convenient “everything included” deep learning Docker image. Now you’ll have your team up and running with GPU-accelerated Docker images in minutes instead of weeks. To learn more about how to set up Lambda Stack GPU Dockerfiles check out our tutorial:
https://lambdalabs.com/blog/set-up-a-tensorflow-gpu-docker-container-using-lambda-stack-dockerfile/
…

Hi, I also have a working setup, which seems to differ from the methods shown here, so for completeness here it is (I just posted this as an answer in another thread, then thought this might be worth a topic of its own, only to find it already existed ;-)). This uses no runfiles or compilation but works with everything "out of the box" via apt packages etc., so it is maybe easier for linux-inexperienced folks.

Tested with nvidia-driver-418:
This one seems to work really well for me too…! There is no activity on my NVIDIA GPU for the display, PyTorch variables are able to access it though, and I'm able to track usage using nvidia-smi. Thank you so much!

I tried the headless driver, but it straight up refused to detect the NVIDIA GPU, giving the usual "NVIDIA-SMI can't communicate with the GPU, please install the drivers again" error (paraphrasing). So I switched back to @marcmuc's solution.


Hi @vikky2904,

Try this

Bare Metal

sudo apt install -y --no-install-recommends \
                    nvidia-headless-410 nvidia-utils-410

or

sudo apt install -y --no-install-recommends \
                    nvidia-headless-418 nvidia-utils-418

Containers

sudo apt install -y --no-install-recommends \
                    nvidia-headless-no-dkms-410 nvidia-utils-410

or

sudo apt install -y --no-install-recommends \
                    nvidia-headless-no-dkms-418 nvidia-utils-418

Hi @stas

I tried it before, but the problem is that nvidia-settings and nvidia-xconfig both depend on the Xserver.
But if somebody has an external cooling setup, they can easily just use the headless driver, nvidia-smi and the toolkit.

Thank you for the feedback, @willismar. It appears that trying to move to nvidia-headless is an almost pointless waste of time then, if one doesn't really gain anything - let's hope it's just the beginning and they will eventually come around to having full support for nvidia config w/o requiring an xserver running on it.


Just for the record, even the dgx station I use at work has an x server (with xfce) running on one of the teslas, which shouldn't be the case, since it somewhat hinders that specific card, making it a bit slower than the other three.

I just read the whole thread, and it is quite interesting. It seems you guys made a lot of research, and I think you can help me with my personal setup.

I've got a pc with ubuntu 18.04 and two 1080tis, while the display is driven by a tiny radeon 4550 in an x4 slot.

Now, I was wondering if I can gain manual control of the fans, but setting the coolbits seems to be a tricky job, since 18.04 does not even have an /etc/X11/xorg.conf, and if you use nvidia-xconfig to set the coolbits, it creates such a file but makes a mess of it (you have to ssh in and remove it).

Gaining such control would be really useful for me, since one of my cards regularly hits 89C during long unfrozen training sessions, while its fan scarcely exceeds 65%…

Speaking of temperatures, and since you mentioned power states, etc., look at this:

One of my cards always uses more power than the other (it is the one running hot under load), even at total idle, despite the two cards being identical (FE). Is there any possible explanation for such behaviour?
Note also that both of them are in P8 at idle, while P0 would be desirable.

Thanks a lot. I used the 'this guide' link in the first link you posted at the top of the topic, and now I have all of this installed along with an updated video card driver.

Some of my responses were not quite the same as those posted in the 'this guide' link, which I will add here in due time. Thanks again, these sorts of guides are priceless in the amount of time they save.

I reinstalled my system after I messed it up by using prime-select intel to select the onboard graphics: https://askubuntu.com/questions/980875/isolate-integrated-intel-igpu-from-nvidia-gpu
Switching back to the GPU afterwards resulted in no picture on the monitor.

Then I reinstalled Ubuntu and tried to install CUDA similarly to this guide for TF: https://www.pugetsystems.com/labs/hpc/Install-TensorFlow-with-GPU-Support-the-Easy-Way-on-Ubuntu-18-04-without-installing-CUDA-1170/
(also to avoid using the GPU for the system's monitor). But I only got CUDA false.

Finally, I reinstalled the entire system with Ubuntu and CUDA in the traditional way (reinstall ubuntu, reinstall cuda, reinstall pytorch/fastai): https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
And now I have cuda. :smiley:

Is this the recommended way for the current PyTorch/fastai version?
A working anaconda package would make life really a lot easier… :slight_smile: