Successful ubuntu 18.04 with iGPU for xserver and nvidia GPU for CUDA work setup

stas · August 1, 2018, 9:32pm

FYI, I have just successfully installed a fresh Kubuntu 18.04.1 with CUDA 9.2 in a dual setup - onboard intel GPU for the screen, and Nvidia GPU for ML CUDA work. All using apt packages. You can see all the details here, including a resolution for a hiccup with trying to get xorg to run on iGPU, instead of the Nvidia GPU. https://askubuntu.com/questions/1061551/configuring-igpu-for-xserver-and-nvidia-gpu-for-cuda-work-kubuntu-18-04-cuda-9

Both fastai and fastai_v1 run on it w/o any problems.

I was able to build dev version pytorch-0.5 (for fastai_v1) from source https://github.com/pytorch/pytorch#from-source using this setup and it works:

import torch
torch.version.cuda
torch.__version__

'9.2.148'
'0.5.0a0+59af5b9'

but you can just conda install a prebuilt pytorch-0.4 (that’s if you want to experiment with fastai-v1, otherwise it’s still pytorch=0.3 for the current codebase).

So, it’s probably OK to start migrating to 18.04 if you want to.

JensF · August 6, 2018, 3:19am

I’m also running Ubuntu 18.04 in a similar dual setup with an Intel onboard GPU. I tried a bunch of options to make the Intel gfx card the primary one in order to maximize GPU memory in the Nvidia card. When using the apt-get based driver installation, you will have to blacklist the nvidia-drm driver in /etc/modprobe.d/blacklist-nvidia.conf and also remove the nvidia card from /etc/X11/xorg.conf as explained in the link above. All in all it felt like a very brittle operation to make my Ubuntu use the Intel card.

After lots of experimenting, the best option I found was to install the Nvidia driver from http://us.download.nvidia.com/XFree86/Linux-x86/396.24/README/installdriver.html

The important step is to disable installation of any X Server support:
sudo sh NVIDIA-Linux-x86_64-396.24.run --no-opengl-files --dkms --no-drm

This is fine for me since I don’t want any GUI running on the NVidia card but instead purely use it for Cuda processing.

stas · August 6, 2018, 4:07pm

Thanks for sharing your approach @JensF.

Yes, this is the option I tried last time (installing from the RUN file), and it does seem much cleaner, as you’re saying. What I’m not sure about is what happens when the Linux kernel gets updated - do you need to rebuild the CUDA drivers again? I’m experimenting this time with prebuilt apt packages to see whether it’d be easier, as they would get updated automatically. But otherwise, not using apt packages is definitely a simpler, cleaner setup for dual video cards, with all of the Nvidia card made available for CUDA work. If my current setup proves to be problematic I will revert to yours.

JensF · August 7, 2018, 4:54am

Hi Stas,

I’m using DKMS to handle rebuilding and installing the nvidia kernel module when the kernel is changed (through the --dkms parameter during installation). So far this worked well.
See also here: http://us.download.nvidia.com/XFree86/Linux-x86/396.24/README/installdriver.html#RegisteringTheNda02d
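
A minimal sketch of how one might confirm that DKMS has the nvidia module installed for the running kernel. It assumes that `dkms status` prints lines shaped like `nvidia, 396.24, <kernel>, x86_64: installed` (newer dkms releases use `nvidia/396.24, ...`); the helper names are made up for illustration, so check the output format on your own machine first.

```python
import platform
import subprocess


def nvidia_dkms_kernels(dkms_status_text):
    """Return the kernel releases for which an nvidia DKMS module is installed.

    Assumes lines such as
      "nvidia, 396.24, 4.15.0-30-generic, x86_64: installed"   (older dkms)
      "nvidia/396.24, 4.15.0-30-generic, x86_64: installed"    (newer dkms)
    """
    kernels = set()
    for line in dkms_status_text.splitlines():
        if ":" not in line:
            continue
        head, state = line.split(":", 1)
        # Normalise "nvidia/396.24" to "nvidia, 396.24" before splitting.
        fields = [f.strip() for f in head.replace("/", ",").split(",")]
        if (len(fields) >= 3 and fields[0] == "nvidia"
                and state.strip().startswith("installed")):
            kernels.add(fields[2])
    return kernels


def running_kernel_covered():
    """True/False for the running kernel, or None if dkms is not installed."""
    try:
        out = subprocess.run(["dkms", "status"], stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
    except FileNotFoundError:
        return None
    return platform.release() in nvidia_dkms_kernels(out)
```

Calling `running_kernel_covered()` after a kernel update and reboot should return True if the rebuild happened as described above.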

stas · August 7, 2018, 5:12am

Aha! dkms! That’s the piece of the puzzle that I was missing when reading through gazillions of cuda-failed-help-me threads. Thank you so much, @JensF!

JensF · August 7, 2018, 6:03am

Glad I could help. Here’s one more thing that might be important: In my setup, the nvidia card defaulted to power mode P0 (check with nvidia-smi). According to my research, this mode is not available on my GTX 1080 TI when running in CUDA mode. It should run in mode P2.

Having the card in this power mode had two consequences:

Occasionally, executing any kind of CUDA command (including nvidia-smi) would freeze that process. After that, you could only recover to a useful CUDA mode by rebooting the machine.
Even when idling, the card would draw lots of power.

I saw the same kind of behaviour when using the apt-get based driver installation the moment I blacklisted the nvidia driver. My assumption is that the card is set into a proper power state when using the X server, which doesn’t seem to happen otherwise.

In order to fix this, the card must be brought into the correct power mode P2 using the nvidia-persistenced service. I do this at system startup using this config:

Create file

sudo vi /lib/systemd/system/nvidia-persistenced.service

Paste this content

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

Then activate service:

sudo chmod u+x /lib/systemd/system/nvidia-persistenced.service
sudo systemctl start nvidia-persistenced.service
sudo systemctl enable nvidia-persistenced.service
sudo systemctl status nvidia-persistenced.service

This completely solved the freezing. I can now also see the card changing down to power mode P5 or P8 when idling, being much more energy conserving.
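
A small sketch for checking the power state from a script rather than reading the nvidia-smi table by eye. It relies on nvidia-smi's `--query-gpu` CSV output with the `name`, `pstate` and `power.draw` fields; the function names are illustrative, and the power reading is returned as None when the driver does not report a number.

```python
import subprocess

# Query one CSV row per GPU: name, performance state, power draw in watts.
QUERY = ["nvidia-smi", "--query-gpu=name,pstate,power.draw",
         "--format=csv,noheader,nounits"]


def parse_gpu_states(csv_text):
    """Turn the CSV rows produced by QUERY into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split from the right so the last two fields are always pstate and power.
        name, pstate, power = [f.strip() for f in line.rsplit(",", 2)]
        try:
            watts = float(power)
        except ValueError:  # e.g. "[Not Supported]"
            watts = None
        gpus.append({"name": name, "pstate": pstate, "watts": watts})
    return gpus


def read_gpu_states():
    """Run nvidia-smi and return the parsed state of every GPU it reports."""
    out = subprocess.run(QUERY, stdout=subprocess.PIPE,
                         universal_newlines=True, check=True).stdout
    return parse_gpu_states(out)
```

Running `read_gpu_states()` once after boot and once while idle would show whether the card moves between power states as described above.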

stas · August 8, 2018, 3:09am

Thank you for sharing that, @JensF. I checked my card (1070Ti) and it runs P8 when idle, and P2 when working.

But why did you stop at P8?

The definition of each P-States are currently as follows:

P0/P1 - Maximum 3D performance
P2/P3 - Balanced 3D performance-power
P8 - Basic HD video playback
P10 - DVD playback
P12 - Minimum idle power consumption

so don’t we want P12 on idle? I see my card on idle is in state P8 and consumes 11W, which is wasteful.

JensF · August 8, 2018, 3:39am

The switching between P2, P5, and P8 is done automatically by the card. Running nvidia-persistenced seems to just trigger the built-in power management behaviour. I have never observed it switch down to P12, but would love to see that too. My GTX 1080 Ti consumes 15W when idling in P8.

marcmuc · August 21, 2018, 12:58pm

Hi, I also have a working setup, which seems to differ from the methods shown here, so for completeness here it is (I just posted this as an answer in another thread, then thought this might be worth a topic of its own, only to find it existed already ;-) ). This uses no runfiles or compilation but works with everything “out of the box” (apt packages etc.), so it may be easier for Linux-inexperienced folks.

(Install Ubuntu 18.04)
Install nvidia driver, reboot.
in a terminal use nvidia-smi to check the gpu is recognized correctly, note the memory consumption of up to 400MB just for the xserver GUI.
in the xserver desktop go to nvidia-settings-gui and change the PRIME Profile to “Intel (Power Saving Mode)”
reboot!
in a terminal use nvidia-smi, it should not find a gpu (unless you have multiple…) and give some error message

then edit /etc/modprobe.d/blacklist-nvidia.conf and delete or comment out lines so that only the nvidia-drm entries stay active, like this:

# blacklist nvidia
blacklist nvidia-drm
# blacklist nvidia-modeset
# alias nvidia off
alias nvidia-drm off
# alias nvidia-modeset off

edit/create/add the following to /etc/modprobe.d/blacklist-nvidia-nouveau.conf:

blacklist nouveau
options nouveau modeset=0
(this step should have been done when first installing the gpu drivers or the disabling of the nouveau drivers was done in some different way, but add this just to be sure)
reboot
in a terminal use nvidia-smi, GPU should again be recognized and there should now be NO processes using the GPU and available memory should be the max of the card. At the same time the gui should be running (now powered by the internal Intel Graphics Chip). Yeah!
Note: On some desktops you might have to use a different monitor output (the one from the main board, not the one from the graphics card). On some it still somehow gets routed through…
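
As a sanity check on the modprobe edits in the steps above, here is a minimal sketch that lists which modules a blacklist file still disables. It only understands the two directive forms shown in this thread (`blacklist X` and `alias X off`) and ignores everything after a `#`; the function name is made up for illustration.

```python
def active_blacklist(conf_text):
    """Return module names that are blacklisted or aliased to 'off'.

    Commented-out lines and other directives (e.g. 'options') are ignored.
    """
    modules = set()
    for raw in conf_text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments, then tokenise
        if len(words) == 2 and words[0] == "blacklist":
            modules.add(words[1])
        elif len(words) == 3 and words[0] == "alias" and words[2] == "off":
            modules.add(words[1])
    return modules


def active_blacklist_in_file(path):
    """Convenience wrapper that reads a modprobe.d file from disk."""
    with open(path) as handle:
        return active_blacklist(handle.read())
```

With the edits described above, `active_blacklist_in_file("/etc/modprobe.d/blacklist-nvidia.conf")` would be expected to contain nvidia-drm but neither nvidia nor nvidia-modeset.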

This of course was not “invented” by me but is a summary of numerous SO articles I had to go through to get this working.
Currently works with the supported nvidia-390 (via apt install) driver, have not tested with 396 or higher yet.

marcmuc · August 21, 2018, 1:07pm

The above is my “entire” cuda setup by the way, so I have NOT installed CUDA or CUDNN from nvidia.
I have done this multiple times before, it was always a hassle. With the last fresh reinstall of 18.04 I just did the above, so no installation of cuda or cudnn packages or runfiles etc… Nevertheless, if I install anaconda and the fastai environment (or use pytorch in a different env) both CUDA and CUDNN work fine and are recognized.

After some research on the web I found that it seems to be absolutely not necessary to install the nvidia CUDA or CUDNN stuff on its own (unless you want to do low-level work in C/C++ on CUDA itself), as all the libraries shipped with anaconda seem to bring their own copies of the compiled libs (.so files etc.) anyway. This was kind of a revelation for me, as all the threads and how-tos here always mentioned installing those first manually, and I have spent countless hours getting that to work in different ubuntu settings.
What is your take on that?

stas · August 21, 2018, 3:00pm

Sounds like a totally neat thing if you don’t need to install cuda libs anymore. But are you certain those didn’t just get installed behind the scenes as dependencies? Do you get nothing when you run:

$ apt list --installed | grep cuda

marcmuc · August 21, 2018, 7:10pm

Yes, if I do that or dpkg -l | grep cuda I get an empty list.
What gets installed with every nvidia-driver seems to be some kind of runtime and the binaries /usr/bin/nvidia-cuda-mps-control and nvidia-cuda-mps-server

http://manpages.ubuntu.com/manpages/bionic/en/man1/nvidia-cuda-mps-control.1.html

Those seem to manage the cuda part of the gpu. Then when I use conda to set up an environment, every version of e.g. pytorch comes with different cuda and cudnn packages. I seem to have the cuda toolkits for 8.0 and 9.0 running in different conda envs.
Those all show up if you do locate cuda.

Now there still is a dependency between the installed driver / cuda runtime and the conda packages. So I have just tried to install pytorch 0.4 with cuda 9.2 enabled, but that gives me a False for .cuda.is_available. That seems to be because the minimum driver version (and I assume the bundled cuda runtime) is the nvidia-396 driver for 9.2, so with the nvidia-390 that I have installed I can only use up to cuda 9.1.
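
The driver/CUDA dependency described here can be written down as a small lookup. This is only a sketch: the table holds just the two data points reported in this thread (the 390 driver works up to cuda 9.1, and cuda 9.2 needs the 396 driver), it compares driver branches rather than exact minimum versions, and the names are illustrative. NVIDIA's release notes are the place to look for the exact figures.

```python
# Minimum driver branch per CUDA release, as reported in this thread only.
# Deliberately incomplete; unknown CUDA versions yield None.
MIN_DRIVER_BRANCH = {
    (9, 1): 390,
    (9, 2): 396,
}


def driver_supports(driver_version, cuda_version):
    """Return True/False if the table knows the CUDA version, else None.

    driver_version: e.g. "396.51" or "390"
    cuda_version:   e.g. "9.2.148" (as printed by torch.version.cuda)
    """
    cuda_key = tuple(int(part) for part in cuda_version.split(".")[:2])
    needed = MIN_DRIVER_BRANCH.get(cuda_key)
    if needed is None:
        return None
    return int(driver_version.split(".")[0]) >= needed
```

For the situation above, `driver_supports("390", "9.2.148")` returns False, which matches the False from `torch.cuda.is_available()`.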

I regoogled some stuff, this article is what I think made me even test this out, before I always just assumed I absolutely had to install the cuda and cudnn packages from nvidia (with all the hassle that comes with it)

from this:

The NVIDIA display drivers come with a CUDA runtime library. That’s so you can run CUDA accelerated programs without having CUDA installed on your system. That’s usually just what you want.

Didn’t know that before. But yes, that is just what I want.

stas · August 21, 2018, 8:13pm

This is very good research, Marc. Thank you for sharing that.

Do you get cudnn as well w/o installing it separately? I understand it’s essential for a much faster execution in certain situations. And usually it requires a manual installation.

marcmuc · August 21, 2018, 9:11pm

Well, if someone could explain to me how this cudnn “layer” actually works…
So - I can’t really answer that, because I don’t know how to really test or verify that. I run the following code for checking:

import torch
print('environment: fastai')
print('pytorch version:', torch.__version__)
print('cuda available? ->', torch.cuda.is_available(), ' - Version:',  torch.version.cuda)
print('cudnn enabled?  ->', torch.backends.cudnn.enabled,'  - Version:',  torch.backends.cudnn.version())

Now, for the fastai (not v1) environment, setup after fresh pull it gives me this:

environment: fastai
pytorch version: 0.3.1.post2
cuda available? -> True  - Version: 9.0.176
cudnn enabled?  -> True  - Version: 7005

That would lead me to think cudnn works correctly. But if I do it on my knowingly non-working pytorch cuda9.2 version in a different conda env I get this:

environment: pytorch92
pytorch version: 0.4.1
cuda available? -> False  - Version: 9.2.148
cudnn enabled?  -> True   - Version: 7104

This shows that “cudnn.enabled” doesn’t really mean anything.
So - do you have a proposition of how to check that it really is working? Is there some testsetup that could be run, where it is clear that cudnn would perform better than cuda and this could be switched/activated to show a difference between a pure cuda run vs. a cudnn run of an optimizer/model?
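
One possible way to test this, as a rough sketch rather than a proper benchmark: time the same convolution forward pass with `torch.backends.cudnn.enabled` switched on and then off. The layer and tensor sizes are arbitrary assumptions, the helper names are made up, and the function returns None when pytorch or a CUDA GPU is not available.

```python
import time


def time_call(fn, repeats=10):
    """Mean wall-clock seconds per call of fn, after one untimed warm-up call."""
    fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def compare_cudnn(repeats=10):
    """Return (seconds_with_cudnn, seconds_without_cudnn), or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    # Arbitrary sizes, chosen only so the convolution takes measurable time.
    conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda()
    x = torch.randn(32, 64, 112, 112, device="cuda")

    def forward():
        with torch.no_grad():
            conv(x)
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock

    timings = []
    for enabled in (True, False):
        torch.backends.cudnn.enabled = enabled
        timings.append(time_call(forward, repeats))
    torch.backends.cudnn.enabled = True  # restore the default
    return tuple(timings)
```

If the two timings differ noticeably, cudnn is actually being used when the flag is on; if they are the same, the flag is probably not having any effect in that environment.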

This also shows, that it is absolutely possible to have different versions of cuda running on the same machine (which a lot of SO articles negate). (Caveat: not if you are trying to develop cuda applications, but as long as you run the “finished libraries” like pytorch in different versions…)

If you do locate libcudnn on your machine (at least on mine) this shows that in all the conda envs with pytorch this lib is present as .so etc. (in different versions across different setups/envs). So I would expect this to be working also without installing the cudnn stuff manually from the nvidia-dev account pages…
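
The same search can be done without a locate database. A minimal sketch, assuming the conda environments live under one directory tree; the function name and the example path in the note below are illustrative, not taken from this setup.

```python
import os


def find_cudnn_libs(root):
    """Return sorted paths of libcudnn shared objects found below root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Matches e.g. libcudnn.so, libcudnn.so.7, libcudnn.so.7.x.y
            if name.startswith("libcudnn") and ".so" in name:
                hits.append(os.path.join(dirpath, name))
    return sorted(hits)
```

For example, `find_cudnn_libs(os.path.expanduser("~/anaconda3/envs"))` would list one or more copies per environment if each pytorch install ships its own cudnn (adjust the path to wherever your conda envs are).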

stas · August 21, 2018, 9:22pm

You’re right, it does appear that pytorch installs its own copy of cudnn via conda.

Running your test script in 2 environments indicates that the different versions co-exist quite nicely.

pytorch version: 0.4.1
cuda available? -> True  - Version: 9.2.148
cudnn enabled?  -> True   - Version: 7104

pytorch version: 0.3.1
cuda available? -> True  - Version: 8.0.61
cudnn enabled?  -> True   - Version: 7005

I’m not sure how to measure the advantage of having cudnn at the moment. But in your case it appears enabled in both environments, so it is nothing to worry about. I wouldn’t say it “doesn’t mean anything”, though. It’s probably just an independent library that enhances libcuda, and thus may be enabled on its own. You won’t be able to take advantage of it though w/o cuda enabled (if I understand it correctly).

marcmuc · August 22, 2018, 10:14am

Yeah, “doesn’t mean anything” was not really what I meant. But it seems not to make sense that its state is kind of “on” while the underlying technology (cuda) isn’t even available… It’s like your GPU shows “ready” while your computer is switched off.

Feeling adventurous this morning I actually upgraded my ubuntu nvidia driver version to 396.51 (released yesterday) without major problems.
After doing so, my script above now shows a “True” for cuda-available on the 9.2 conda env. So that shows the dependency between the installed driver and the cuda version: without changing anything in the conda env, updating the driver is enough to make it work.

marcmuc · August 22, 2018, 10:19am

"NEW" nvidia-headless drivers available

While upgrading my nvidia driver from 390 to 396 in order to be able to use cuda 9.2, I noticed that new driver packages are available from the graphics-drivers PPA on ubuntu. Those are just what we are looking for in this thread, I think:

mpr@x15 >> apt show nvidia-headless-396

Package: nvidia-headless-396
Version: 396.51-0ubuntu0~gpu18.04.1
Priority: optional
Section: libs
Source: nvidia-graphics-drivers-396
Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Installed-Size: 20,5 kB
Depends: nvidia-headless-no-dkms-396, nvidia-dkms-396
Download-Size: 8.492 B
APT-Sources: http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic/main amd64 Packages
Description: NVIDIA headless metapackage
 This metapackage installs the NVIDIA driver and the libraries that enable
 parallel general purpose computation through CUDA and
 OpenCL.
 .
 Install this package if you do not need X11 or Wayland support, which is
 provided by the nvidia-driver-396 metapackage.

The last line is what this thread is about, so maybe this is the new simplest form of getting this to work.
Have explicitly NOT tested this myself yet, this just as a heads-up. At least this way you have the option to avoid the xserver driver installation, which was previously only possible via runfiles and the reason people suggested using those in this thread.

stas · August 22, 2018, 5:24pm

This is just awesome! Finally it’s here! Thank you for this great news, Marc.

JensF · August 22, 2018, 7:27pm

Yes, that’s a great find. Thanks @marcmuc. I’m curious if that package also correctly initializes the driver and takes care of nvidia-persistenced. Whoever tries out installing this headless driver, would you mind running nvidia-smi to check for the powerstate the card is in after booting?

xrenon · October 4, 2018, 2:09am

I think my GPU is not being utilized properly. When I try to train with the following code from lesson 1 of fastai part 1, training is far too slow:

arch=resnet34
data=ImageClassifierData.from_paths(PATH,tfms=tfms_from_model(arch,sz))
learn=ConvLearner.pretrained(arch,data,precompute=True)
learn.fit(0.01,3)
0%| | 1/360 [03:09<18:51:16, 189.07s/it]

Here is the gpu status while training

Please someone help me out with the current situation.

Problem solved:
cuda 9.1 was causing the problem. I uninstalled cuda 9.1 and pytorch from the anaconda environment, then installed cuda 9.0 first and pytorch after that. This solved my problem.