Share GPU over multiple VMs


(Vinay Kumar) #1

Hi,
I have an Nvidia Titan V100 GPU and I am trying to share it across multiple VMs.
How should I do it?

I tried with VirtualBox, but nvidia-smi does not detect the GPU inside the VM, even though the installer says the Nvidia drivers were installed correctly.
Also, when I run lspci, my VM does not show the Nvidia GPUs.
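
A quick way to confirm whether a guest can see the card at all is to filter the lspci output for NVIDIA entries. This is only a minimal sketch; the function names are made up for illustration and are not part of any tool.

```shell
# Print only the NVIDIA lines from `lspci`-style output read on stdin.
nvidia_pci_devices() {
    grep -i 'nvidia' || true
}

# Report whether the GPU shows up on the PCI bus.
# Reads `lspci` output on stdin, so it also works on saved output.
gpu_visible() {
    if [ -n "$(nvidia_pci_devices)" ]; then
        echo "visible"
    else
        echo "not visible"
    fi
}

# Usage inside the VM:
#   lspci | gpu_visible
```

If this prints "not visible", the hypervisor is not exposing the device to the guest, and installing drivers inside the guest will not change that.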

How should I proceed?
Any help, please.
Thanks


(Willismar Medeiros) #2

Hi @vinaykumar2491,

Are you using Linux / Ubuntu ?

If yes, the answer is very simple; if not, the answer is that you can’t. Let me know so I can help you out.


(Willismar Medeiros) #3

I forgot to mention that, as far as I know, VirtualBox and VMware cannot do PCI passthrough for this. To achieve GPU sharing, the practical way is to use containers.

Personally I use Linux Containers (LXC) instead of Docker because it runs with practically zero overhead.
Containers let you pass the host’s GPUs through to them. Since I discovered this tool I have not installed anything on my host machine; I just configure a new environment and use the software inside the container.
Here is my basic tutorial for it.

You will need to install the snapd package manager (a newer way to install software on Linux, in which each package ships as a read-only file system that nobody can tamper with). To do so you need some prerequisites.

1 - Prerequisites

NOTE: If you have Ubuntu as the base OS, remove the previously installed LXD so you can install the latest version from snap.

1 - Install the Nvidia drivers properly. At the time of writing this tutorial, nvidia-410 was the latest version.
Note: Do not use the driver from the “NVIDIA-Linux-x86_64-XXX.XXX.run” installer; it probably won’t work.

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt upgrade
sudo apt install libcuda1-410 libxnvctrl0 nvidia-410 nvidia-410-dev nvidia-libopencl1-410 nvidia-opencl-icd-410 nvidia-settings 

2 - We will need nvidia-container-runtime so that containers can use the GPUs available on the host

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install -yq nvidia-container-runtime

3 - Remove any older LXD from your machine (in the case of Ubuntu)
This step is necessary because the LXD packaged even with Ubuntu 18.04 is already old (v3.0.2), while the current version is v3.6.

sudo apt-get remove lxd lxd-client 

4 - Install the snapd package manager (available in many distros)

## On Debian-based distros
sudo apt install snapd
## On Red Hat-based distros (on CentOS/RHEL 7, snapd comes from the EPEL repository)
sudo yum install snapd

5 - Install LXD (from now on everything is the same on all distros):

sudo snap install lxd --channel=stable

6 - Change the refresh timer to the last Friday of the month, so automatic updates do not bother you:

sudo snap set lxd refresh.timer=fri5,07:00-08:10

7 - Give root and your user permission to use containers

echo "$USER:1000000:65536" | sudo tee -a /etc/subuid /etc/subgid
echo "root:1000000:65536" | sudo tee -a /etc/subuid /etc/subgid
sudo usermod --append --groups lxd $USER
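
Because tee -a appends every time it runs, repeating step 7 leaves duplicate lines in /etc/subuid and /etc/subgid. A minimal sketch of a guard that prints only the entries that are still missing; the helper name missing_subid_lines is hypothetical.

```shell
# Print each given entry that is not already present as an exact line in FILE.
missing_subid_lines() {
    local file="$1"
    shift
    local entry
    for entry in "$@"; do
        grep -qxF -- "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry"
    done
}

# Usage (appends nothing when both entries already exist):
#   for f in /etc/subuid /etc/subgid; do
#       missing_subid_lines "$f" "$USER:1000000:65536" "root:1000000:65536" | sudo tee -a "$f"
#   done
```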

Extra:
NOTE: ZFS is widely considered one of the best file systems available. It was originally developed by Sun Microsystems (now part of Oracle) and is open source, but most distros do not ship it pre-installed, so you may need to install it manually. It will be the file system used for the containers. It has features that Ext4 lacks, such as snapshots and self-healing. I suggest using either ZFS or Btrfs.

Use this tutorial if you want to enable ZFS on your distro.

8 - Initialize the server (it will ask a bunch of questions to define your container environment)

lxd init

9 - The questions and answers

Would you like to use LXD clustering? (yes/no) [default=no]: no
Do you want to configure a new storage pool? (yes/no) [default=yes]: yes
Name of the new storage pool [default=default]: default

If ZFS isn’t available, use Btrfs.

Name of the storage backend to use (btrfs, ceph, dir, lvm, zfs) [default=zfs]: zfs
Create a new ZFS pool? (yes/no) [default=yes]: yes
Would you like to use an existing block device? (yes/no) [default=no]: no

For the next question, if you have 100 GB free on your disk or SSD, I recommend entering about 90 GB so the pool never reaches its limit.

Size in GB of the new loop device (1GB minimum) [default=16GB]: 50
Would you like to connect to a MAAS server? (yes/no) [default=no]: no
Would you like to create a new local network bridge? (yes/no) [default=yes]: yes
What should the new bridge be called? [default=lxdbr0]: lxdbr0
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: auto
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: none
Would you like LXD to be available over the network? (yes/no) [default=no]: no
Address to bind LXD to (not including port) [default=all]: all
Port to bind LXD to [default=8443]: 8443
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]: yes
Would you like a YAML “lxd init” preseed to be printed? (yes/no) [default=no]: no
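
The same answers can be supplied non-interactively through the preseed that the last question refers to. The sketch below is an assumption about how those answers translate into preseed YAML; compare it with what your own lxd init prints before relying on it. The function name lxd_preseed is hypothetical.

```shell
# Print a preseed roughly matching the answers above:
# a loop-backed ZFS pool named "default" and a bridge named "lxdbr0".
lxd_preseed() {
    local pool_size="${1:-50GB}"
    cat <<EOF
config: {}
networks:
- name: lxdbr0
  type: bridge
  config:
    ipv4.address: auto
    ipv6.address: none
storage_pools:
- name: default
  driver: zfs
  config:
    size: ${pool_size}
profiles:
- name: default
  devices:
    root:
      path: /
      pool: default
      type: disk
    eth0:
      name: eth0
      nictype: bridged
      parent: lxdbr0
      type: nic
EOF
}

# Usage:
#   lxd_preseed 50GB | lxd init --preseed
```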

10 - You may need to restart your machine once. To check whether the server is online, run:

lxc version

If the server is online and running you will have this output

Client version: 3.6
Server version: 3.6 <-- this line shows that the service started successfully

2 - Launching a Container with GPU

1 - Launch a new container (in step 3 your local user ID will be mapped to the internal “ubuntu” user ID):
I will use the name c1 as the container name from now on.

lxc launch ubuntu:16.04 c1

2 - Stop the container so you can configure it further

lxc stop c1

3 - Map your UID and GID to the UID and GID of the container’s default user (the ubuntu user inside the container).
Later, if you mount your personal folder inside the container, you will not run into permission changes on your files.

printf 'uid %s 1000\ngid %s 1000' "$(id -u)" "$(id -g)" | lxc config set c1 raw.idmap -
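
The raw.idmap value is a two-line string, one line for the uid mapping and one for the gid mapping. A minimal sketch of building it with printf, which expands \n the same way in every shell (bash’s built-in echo prints \n literally unless it is given -e). The helper name build_raw_idmap is hypothetical.

```shell
# Build a raw.idmap value that maps one host uid/gid pair
# onto a single container id (1000 is the "ubuntu" user by default).
build_raw_idmap() {
    local host_uid="$1" host_gid="$2" ct_id="${3:-1000}"
    printf 'uid %s %s\ngid %s %s' "$host_uid" "$ct_id" "$host_gid" "$ct_id"
}

# Usage:
#   build_raw_idmap "$(id -u)" "$(id -g)" | lxc config set c1 raw.idmap -
# Check what was stored:
#   lxc config get c1 raw.idmap
```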

4 - Pass some or all GPUs through to the container:
This command makes your host driver available inside the container, so you don’t need CUDA installed on the host machine; installing just the driver on the host is enough.

lxc config set c1 nvidia.runtime true

This command maps only the specified GPU into the container; if you need all of them, just omit the id.

lxc config device add c1 mygpu gpu id=0
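
To share one GPU between several containers, the same two commands can be repeated for each of them. A minimal sketch, assuming the containers already exist and are stopped; the function name share_gpu and the LXC override variable are hypothetical and only there to allow a dry run.

```shell
# Attach the same host GPU to every listed container.
# Set LXC=echo to print the commands instead of running them.
share_gpu() {
    local gpu_id="$1"
    shift
    local c
    for c in "$@"; do
        "${LXC:-lxc}" config set "$c" nvidia.runtime true
        "${LXC:-lxc}" config device add "$c" mygpu gpu id="$gpu_id"
    done
}

# Usage:
#   share_gpu 0 c1 c2
# Dry run:
#   LXC=echo share_gpu 0 c1 c2
```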

5 - Start the container again:

lxc start c1

6 - Set a password for the default user in the container:

lxc exec c1 -- bash -c 'passwd ubuntu'

7 - Test if your GPU is working inside the container already:

lxc exec c1 -- bash -c 'nvidia-smi'
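
If you only want to know how many GPUs the container sees, the one-line-per-GPU listing from nvidia-smi -L is easier to script against than the full table. A minimal sketch; count_gpus is a hypothetical helper name.

```shell
# Count GPUs in `nvidia-smi -L` output read on stdin.
# Each GPU is listed on a line that starts with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9]' || true
}

# Usage:
#   lxc exec c1 -- nvidia-smi -L | count_gpus
```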

8 - Go to the console of your container (you may need to hit ENTER twice).

Log in as the ubuntu user with the password you created previously:

lxc console c1

NOTE: LXD has many commands that you will need to familiarize yourself with.

lxc --help
Description:
  Command line client for LXD

  All of LXD's features can be driven through the various commands below.
  For help with any of those, simply call them with --help.

Usage:
  lxc [command]

Available Commands:
  alias       Manage command aliases
  cluster     Manage cluster members
  config      Manage container and server configuration options
  console     Attach to container consoles
  copy        Copy containers within or in between LXD instances
  delete      Delete containers and snapshots
  exec        Execute commands in containers
  export      Export container backups
  file        Manage files in containers
  help        Help about any command
  image       Manage images
  import      Import container backups
  info        Show container or server information
  launch      Create and start containers from images
  list        List containers
  move        Move containers within or in between LXD instances
  network     Manage and attach containers to networks
  operation   List, show and delete background operations
  profile     Manage profiles
  project     Manage projects
  publish     Publish containers as images
  remote      Manage the list of remote servers
  rename      Rename containers and snapshots
  restart     Restart containers
  restore     Restore containers from snapshots
  snapshot    Create container snapshots
  start       Start containers
  stop        Stop containers
  storage     Manage storage pools and volumes
  version     Show local and remote versions

Flags:
      --all           Show less common commands
      --debug         Show all debug messages
      --force-local   Force using the local unix socket
  -h, --help          Print help
  -q, --quiet         Don't show progress information
  -v, --verbose       Show all information messages
      --version       Print version number

Use "lxc [command] --help" for more information about a command.


(Vinay Kumar) #4

Currently I am trying the following combinations:

  1. Setup-1:
    Base OS: CentOS 7
    Using KVM for virtualization
    OS on VMs: Ubuntu 16.04

  2. Setup-2:
    Base OS: CentOS 7
    Using VirtualBox for virtualization
    OS on VMs: Ubuntu 16.04

Neither Setup-1 nor Setup-2 works. I am not able to access the Nvidia GPU.

Thanks.


(Willismar Medeiros) #5

KVM can have access to your GPU if your machine has VT-x and VT-d virtualization technology, but the problem is that you need to configure a bunch of things to set up PCI passthrough and map the PCI addresses to the virtual machine…
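
Before going down the KVM route, it is worth checking both requirements on the host. A minimal sketch, assuming a Linux host: the vmx CPU flag indicates Intel VT-x (svm is the AMD equivalent), and a populated /sys/kernel/iommu_groups indicates that the IOMMU (VT-d on Intel) is enabled. The function names are hypothetical.

```shell
# Read /proc/cpuinfo-style text on stdin and print which
# hardware virtualization extension the CPU advertises.
cpu_virt_support() {
    local flags
    flags=$(grep -m1 -E '^flags' || true)
    case " $flags " in
        *" vmx "*) echo "vt-x" ;;
        *" svm "*) echo "amd-v" ;;
        *) echo "none" ;;
    esac
}

# Count IOMMU groups; 0 means the IOMMU is disabled or unsupported.
iommu_group_count() {
    local dir="${1:-/sys/kernel/iommu_groups}"
    if [ -d "$dir" ]; then
        find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Usage on the host:
#   cpu_virt_support < /proc/cpuinfo
#   iommu_group_count
```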


(Willismar Medeiros) #6

I just saw that you have CentOS as your base system, so you can try following this URL from the author of LXD on how to set up LXD on CentOS; then you can adapt my tutorial to your needs.


(Vinay Kumar) #7

In the future I am planning to move my base OS to Ubuntu 16.04, if that helps.


(Willismar Medeiros) #8

It helps, because the process is very easy on Ubuntu: snapcraft already has the pre-compiled package you need to launch your containers. And Ubuntu 18.04 already comes with LXD installed :smiley:

On CentOS you may need to compile your own LXD server. I did it before because I was using Scientific Linux (a Red Hat-based distro) and I wanted to create a container of that Linux, so I had to do it manually.

It’s nothing hard, but Ubuntu is much less work. Working with LXD is like having a cloud in-house.

Here are my containers running at the moment:

+--------+---------+----------------------+------+------------+-----------+
|  NAME  |  STATE  |         IPV4         | IPV6 |    TYPE    | SNAPSHOTS |
+--------+---------+----------------------+------+------------+-----------+
| cuda10 | RUNNING | 192.168.1.149 (eth0) |      | PERSISTENT |           |
+--------+---------+----------------------+------+------------+-----------+
| fastai | RUNNING | 192.168.1.141 (eth0) |      | PERSISTENT |           |
+--------+---------+----------------------+------+------------+-----------+
| sl75   | STOPPED |                      |      | PERSISTENT |           |
+--------+---------+----------------------+------+------------+-----------+

Also, with this LXD approach you don’t even need to install the driver in the container, because LXD uses an Nvidia library (nvidia-container-runtime) to pass the host driver through to the container.


(Vinay Kumar) #9

Is this in reference to the base OS or the OS installed in the VM?


(Willismar Medeiros) #10

Hi,

I looked at the KVM virtualization process again and found that it has changed a little over the last few years. I think this is more suitable for you. Try following it.

https://www.server-world.info/en/note?os=Ubuntu_18.04&p=kvm&f=11

Good Luck


(Willismar Medeiros) #11

Also let me know if your use case involves GUI or just terminal will suffice.


(Josh Lee) #12

I would recommend using Docker and the nvidia-docker runtime. I’m running CentOS 7 with 4 GPUs and am able to detect each one from TensorFlow in multiple Docker containers. Note that the containers did not need to have the driver installed.

I’m using the 390.87 driver, CUDA 9.0, cuDNN 7.2 and TensorFlow 1.10.
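
A minimal sketch of a smoke test for this kind of setup, assuming the nvidia runtime is registered with Docker. The image name in the usage line is an assumption (any CUDA-enabled image you already have will do), and gpu_smoke_test and the DOCKER override are hypothetical names.

```shell
# Run nvidia-smi in a throwaway container that sees only the listed GPUs.
# NVIDIA_VISIBLE_DEVICES takes "all" or a comma-separated list of GPU indices.
# Set DOCKER=echo to print the arguments instead of running docker.
gpu_smoke_test() {
    local image="$1" devices="${2:-all}"
    "${DOCKER:-docker}" run --rm --runtime=nvidia \
        -e NVIDIA_VISIBLE_DEVICES="$devices" \
        "$image" nvidia-smi
}

# Usage:
#   gpu_smoke_test nvidia/cuda:9.0-base 0
```

Running it from two terminals at once with the same GPU index is a simple way to confirm that several containers can see the same card.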

Hope this helps.


(Willismar Medeiros) #13

Hi @vinaykumar2491

Let me know if you run into any problems. I have used this method for more than a year and I believe you will like it too.
I also reviewed the text and made some changes.

http://forums.fast.ai/t/share-gpu-over-multiple-vms/24198/3


(Willismar Medeiros) #14

Hi @VanBantam

Yes, Docker and LXC have the same property here: both use nvidia-container-runtime to make the GPUs available to any container.

Docker has its own build language and is a container for virtualizing single processes or applications, or for building them. LXC is a container that virtualizes whole environments (like a virtual machine without its own kernel): it uses your host kernel as its kernel, which is how it achieves practically zero overhead.


(Josh Lee) #15

@willismar Perfect! Adding learning about LXC to the queue.


(Vinay Kumar) #16

Hi @willismar, when I do the step above, my container does not start again. I’m getting the following error:

vinay@vinay-GPU:~$ lxc start lxcMEENET 
Error: Failed to run: /snap/lxd/current/bin/lxd forkstart lxcMEENET /var/snap/lxd/common/lxd/containers /var/snap/lxd/common/lxd/logs/lxcMEENET/lxc.conf: 
Try `lxc info --show-log lxcMEENET` for more info
vinay@vinay-GPU:~$ lxc info lxcMEENET --show-log
Name: lxcMEENET
Location: none
Remote: unix://
Architecture: x86_64
Created: 2018/10/16 06:56 UTC
Status: Stopped
Type: persistent
Profiles: default

Log:

lxc lxcMEENET 20181016070804.443 WARN     conf - conf.c:lxc_map_ids:2917 - newuidmap binary is missing
lxc lxcMEENET 20181016070804.444 WARN     conf - conf.c:lxc_map_ids:2923 - newgidmap binary is missing
lxc lxcMEENET 20181016070804.563 WARN     conf - conf.c:lxc_map_ids:2917 - newuidmap binary is missing
lxc lxcMEENET 20181016070804.563 WARN     conf - conf.c:lxc_map_ids:2923 - newgidmap binary is missing
lxc lxcMEENET 20181016070804.962 ERROR    conf - conf.c:run_buffer:353 - Script exited with status 1
lxc lxcMEENET 20181016070804.962 ERROR    conf - conf.c:lxc_setup:3601 - Failed to run mount hooks
lxc lxcMEENET 20181016070804.962 ERROR    start - start.c:do_start:1234 - Failed to setup container "lxcMEENET"
lxc lxcMEENET 20181016070804.962 ERROR    sync - sync.c:__sync_wait:59 - An error occurred in another process (expected sequence number 5)
lxc lxcMEENET 20181016070805.342 ERROR    start - start.c:__lxc_start:1910 - Failed to spawn container "lxcMEENET"
lxc lxcMEENET 20181016070805.342 ERROR    lxccontainer - lxccontainer.c:wait_on_daemonized_start:840 - Received container state "ABORTING" instead of "RUNNING"
lxc lxcMEENET 20181016070805.348 WARN     conf - conf.c:lxc_map_ids:2917 - newuidmap binary is missing
lxc lxcMEENET 20181016070805.349 WARN     conf - conf.c:lxc_map_ids:2923 - newgidmap binary is missing
lxc 20181016070805.375 WARN     commands - commands.c:lxc_cmd_rsp_recv:130 - Connection reset by peer - Failed to receive response for command "get_state"

I was able to start the container before, but after I mapped the uid and gid as suggested in Step-3 and Step-4, I can’t start it any more.
What should I do? Please help.
Thanks


(Willismar Medeiros) #17

I will contact you by private message…