Share GPU over multiple VMs

willismar · October 4, 2018, 3:02pm

I forgot to mention that VirtualBox or VMWare cannot do PCI Passthrough. To archive that your only way is to use containers.

Personally I use Linux Container (LXC) instead of Docker because it achieve zero latency.
Container enables you to pass the GPUs to the container. Since I discovered the toy I never installed anything on my host machine, I just configure a new environment and use the software inside the container.
Here is my basic tutorial for it.

You will need to install snapd package manager (a new way to install software in linux that nobody can temper it’s file system) To do so you need some pre-requisites.

1 - Prerequisites

NOTE: If you have Ubuntu as base OS, remove the previous LXD so you can install the lastest version from snap

1-Install the Nvidia Drivers properly: In the moment of this tutorial the nvidia-410 was the latest version.
Note: Do not use the driver from the “NVIDIA-Linux-x86_64-XXX.XXX.run” installer, it probably won’t work

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt upgrade
sudo apt install libcuda1-410 libxnvctrl0 nvidia-410 nvidia-410-dev nvidia-libopencl1-410 nvidia-opencl-icd-410 nvidia-settings

2 -We will need nvidia-container-runtime so the HOST can communicate with the GPUs available

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install -yq nvidia-container-runtime

3 - Remove any older LXD from your machine (in case of Ubuntu)
This step is necessary because even in Ubuntu 18.04 is very old already (v3.0.2) and the actual is (v3.6)

sudo apt-get remove lxd lxd-client

4 - Install Snapd package manager (available in many distros, etc)

## In case of Debian distros
sudo apt install snapd

## In case of RedHat distros
sudo yum install snapd

5 - Install LXD: (from now on everything is equal in all distros)

sudo snap install lxd --channel=stable

6 - Change the REFRESH TIME to last Friday of the month to not bother you with automatic updates:

sudo snap set lxd refresh.timer=fri5,07:00-08:10

7 - Give permission to root and your user to use the container

echo "$USER:1000000:65536" | sudo tee -a /etc/subuid /etc/subgid
echo "root:1000000:65536" | sudo tee -a /etc/subuid /etc/subgid
sudo usermod --append --groups lxd $USER

Extra:
NOTE: ZFS is considered the best file system until now. It’s opensource nowadays from initiative of Sun Microsystems and Oracle and cannot come pre-installed in any system so you may need to install it manually on your system. It will be the file system used inside the container. It has many features that the current Ext4 does not have, like snapshots, self healing. I suggest or use zfs or btrfs

Use this tutorial if you want enable it ZFS on Distros

8 - Start the server (will ask a bunch of questions to define your container environment)

lxd init

9 - The questions and answers

Would you like to use LXD clustering? (yes/no) [default=no]: no
Do you want to configure a new storage pool? (yes/no) [default=yes]: yes
Name of the new storage pool [default=default]: default

if ZFS isn’t available use BTRFS

Name of the storage backend to use (btrfs, ceph, dir, lvm, zfs) [default=zfs]: zfs
Create a new ZFS pool? (yes/no) [default=yes]: yes
Would you like to use an existing block device? (yes/no) [default=no]: no

To the next question if you have 100Gb free on your DISK or SSD I recommend to put the 90Gb so you never reach the limit

Size in GB of the new loop device (1GB minimum) [default=16GB]: 50
Would you like to connect to a MAAS server? (yes/no) [default=no]: no
Would you like to create a new local network bridge? (yes/no) [default=yes]: yes
What should the new bridge be called? [default=lxdbr0]: yes
What IPv4 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: auto
What IPv6 address should be used? (CIDR subnet notation, “auto” or “none”) [default=auto]: none
Would you like LXD to be available over the network? (yes/no) [default=no]: no
Address to bind LXD to (not including port) [default=all]: all
Port to bind LXD to [default=8443]: 8443
Would you like stale cached images to be updated automatically? (yes/no) [default=yes]: yes
Would you like a YAML “lxd init” preseed to be printed? (yes/no) [default=no]: no

10 - May be it’s necessary to restart your machine once. To check if the server is online you can call:

lxc version

If the server is online and running you will have this output

Client version: 3.6
Server version: 3.6 ← this answer show the service was started successful

2 - Launching a Container with GPU

1 - Launching a new container, and map your local user ID with the intenal “ubuntu” user ID:
I will use the name c1 as the container name from now on

lxc launch ubuntu:16.04 c1

2 - Stop the container to do more configurations

lxc stop c1

3 - Map your UID and GID to the UID and GID of the default user of the container (user ubuntu inside the container)
Later if you map your personal folder inside the container will not get troubles with changing permissions on your files.

echo "uid $(id -u) 1000\ngid $(id -g) 1000" | lxc config set c1 raw.idmap -

4 - Map some or all GPU(s) inside the Container and pass:
This command enable your HOST driver be available inside the container, so you don’t even need to have cuda installed on your Host machine , you can just install the driver on the host

lxc config set c1 nvidia.runtime true

this command maps only the specific GPU to the container, if you need both just remove the Id

lxc config device add c1 mygpu gpu id=0

5 - Start the container again:

lxc start c1

6 - Configure your password to the default user on the container:

lxc exec c1 -- bash -c 'passwd ubuntu'

7 - Test if your GPU is working inside the container already:

lxc exec c1 -- bash -c 'nvidia-smi'

8 - Go to the console of your container (you may need hit ENTER twice).

Use ubuntu user and your password created previously,

lxc console c1

NOTE: LXD has many commands that you will need to familiarize with.

lxc --help

Description:
  Command line client for LXD

  All of LXD's features can be driven through the various commands below.
  For help with any of those, simply call them with --help.

Usage:
  lxc [command]

Available Commands:
  alias       Manage command aliases
  cluster     Manage cluster members
  config      Manage container and server configuration options
  console     Attach to container consoles
  copy        Copy containers within or in between LXD instances
  delete      Delete containers and snapshots
  exec        Execute commands in containers
  export      Export container backups
  file        Manage files in containers
  help        Help about any command
  image       Manage images
  import      Import container backups
  info        Show container or server information
  launch      Create and start containers from images
  list        List containers
  move        Move containers within or in between LXD instances
  network     Manage and attach containers to networks
  operation   List, show and delete background operations
  profile     Manage profiles
  project     Manage projects
  publish     Publish containers as images
  remote      Manage the list of remote servers
  rename      Rename containers and snapshots
  restart     Restart containers
  restore     Restore containers from snapshots
  snapshot    Create container snapshots
  start       Start containers
  stop        Stop containers
  storage     Manage storage pools and volumes
  version     Show local and remote versions

Flags:
      --all           Show less common commands
      --debug         Show all debug messages
      --force-local   Force using the local unix socket
  -h, --help          Print help
  -q, --quiet         Don't show progress information
  -v, --verbose       Show all information messages
      --version       Print version number

Use "lxc [command] --help" for more information about a command.