I am experimenting with running all my deep learning stacks inside Docker containers. I need CUDA and other libraries for TensorFlow, PyTorch, and a stereo camera, and all three have strict restrictions on CUDA versions. In addition, Docker will let me run all three and transfer the same development environment to AWS or another system.
I am using Fedora 26 on a Lenovo Legion Y520, which has an Nvidia 1050Ti GPU with 4GB of memory. This approach only requires installing the Nvidia drivers; no separate CUDA installation is needed, as the NGC container images ship with CUDA 8 and the nvidia runtime exposes the driver to them.
For anyone interested, here are the steps I followed:
I disabled the default Nouveau drivers and installed the latest Nvidia drivers (v384.90) by following this guide.
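For reference, disabling Nouveau on Fedora typically means blacklisting the kernel module and rebuilding the initramfs before installing the proprietary driver. A rough sketch (the exact steps and driver install method may differ from the guide I followed, so treat this as an outline):

```shell
# Blacklist the Nouveau kernel module (requires root)
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF

# Rebuild the initramfs so the blacklist takes effect at boot
sudo dracut --force

# Reboot, then install the Nvidia driver
# (e.g. from RPM Fusion, or Nvidia's .run installer)
```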
Installed Nvidia Docker v2.0. It makes the GPU and the Nvidia driver libraries available to any Docker image run with --runtime=nvidia (see step 7 for more details).
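On an RPM-based distro, installation roughly follows the nvidia-docker project's documented pattern; a sketch, assuming the repo URL layout from their README (check the project page for your exact distribution, since Fedora may need the repo for a close RHEL/CentOS release):

```shell
# Add the nvidia-docker package repository
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo \
  | sudo tee /etc/yum.repos.d/nvidia-docker.repo

# Install the runtime and restart the Docker daemon
sudo dnf install -y nvidia-docker2
sudo systemctl restart docker

# Smoke test: the container should see the GPU
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```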
Signed up for Nvidia GPU Cloud (NGC) account. Signing up is free and required for downloading the PyTorch image (or any of the many other images available from NGC).
Used docker login nvcr.io to log in to the nvcr.io registry. The username is “$oauthtoken” (without the quotes) and the password is the API key generated in step 5.
Pulled the PyTorch Docker image from nvcr.io using docker pull nvcr.io/nvidia/pytorch:17.10.
Ran a container from the image with docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v $(pwd):/workspace --rm -it nvcr.io/nvidia/pytorch:17.10 (more details here.)
Confirmed GPU availability inside the container by running nvidia-smi, then tested CUDA capability by running the MNIST example in /workspace/examples/mnist.
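Beyond nvidia-smi, a quick way to confirm that PyTorch itself can reach the GPU from inside the container is a one-liner (torch is assumed importable since it ships with the NGC image):

```shell
# Is the driver/GPU visible inside the container?
nvidia-smi

# Can PyTorch use it? Should report True and the device name
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"
```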
Nvidia’s cloud GPU runs atop AWS. Personally, I’m not a big fan of software built by traditional hardware companies. I have an AWS P2 instance and would use it or my laptop for the assignments. However, I signed up for NGC just to get their prebuilt Docker image, which I can run on any platform that has the Nvidia GPU drivers installed.
I followed the above instructions, the last thing I ran was docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -v $(pwd):/workspace --rm -it nvcr.io/nvidia/pytorch:17.10
So I’m inside the container now
Okay, so you can install anything additionally needed, just as on a regular machine (I don’t recall exactly what this image is missing for fastai).
But to make changes persistent:
docker ps -a to find the ID of the container you were working in
docker commit <container-id> <new-image-name:tag> to save the container’s current state as a new image
Docker is a more advanced approach, and there are always issues with Docker details and how it interacts with CUDA. So unless you’re prepared to spend some time debugging and are interested in learning about this in particular, I’d suggest avoiding Docker for deep learning.
Here is where I might need help. Where is that image stored? I would like to add the fast.ai library to it when it is built. However, I can’t seem to find it on my computer.
Method B is faster, however I prefer method A because what’s going on in the image becomes transparent and easier to understand. If you’d like to try method A, I suggest reading some of @hamelsmu’s Docker tutorial and then checking out my Dockerfile for fast.ai and the accompanying README for reference as you write your own.
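To give a feel for method A, the idea is a small Dockerfile layered on top of the NGC image. This is a minimal hypothetical sketch, not the real Dockerfile linked above; the clone-and-path steps are illustrative, and the actual fastai install steps are in the linked tutorial:

```dockerfile
# Build on top of the NGC PyTorch image pulled earlier
FROM nvcr.io/nvidia/pytorch:17.10

# Hypothetical example step: clone fast.ai and put it on the Python path;
# adjust the install to the repo's actual instructions
RUN git clone https://github.com/fastai/fastai /fastai
ENV PYTHONPATH=/fastai

WORKDIR /workspace
```

You would then build it with docker build -t my-fastai . and run it with the same --runtime=nvidia flags as before.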
I am not sure you have to set this ENV variable. I believe all devices are visible in the container by default; at least, that is what nvidia-smi is showing me.