Fastai on AMD GPUs - Working Dockerfile

Hey all,

Today I set out to get fast.ai working on my AMD GPU at home. The result is a Docker image that anyone can use to replicate my setup: it provides fastai and a Jupyter notebook with support for AMD GPU acceleration.


https://cloud.docker.com/repository/docker/briangorman/fastai_rocm

In theory, this setup should work on any modern Linux kernel with an appropriate CPU/GPU combo. No additional work is needed beyond running this Docker image.
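To run it, something along these lines should work. The device flags are the standard ones from the ROCm Docker documentation; the port mapping assumes Jupyter's default 8888, so adjust as needed:

```bash
# Minimal sketch, not necessarily the image's documented command: ROCm
# needs the host's KFD and DRI device nodes passed into the container,
# and the 8888 port mapping assumes Jupyter's default port.
docker run -it \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --security-opt seccomp=unconfined \
  -p 8888:8888 \
  briangorman/fastai_rocm
```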

For example, I used an Intel Haswell CPU and a Radeon RX 480 on Antergos Linux. So far I haven’t run into any issues, but I haven’t run all of the notebooks yet. If you are not usually a Linux user and you want to give this a whirl, I would recommend Antergos or Fedora, since they both usually ship recent kernels.
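If you want to sanity-check that the container actually sees the GPU before running the notebooks, something like this should do it (rocm-smi ships with ROCm, and the ROCm build of PyTorch reports the device through the torch.cuda API):

```bash
# Inside the running container: list the GPU as ROCm sees it.
rocm-smi

# The ROCm build of PyTorch exposes the HIP device through the
# torch.cuda API, so this should print True when acceleration works.
python -c "import torch; print(torch.cuda.is_available())"
```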

I hope this is helpful to someone; let me know if you run into any problems with this setup.

Cheers!


@bgorman Thanks for sharing this. Might give it a spin on my AMD GPU too.
How does the performance you see compare to an NVIDIA consumer GPU?

Thanks from my side as well. I got my RX 570 8GB working with fast.ai on Ubuntu MATE with the Dockerfile. Additionally, I had to make the changes described here and here for it to work. Strangely, my CPU (Ryzen 2700X) seems to be the bottleneck at the moment, so maybe I missed something.
But the GPU is speeding things up measurably compared to CPU-only.

Has anyone had luck getting this working recently? There seem to be issues for some people where the host ROCm software for ROCm 2.10 and 3.0/3.1 fails, segfaults, or just doesn’t work on newer kernels (e.g. 5.0 and 5.3 LTS), and rock-dkms isn’t supported on these newer kernels.
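For debugging on my end, I’ve been narrowing down whether the kernel driver is the problem with generic host-side checks like these (nothing ROCm-version-specific):

```bash
# Which kernel the host is actually running.
uname -r

# Whether the rock-dkms module built against that kernel (if installed).
dkms status

# amdgpu/kfd initialisation errors usually show up in the kernel log.
dmesg | grep -iE 'kfd|amdgpu'
```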

I managed to get it working, but I had to change the Dockerfile a little, and I created a docker-compose file for easier deployment.

Check it out: https://github.com/perinm/deep-learning-for-coders
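Usage should be roughly the following, though the README in the repo has the exact steps:

```bash
# Rough sketch; see the repo's README for the exact steps.
git clone https://github.com/perinm/deep-learning-for-coders
cd deep-learning-for-coders
docker-compose up
```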

I don’t recommend it, though; Gradient’s free 6-hour P5000 instances are 3-4x faster than my RX 580.

Thanks Lucas. In the end I switched over to an NVIDIA card just recently. I had almost six months invested in this challenge, but I’m mainly just getting started. I wanted to learn a bit about deep learning without using an online service, so nothing serious at this point. I was trying to get an upper-end budget AMD PC working with reasonable performance on the lessons, but it wasn’t to be.

Many of the issues I encountered while initially trying to get this working on the RX 580 were actually low-level firmware issues in my case.

ASUS didn’t correctly set up compliant ACPI tables in the BIOS/firmware during the boot process. They seemed to use a kludge to get Ryzen CPUs working in the first place. There were also other issues at this level which could brick the board when toggling preset values they had set. As a result, ROCm would not function correctly before even getting to the Docker stage.

Contacting ASUS was futile: they wouldn’t admit the defect, but they offered to replace the board via RMA, and that process is lengthy in pandemic times.

For those who may be running into similar issues, or who run across this later: the critical issues with the mainboard seemed to stem from PCIe lanes not being directly connected to the CPU. The only check for this that’s visible when purchasing a board seemed to be whether the motherboard supports SLI/Crossfire, which as far as I’m aware requires directly connected lanes.

The board/CPU may support the PCIe atomics needed by ROCm, but there may still be firmware-level challenges to overcome. As far as I’m aware, no one with an SLI/Crossfire-compatible mainboard has had these issues.
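If you want to check for this before buying or debugging further, the PCIe AtomicOps capability is visible from Linux. A rough sketch (recent lspci versions decode the field, older ones may not):

```bash
# Look for AtomicOpsCap under DevCap2 on the GPU and on the root port
# it hangs off; recent lspci versions decode it, older ones may not.
sudo lspci -vvv | grep -i atomicops

# The amdkfd driver logs when it skips a GPU because the PCIe path
# rejects atomics, so the kernel log is worth checking too.
dmesg | grep -i atomics
```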

The combo I worked with was an ASUS Prime B450-Plus, which supported PCIe atomics with the Ryzen 5 chip but would not enumerate the devices properly when an APU-capable CPU was present alongside a discrete GPU.

I read somewhere that the problematic boards were translating PCIe lanes, but I’m not enough of a wizard to dig that deep into the problem and confirm it at this stage.

I had a crack at this with my AMD Radeon RX 5700 XT (Navi 10). It turned out that, as of now, ROCm is not supported on Navi 10 cards, but I only figured this out after actually getting the Docker container running on Ubuntu. I might have better luck with DirectML on Windows. Anyway, all I had to do was un-pin the Python version to get the Python container running and increase its memory to 8 GB. Here is the Dockerfile which would potentially work with a ROCm-capable card: use latest python 3 · perinm/deep-learning-for-coders@6faed53 · GitHub