How to install RTX enabled fastai? (CUDA10)

Good to know… you can try to just update your driver then!
This will install “only driver 410.xx”

using the PPA and not touching anything else

sudo add-apt-repository ppa:graphics-drivers/ppa -y
sudo apt update -qq
sudo apt install -y --no-install-recommends cuda-drivers

Should I replace cuda-drivers with a specific version? Because the last command returns:

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package cuda-drivers

And, I already have the repository added.


Hi, use this instead

sudo apt install --no-install-recommends \
      libcuda1-410 \
      libxnvctrl0 \
      nvidia-410 \
      nvidia-410-dev \
      nvidia-libopencl1-410 \
      nvidia-opencl-icd-410

Just in case any here needs a fully CUDA 10 based PyTorch nightly build (including magma-cuda10) for Python 3.7 on Ubuntu 18.04, see here:

(In that post, I obviously use the new fastai documentation to test the fp16 callback. :slight_smile:)

I built this pytorch package because I have my eye on an RTX 2070.


Thanks for sharing @cpbotha

Which magma version is it, 2.4.0 or 2.4.1?

@willismar Thank you for your advice, yes, I just updated the driver using:

sudo apt install nvidia-driver-410

After that I have two recognized devices:

| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 23%   32C    P8    12W / 250W |    110MiB / 11177MiB |      0%      Default |
|   1  GeForce RTX 2080    Off  | 00000000:02:00.0 Off |                  N/A |
| 41%   29C    P8    17W / 225W |      0MiB /  7952MiB |      0%      Default |

I was able to run fp16 training. However, I need more tests to see if everything works as expected. I didn’t try to install a new version of PyTorch or compile it from source, so I guess it still uses CUDA 9.2.
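If you want to confirm that guess, the quickest check is to ask PyTorch itself which CUDA toolkit its wheel was built against (this is independent of the "CUDA Version" that nvidia-smi reports for the driver). A minimal sketch; `cuda_build_version` is just a hypothetical helper name:

```python
# Report which CUDA toolkit the installed PyTorch wheel was built against.
# Returns None if PyTorch is missing or was built without CUDA support.
def cuda_build_version():
    try:
        import torch
        return torch.version.cuda  # e.g. "9.2" or "10.0"
    except ImportError:
        return None

print(cuda_build_version())
```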

@cpbotha Yes, I was looking for something like this. I guess I’ll try this approach soon.

By the way, it seems there is a little bug with fastai.show_install:

=== Hardware ===
nvidia gpus     : 2
torch available : 2
  - gpu0        : 11177MB | GeForce RTX 2080
  - gpu1        : 7952MB | GeForce GTX 1080 Ti

The devices names don’t match with their memory sizes. In my system, gpu1 is 1080Ti.
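I'm not certain this is the cause here, but one common source of swapped GPU indices is that nvidia-smi enumerates devices in PCI bus order while the CUDA runtime defaults to fastest-first ordering. Forcing PCI order (before torch is imported) makes the two agree; a sketch:

```python
import os

# nvidia-smi lists GPUs in PCI bus order, but the CUDA runtime defaults to
# CUDA_DEVICE_ORDER=FASTEST_FIRST, so gpu0/gpu1 can appear swapped between
# the two tools. Set this BEFORE importing torch to force matching indices.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
```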


I used magma 2.3.0, because that was the last version used by the build scripts.

Does 2.4.x have improvements which justify a rebuild of PyTorch? (it takes a few hours in total on an i7 with SSD)

Hi again @devforfu

I believe you can file this as a bug so they can fix it. But if nvidia-smi reports everything correctly, you should be fine.

Btw, you may have fastai working, but be aware that no CUDA 9.x or lower exists for Ubuntu 18.04; only CUDA 10 is available. As Jeremy told us, with fastai v1.0 you can do whatever you want.

The other day I compared the source code of magma 2.3.0 and 2.4.0 in the context of the patch files, and I found that almost all of the patches from the pytorch build project for magma are already in the new version.

The only patches one may still need to apply are cmakelists.patch and thread_queue.patch for magma 2.4.0.

I can show you if you need them.

@cpbotha Thanks a lot for your blog post, I’ll read it in more detail tomorrow.

You don’t have access to the V3 section of the forum, I understand; otherwise you’d see a post on “RTX series + Fastai”, where some of us (especially me!) are struggling to get mixed precision to run properly on either a 2070 or a 2080 Ti.

The most annoying/surprising part is that fp16 won’t run beyond a batch size HALF that of fp32 (248 max vs 512) on the fastai CIFAR10 notebook (the one in the GitHub repo), while in theory it should run DOUBLE. It still runs 10% faster than fp32, despite the halved batch size :triumph:
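For what it's worth, here is the back-of-envelope arithmetic behind the "should run DOUBLE" expectation. This is a rough sketch only: real memory use also includes the fp32 master weights and optimizer state that mixed-precision training keeps around, which erode some of the saving.

```python
# Rough memory estimate for one batch of CIFAR10 inputs:
# fp16 stores 2 bytes per element vs 4 for fp32, so at a fixed memory
# budget the batch size could in principle double.
def batch_bytes(batch_size, channels, height, width, bytes_per_element):
    return batch_size * channels * height * width * bytes_per_element

fp32_512 = batch_bytes(512, 3, 32, 32, 4)    # fp32 at batch size 512
fp16_1024 = batch_bytes(1024, 3, 32, 32, 2)  # fp16 at double the batch size
print(fp32_512 == fp16_1024)  # same footprint for double the batch
```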

BTW, I posted a few runs of CIFAR10 with my 1080Ti, 2070 and 2080Ti here.

Durnit, I suspected there was a part of the forum I was not able to see. :frowning:

Interesting, that batch-size constraint. Did sgugger also take a look? Could it be due to the divisible-by-8 fp16 constraint? (That would be strange, because there are too many clever people hanging out on this forum who would have diagnosed that first.)
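On the divisible-by-8 guess: 248 is itself a multiple of 8, so the observed cap can't be explained by the batch size alone. A trivial check (a sketch; the multiple-of-8 rule really applies to the GEMM dimensions the Tensor Cores see, not just the batch dimension):

```python
# Tensor Cores want fp16 GEMM dimensions to be multiples of 8.
def tensor_core_friendly(*dims, multiple=8):
    return all(d % multiple == 0 for d in dims)

print(tensor_core_friendly(512))  # True
print(tensor_core_friendly(248))  # True: 248 = 8 * 31, so this doesn't explain the cap
```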

We had a quick DM exchange with sgugger on KaggleNoobs slack about this issue; his immediate take was that it was a problem with my configuration.

You’re not alone with all these issues :smile:
I’m still sitting with old drivers and waiting for a valid solution that could provide a boost in performance. I was running fp16 training without any real speed-up, and for now I don’t have enough time to install everything from source.

Magma 2.5.0-rc1 released
It works the same way as 2.4.0; just apply the same patches (cmakelists.patch and thread_queue.patch) and ignore the others.

MAGMA 2.5.0 RC1 2018-11-16
MAGMA 2.5.0 RC1 is now released. Updates include:
  • New routine: magmablas_Xgemm_batched_strided (X = {s, d, c, z}) is the stride-based variant of magmablas_Xgemm_batched;
  • New routine: magma_Xgetrf_native (X = {s, d, c, z}) performs the LU factorization with partial pivoting using the GPU only. It has the same interface as the hybrid (CPU+GPU) implementation provided by magma_Xgetrf_gpu. Testing the performance of this routine is possible through running testing_Xgetrf_gpu with the option (--version 3);
  • New routine: magma_Xpotrf_native (X = {s, d, c, z}) performs the Cholesky factorization using the GPU only. It has the same interface as the hybrid (CPU+GPU) implementation provided by magma_Xpotrf_gpu.
    Testing the performance of this routine is possible through running testing_Xpotrf_gpu with the option (--version 2)
  • Added benchmark for GEMM in FP16 arithmetic (HGEMM) as well as auxiliary functions to cast matrices from FP32 to FP16 storage (magmablas_slag2h) and from FP16 to FP32 (magmablas_hlag2s).
MAGMA 2.4.0 2018-06-25
MAGMA 2.4.0 is now released. Updates include:
  • Added constrained least squares routines (magma_[sdcz]gglse) and dependencies:
    magma_zggrqf - generalized RQ factorization
    magma_zunmrq - multiply by orthogonal Q as returned by zgerqf
  • Performance improvements across many batch routines, including batched TRSM, batched LU, batched LU-nopiv, and batched Cholesky
  • Fixed some compilation issues with inf, nan, and nullptr.


  • Changed the way data from an external application is handled:
    There is now a clear distinction between memory allocated/used/freed by MAGMA and by the user application. We added the functions magma_zvcopy and magma_zvpass, which do not allocate memory; instead they copy values from/to application-allocated memory.
  • The examples (in example/example_sparse.c) give a demonstration of how these routines should be used.

Not to dampen any spirits, but to show there is another way, I am running my 2080 ti quite happily on pytorch v1 nightly, cuda 9, driver 410, fastai v1, and 18.04, and it purrs along faster than my Titan Xp. Looking forward to a day around the new year to try and get whatever is released by then in harmony.


… or you can just use the ansible script I put together, along with the PyTorch 1.0 CUDA 10 binaries I currently build (now up to date up to 2018-11-24, i.e. today).

Blog post:

(This is of course what I use on my own RTX, but also on the PaperSpace and GCE V100 machines I sometimes bust out, in mixed precision of course, for the larger networks.)

Obviously you could also just choose to use my build directly without the script, but then you have to pay careful attention to the instructions:


Curious, has anyone resolved the batch size problem (needing to halve it) in FP16 training? It sounds like it’s not worth investing in an RTX right now if there is no real speed improvement.

No real speed improvement, no memory usage improvement: RTX 2070, fastai v1, cudnn 7.4, cuda 10 :joy:

What is the base case you are trying to compare?

@willismar Yeah, I gave a detailed comparison at Comparison between .to_fp16() and .to_fp32() with MNIST_SAMPLE on RTX 2070