How to install RTX enabled fastai? (CUDA10)

We had a quick DM exchange with sgugger on the KaggleNoobs Slack about this issue; his immediate take was that it was a problem with my configuration.

You’re not alone with all these issues :smile:
I’m still on the old drivers, waiting for a working solution that actually gives a performance boost. I was running fp16 training without any real speedup, and for now I don’t have enough time to install everything from source.

Magma-2.5.0-rc1 has been released.
It also works the same way as 2.4.0; you just need to use the same patches:
cmakelists.patch and thread_queue.patch, ignoring the others.

MAGMA 2.5.0 RC1 2018-11-16
MAGMA 2.5.0 RC1 is now released. Updates include:
  • New routine: magmablas_Xgemm_batched_strided (X = {s, d, c, z}) is the stride-based variant of magmablas_Xgemm_batched;
  • New routine: magma_Xgetrf_native (X = {s, d, c, z}) performs the LU factorization with partial pivoting using the GPU only. It has the same interface as the hybrid (CPU+GPU) implementation provided by magma_Xgetrf_gpu. Testing the performance of this routine is possible through running testing_Xgetrf_gpu with the option (--version 3);
  • New routine: magma_Xpotrf_native (X = {s, d, c, z}) performs the Cholesky factorization using the GPU only. It has the same interface as the hybrid (CPU+GPU) implementation provided by magma_Xpotrf_gpu. Testing the performance of this routine is possible through running testing_Xpotrf_gpu with the option (--version 2);
  • Added benchmark for GEMM in FP16 arithmetic (HGEMM) as well as auxiliary functions to cast matrices from FP32 to FP16 storage (magmablas_slag2h) and from FP16 to FP32 (magmablas_hlag2s).
MAGMA 2.4.0 2018-06-25
MAGMA 2.4.0 is now released. Updates include:
  • Added constrained least squares routines (magma_[sdcz]gglse) and dependencies:
    magma_zggrqf - generalized RQ factorization
    magma_zunmrq - multiply by orthogonal Q as returned by zgerqf
  • Performance improvements across many batch routines, including batched TRSM, batched LU, batched LU-nopiv, and batched Cholesky
  • Fixed some compilation issues with inf, nan, and nullptr.


  • Changed how data from an external application is handled:
    There is now a clear distinction between memory allocated/used/freed from MAGMA and the user application. We added the functions magma_zvcopy and magma_zvpass, which do not allocate memory; instead they copy values from/to application-allocated memory.
  • The examples (in example/example_sparse.c) demonstrate how these routines should be used.

Not to dampen any spirits, but to show there is another way: I am running my 2080 Ti quite happily on PyTorch v1 nightly, CUDA 9, driver 410, fastai v1, and Ubuntu 18.04, and it purrs along faster than my Titan Xp. I’m looking forward to setting aside a day around the new year to try to get whatever has been released by then working together.
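
For anyone comparing setups like the ones in this thread, here is a minimal sketch that reports which versions PyTorch actually sees. The function name describe_environment is my own choice for illustration, not part of fastai or PyTorch, and the script reports torch as None if PyTorch is not installed. It does not report the NVIDIA driver version, which you would still need to check separately.

```python
import importlib


def describe_environment():
    """Collect the PyTorch, CUDA and cuDNN versions visible to Python.

    Hypothetical helper for illustration; returns {"torch": None}
    when PyTorch cannot be imported.
    """
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        # CUDA version PyTorch was built against (None for CPU-only builds)
        "cuda": torch.version.cuda,
        # cuDNN version PyTorch loads (None if cuDNN is unavailable)
        "cudnn": torch.backends.cudnn.version(),
        "gpu": None,
    }
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Posting this output alongside a timing result makes it easier to tell whether two people are really running the same CUDA and cuDNN combination.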


… or you can just use the ansible script I put together, along with the PyTorch 1.0 CUDA 10 binaries I currently build (current as of 2018-11-24, i.e. today).

Blog post:

(This is of course what I use on my own RTX, but also on the PaperSpace and GCE V100 machines I sometimes bust out, in mixed precision of course, for the larger networks.)

Obviously you could also just choose to use my build directly without the script, but then you have to pay careful attention to the instructions:


Just curious: has anyone resolved the batch size problem (needing to cut it in half) with FP16 training? It sounds like RTX isn’t worth investing in right now if there is no real speed improvement.

No real speed improvement and no memory usage improvement here: RTX 2070, fastai v1, cuDNN 7.4, CUDA 10 :joy:

What is the base case you are comparing against?
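
One way to pin down a base case is to time the exact same step in FP32 and in FP16 and report the ratio. Below is a minimal timing sketch, not a definitive benchmark: the names benchmark and speedup are hypothetical helpers, and the step you pass in (one forward/backward pass, one batch, etc.) is up to you. Because CUDA kernels launch asynchronously, GPU timings are only meaningful if you pass something like torch.cuda.synchronize as the sync argument.

```python
import time


def benchmark(step, warmup=3, iters=10, sync=None):
    """Return the mean wall-clock seconds per call of `step`.

    `warmup` calls run first and are not timed. `sync` is an optional
    callable (for GPU work, e.g. torch.cuda.synchronize) that blocks
    until all queued work has finished.
    """
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / iters


def speedup(baseline_seconds, candidate_seconds):
    """Ratio above 1.0 means the candidate is faster than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # CPU-only stand-ins so the script runs anywhere; replace them with
    # your own FP32 and FP16 training steps.
    slow = benchmark(lambda: sum(range(200_000)))
    fast = benchmark(lambda: sum(range(100_000)))
    print(f"speedup: {speedup(slow, fast):.2f}x")
```

Keeping the model, batch size and data identical between the two runs is what makes the ratio comparable across machines.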

@willismar Yeah, I give a detailed comparison in “Comparision between .to_fp16() and .to_fp32() with MNIST_SAMPLE on RTX 2070”.