I was able to run fp16 training. However, I need more tests to see whether everything works as expected. I didn’t install a new version of PyTorch or compile it from source, so I guess it is still using CUDA 9.2.
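A quick way to confirm that guess, assuming PyTorch is importable: `torch.version.cuda` reports the CUDA version the installed binary was built against. The wrapper function below is just an illustrative helper, not part of fastai or PyTorch.

```python
def pytorch_cuda_version():
    """Return the CUDA version PyTorch was built against, or None."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return None
    # A string such as "9.2"; None for CPU-only builds.
    return torch.version.cuda


print("PyTorch built with CUDA:", pytorch_cuda_version())
```

Note that this is the version the binary ships with, which can differ from the CUDA toolkit installed system-wide.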
@cpbotha Yes, I was looking for something like this. I guess I’ll try this approach soon.
By the way, it seems there is a little bug with fastai.show_install:
I believe you can report this as a bug to fast.ai so they can fix it. But if the SMI application (nvidia-smi) reports everything correctly, you should be fine.
By the way, your fastai setup may be working, but be aware that there is no CUDA 9.x or lower release for Ubuntu 18.04; only CUDA 10 is available. As Professor Jeremy told us, with fastai v1.0 you can do whatever you want.
Hi
The other day I compared the source code of MAGMA 2.3.0 and 2.4.0 against the patch files, and I found that almost all of the patches the PyTorch build project applies to MAGMA are already in the new version.
The only patches you may still need to apply to MAGMA 2.4.0 are cmakelists.patch and thread_queue.patch.
@cpbotha Thanks a lot for your blog post, I’ll read it in more detail tomorrow.
I understand you don’t have access to the V3 section of the forum; otherwise you’d see the “RTX series + Fastai” post, where some of us (especially me!) are struggling to get mixed precision to run properly on either the 2070 or the 2080Ti.
The most annoying and surprising part is that FP16 won’t run beyond HALF the FP32 batch size (248 max vs. 512) on the fastai CIFAR10 notebook (the one in the GitHub repo), while in theory it should handle DOUBLE. It still runs 10% faster than FP32, despite the halved batch size.
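A rough back-of-the-envelope check of the “should run DOUBLE” expectation, as a plain-Python sketch. It only counts the bytes of one CIFAR10 input batch (3x32x32 images), and the helper name is hypothetical. Real mixed-precision training also keeps some FP32 state (for example master weights), so the saving in practice is less than 2x.

```python
# Storage size per element for each precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2}


def tensor_bytes(shape, dtype):
    """Bytes needed to store a dense tensor of the given shape and dtype."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * BYTES_PER_ELEMENT[dtype]


batch_shape = (512, 3, 32, 32)  # batch of 512 CIFAR10 images
fp32_bytes = tensor_bytes(batch_shape, "fp32")
fp16_bytes = tensor_bytes(batch_shape, "fp16")
print(fp32_bytes, fp16_bytes, fp32_bytes / fp16_bytes)
```

The same factor of two applies to activations stored in half precision, which is why a larger FP16 batch is the usual expectation.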
BTW, I posted a few runs of CIFAR10 with my 1080Ti, 2070 and 2080Ti here.
Durnit, I suspected there was a part of the forum I was not able to see.
Interesting that batch size constraint. Did sgugger also take a look? Could it be due to the divisible-by-8 fp16-constraint? (would be strange, because there are too many clever people hanging out on this forum who would have diagnosed that first)
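For reference, the divisible-by-8 check is easy to script. This is only a sketch in plain Python: the helper names are made up for illustration, and the multiple of 8 is the commonly cited Tensor Core guideline for fp16 dimensions, not something taken from fastai itself.

```python
def is_multiple_of(value, multiple=8):
    """True if value is divisible by the given multiple."""
    return value % multiple == 0


def round_down_to_multiple(value, multiple=8):
    """Largest multiple of `multiple` that is less than or equal to value."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return (value // multiple) * multiple


for batch_size in (248, 512, 250):
    print(batch_size, is_multiple_of(batch_size), round_down_to_multiple(batch_size))
```

Both batch sizes quoted above (248 and 512) are already multiples of 8, so this constraint alone would not explain the limit.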
You’re not alone with all these issues.
I’m still sitting on old drivers and waiting for a valid solution that actually provides a performance boost. I was running fp16 training without any real speed-up, and for now I don’t have enough time to install everything from source.
Magma-2.5.0-rc1 released
It works the same way as 2.4.0: you just need to apply the same patches, cmakelists.patch and thread_queue.patch, and ignore the others.
MAGMA 2.5.0 RC1
2018-11-16
MAGMA 2.5.0 RC1 is now released. Updates include:
New routine: magmablas_Xgemm_batched_strided (X = {s, d, c, z}) is the stride-based variant of magmablas_Xgemm_batched;
New routine: magma_Xgetrf_native (X = {s, d, c, z}) performs the LU factorization with partial pivoting using the GPU only. It has the same interface as the hybrid (CPU+GPU) implementation provided by magma_Xgetrf_gpu. Testing the performance of this routine is possible through running testing_Xgetrf_gpu with the option (--version 3);
New routine: magma_Xpotrf_native (X = {s, d, c, z}) performs the Cholesky factorization using the GPU only. It has the same interface as the hybrid (CPU+GPU) implementation provided by magma_Xpotrf_gpu. Testing the performance of this routine is possible through running testing_Xpotrf_gpu with the option (--version 2);
Added benchmark for GEMM in FP16 arithmetic (HGEMM) as well as auxiliary functions to cast matrices from FP32 to FP16 storage (magmablas_slag2h) and from FP16 to FP32 (magmablas_hlag2s).
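To make the FP32 to FP16 storage cast mentioned above concrete, here is a plain-Python illustration of the rounding it implies. It does not call MAGMA (magmablas_slag2h works on GPU matrices); it uses the standard library's `struct` half-precision format, and the helper name is hypothetical.

```python
import struct


def round_trip_fp16(x):
    """Store x as IEEE 754 half precision, then read it back as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


for value in (1.0, 0.5, 0.1, 1000.1):
    stored = round_trip_fp16(value)
    print(f"{value!r} -> {stored!r} (error {abs(value - stored):.3g})")
```

Values that are exactly representable in half precision survive unchanged; others pick up a small rounding error, which is why mixed-precision training usually keeps some quantities in FP32.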
MAGMA 2.4.0
2018-06-25
MAGMA 2.4.0 is now released. Updates include:
Added constrained least squares routines (magma_[sdcz]gglse) and dependencies:
magma_zggrqf - generalized RQ factorization
magma_zunmrq - multiply by orthogonal Q as returned by zgerqf
Performance improvements across many batch routines, including batched TRSM, batched LU, batched LU-nopiv, and batched Cholesky
Fixed some compilation issues with inf, nan, and nullptr.
MAGMA-sparse
Changed how data from an external application is handled:
There is now a clear distinction between memory allocated/used/freed by MAGMA and by the user application. We added the functions magma_zvcopy and magma_zvpass, which do not allocate memory; instead they copy values from/to application-allocated memory.
The examples (in example/example_sparse.c) demonstrate how these routines should be used.
Not to dampen any spirits, but to show there is another way, I am running my 2080 ti quite happily on pytorch v1 nightly, cuda 9, driver 410, fastai v1, and 18.04, and it purrs along faster than my Titan Xp. Looking forward to a day around the new year to try and get whatever is released by then in harmony.
… or you can just use the ansible script I put together, along with the PyTorch 1.0 CUDA 10 binaries I currently build (now up to date up to 2018-11-24, i.e. today).
(This is of course what I use on my own RTX, but also on the PaperSpace and GCE V100 machines I sometimes bust out, in mixed precision of course, for the larger networks.)
Curious, has anyone resolved the batch size problem (needing to reduce it to half) on FP16 training? It sounds like it’s not worth it to invest in RTX right now, if there is no real speed improvement.