I think this repo is dedicated to exploring mixed-precision with PyTorch.
I'm running the scripts with the Word Language Model, and can see a slight performance boost (+15%) with --fp16 on the 2070.
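For anyone curious what the --fp16 flag boils down to, here is a minimal sketch of half-precision in plain PyTorch. This is just an illustration, not the repo's actual code path; proper mixed-precision training also keeps fp32 master weights and scales the loss:

import torch
import torch.nn as nn

# Toy model: cast the weights and the inputs to fp16 and run on the GPU.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = model.cuda().half()

x = torch.randn(64, 512).cuda().half()  # inputs must match the weight dtype
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16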
Hi Ilia, sorry for the tardy reply. That's a great idea, although, as you suspect, I doubt it's original. I offer a twist: assemble a few 'small' test sets of data to analyse. Place the emphasis on GPU performance, NOT analytical accuracy, i.e. the time taken for each epoch. This way we can compare horrid little machines to real grunt boxes without having to wait days or weeks for analytical results. Hence a broad spectrum of performances can be readily compared and duplicated by anyone wishing to test their 'special rigs' (please pardon the gamer parlance). What do you think, Ilia, or anyone else? All suggestions welcome; no offence will be taken, I'm too old for that.
I'm flat out busy with distractions until the end of the month, then December to catch up, then into FastAI with ears pinned back.
Cheers Ilia, Peter Kelly.
As a follow-up to my setup failing to run .to_fp16()
with fastai on an RTX 2070, I found this ticket on the PyTorch GitHub where they identified the cause of the bug I have (running the script as a ".py" in the terminal generates the same error message, "Floating point exception (core dumped)").
TL;DR: it may be caused by a bug in cuDNN 7.1.4, "confirmed by Nvidia", and was fixed by either reverting to 7.1.2 or upgrading to 7.2 (7.4 is now available).
The tricky part, for me as a noob, is finding a way to get out of the 7.1.4 space.
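To confirm which versions a given environment is actually using (rather than guessing from the conda package name), this quick check from Python should work:

import torch

# Report the CUDA / cuDNN versions this PyTorch build is actually using.
print(torch.__version__)               # e.g. 1.0.0.dev20181109
print(torch.version.cuda)              # e.g. 9.2.148
print(torch.backends.cudnn.version())  # e.g. 7104, i.e. cuDNN 7.1.4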
When I check the installation procedure from Fastai (conda install -c pytorch -c fastai fastai pytorch-nightly cuda92), the current version of PyTorch 1.0 comes bundled with cuDNN 7.1.4.
My question: for those of you successfully using an RTX card with fastai v1 and mixed-precision, which versions of cuDNN & CUDA are installed on your system?
BR
PS: I'm using a brand-new installation of Ubuntu 16.04 dedicated to fastai v1.
I built it right at the start of this course, and once Ubuntu was done, I followed the procedure described on the Fastai GitHub.
That is:
conda install -c pytorch pytorch-nightly cuda92
conda install -c fastai torchvision-nightly
conda install -c fastai fastai
I have 7.2.1
Try ‘conda install cudnn’ and see if that takes care of the issue.
As an aside, I built a new LXC container from scratch today and it ran fine; it didn't even give me the usual CUDA error 11. No improvement in speed, though. Shoutout to @willismar for the excellent LXC tutorial.
Checking ‘conda list --explicit’ on the LXC container, I have
pytorch-nightly 1.0.0.dev20181109-py3.7_cuda9.2.148_cudnn7.1.4_0
but no cudnn listed individually. My main box has
pytorch-nightly 1.0.0.dev20181019-py3.7_cuda9.0.176_cudnn7.1.2_0
and cudnn 7.2.1.
As an update, I fixed the .to_fp16() crashes by installing the CUDA 9.0 build of PyTorch (with cuDNN 7.1.2) instead of the CUDA 9.2 build (with cuDNN 7.1.4).
So doing:
conda install pytorch-nightly -c pytorch
instead of:
conda install pytorch-nightly cuda92 -c pytorch
Click on the "Preview" tab to see the PyTorch 1.0 builds.
I have a question for people who got the 2080 Ti: where did you get it? It seems I missed the window for now, because it's impossible to find them online. Also, if you have one: did you get the Founders Edition or one from a different manufacturer?
It is in stock here in the UK from Nvidia, shipping in 1-3 days. There are various stock trackers online, e.g. 'nowinstock', and the Nvidia forum might have early notice.
Hi Eric,
I upgraded the drivers to 410.72. I now face exceptions at the end of training.
BR,
Julien
Hi,
I am getting an error on this line in lesson 1:
interp = ClassificationInterpretation.from_learner(learn)
Floating point exception (core dumped)
Training works fine, but the interpretation gives this error.
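One untested thought: since the crash seems tied to fp16, converting the learner back to full precision before interpretation might sidestep it. A minimal sketch, assuming your fastai v1 version already has Learner.to_fp32():

# Hypothetical workaround, untested: drop back to fp32 before interpreting.
learn = learn.to_fp32()
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9)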
I tried that a couple of years ago with a 1060, rather than a newer SLI card. I never bought the second one because the 1060 was so slow. Now I'm aiming for one great card. All I need is to rob a liquor store to afford it.
Similarly, I am now getting an error in
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
After the epoch, the process stops with the same floating point exception.
Another TensorFlow benchmark by Puget, comparing the whole RTX family (2070, 2080, 2080Ti, dual-2080Ti) vs. the 1080Ti.
The 2070 is in a really sweet spot compared to both the 2080 and the 1080Ti.
Price-wise, the 2080 is a bit of a "Meh…"
FWIW, I ran Fastai's Cifar10 notebook on a Palit RTX 2080Ti-11G in FP32 and FP16.
FP32 with batch_size=512 already shows a strong improvement over the 1080Ti-11G: 35 sec per epoch vs 49 sec.
For FP16/mixed-precision, I failed to get anywhere close to Nvidia's claim of "up to 2X faster than FP32": about 31 sec per epoch with batch_size=248.
I couldn't go beyond bs=248 without either a CUDA or cuDNN error, which is counter-intuitive vs bs=512 in FP32.
Note: I'm using the regular PyTorch 1.0 install package; I didn't install anything from source or update the CUDA/cuDNN versions.
My notebooks: https://github.com/EricPerbos/GTX-vs-RTX-Deep-Learning-benchmarks
My guess is: once PyTorch and TensorFlow release their optimised and stable versions for the latest CUDA/cuDNN in a few months, the RTX 2070-8G & 2080Ti-11G will be fantastic tools.
We're just not there yet when it comes to "Plug & Play".
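For anyone who wants to reproduce this kind of comparison, here is a rough sketch of the timing loop, assuming the fastai v1 API of the time; the names are assumed, not the exact notebook code, and `path` would point at a CIFAR-10 folder:

import time
from fastai.vision import *

# Build the data once; bs and size here are just examples.
data = ImageDataBunch.from_folder(path, valid='test', bs=512, size=32).normalize()

# FP32 baseline.
learn = create_cnn(data, models.resnet18, metrics=accuracy)
t0 = time.time(); learn.fit_one_cycle(1); print('fp32 epoch:', time.time() - t0)

# FP16 / mixed-precision run.
learn16 = create_cnn(data, models.resnet18, metrics=accuracy).to_fp16()
t0 = time.time(); learn16.fit_one_cycle(1); print('fp16 epoch:', time.time() - t0)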
Tried your notebooks on my 2080: 13% faster at FP16 than FP32 using BS 256. FP32 could do BS 400 but not 440; FP16 could do BS 800 but not 840, so it did show roughly double the capacity. I'm on the 415 driver, but otherwise just the standard install on Ubuntu 18.04.
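That kind of "works at 400, fails at 440" probing can be scripted. A hypothetical helper, where make_learner is whatever builds your learner for a given batch size:

import torch

def max_batch_size(make_learner, sizes=(256, 400, 440, 512, 800, 840)):
    """Hypothetical helper: step the batch size up until CUDA runs out of memory."""
    best = None
    for bs in sizes:
        try:
            learn = make_learner(bs)   # build a fresh Learner at this batch size
            learn.fit(1)               # one epoch is enough to trigger an OOM
            best = bs
        except RuntimeError as e:      # CUDA OOM surfaces as a RuntimeError
            if 'out of memory' not in str(e):
                raise
            torch.cuda.empty_cache()   # release cached blocks before giving up
            break
    return best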
@Ralph Thank you for the feedback and details.
It's one of those mind-boggling things with CUDA/cuDNN versions and all the DL libraries on Ubuntu; it can be surprisingly complex and obscure to get the latest technologies to deliver as promised.
(At least for me, coming from a hardcore gamer background on Windows.)
BTW, @cpbotha (he's not on the V3 course so can't access this thread, according to the forum's UI) wrote a post on running Fastai in mixed-precision with 18.04 and CUDA 10, on a Tesla V100.
Installed CUDA 10, cuDNN 7.4 and the PyTorch wheel from that link.
CUDA and cuDNN didn't change epoch times at all.
As part of the prep for installing the PyTorch wheel, the instructions install mkldnn.
That alone knocked my FP16 epochs from 1:16 down to 1:10: interrupt the kernel, conda install mkldnn in a Jupyter terminal, shift-enter the same cell I was training before the interrupt, and the time is better. Good to know.
But after the wheel install, FP32 is 10% slower, and FP16 crashes every time.
I might try this again on a clean LXC container, but for now it's back to conda PyTorch + mkldnn.
Update: after changing back to conda, but keeping mkldnn, FP32 epochs are barely better than before, but FP16 epochs are averaging 1:07 in a longer test, where 1:16 was the norm in many previous tests. So now I'm looking at a 67/89 ratio for the FP16 speedup on RN101, vs 76/90 before. The first epoch takes a few extra seconds, so I guess some compilation is going on.
https://github.com/intel/mkl-dnn - Intel® Math Kernel Library for Deep Neural Networks
Am I right that nobody has been able to get close to a double speed-up using FP16 versus FP32 on 2070/2080/2080Ti cards yet?
I also need to work out if a second GPU will physically fit. I had issues there because of thickness (5.5 cm): the first GPU took up 2.5 PCIe slots, so I could only use slot 4 for the second GPU, and a second 5.5 cm-thick GPU wouldn't fit due to case space limits. Standard GPUs that take up only 2 PCIe slots would have been fine.
Depending on the type of GPU: if the air intake is on top of the GPU (e.g. the 3-open-fan type), it's standard thickness, and you only have room to fit GPUs in slots 1 and 3 (e.g. a small case where slot 4 is close to the base), then the GPU in slot 1 may not get enough air and could run pretty hot.
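A quick way to keep an eye on that during training is a small watcher script; a sketch assuming the pynvml package is installed (running watch nvidia-smi from a shell does the same job):

import time
from pynvml import (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetTemperature, NVML_TEMPERATURE_GPU)

# Poll each GPU's temperature every 5 seconds for a minute.
nvmlInit()
for _ in range(12):
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        print(f"GPU {i}: {nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)} C")
    time.sleep(5)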
Yeah, that makes sense. In the end, I have the 2080 with the 1080Ti on top of it. I removed the Wi-Fi receiver board from the PCIe x4 slot, which was mounted right under the first GPU. Also, you're right that the GPUs are a bit close to each other, maybe 3-4 cm apart.
I just made a post regarding the issue of decreasing performance when using gaming GPUs (vs blower-style ones) in a multi-GPU setup, due to overheating, as investigated by Puget.