I think this repo is dedicated to exploring mixed-precision with PyTorch.
I'm running the scripts with the Word Language Model, and can see a slight performance boost (+15%) with --fp16 on the 2070.
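For anyone curious what the --fp16 flag boils down to, here is a minimal sketch of half-precision in plain PyTorch. This is just an illustration, not the repo's actual code path; proper mixed-precision training also keeps fp32 master weights and scales the loss:

import torch
import torch.nn as nn

# Toy model: cast the weights and the inputs to fp16 and run on the GPU.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = model.cuda().half()

x = torch.randn(64, 512).cuda().half()  # inputs must match the weight dtype
with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16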
Hi Ilia, sorry for the tardy reply. That's a great idea, although, as you suspect, I doubt it's original. I offer a twist: assemble a few 'small' test sets of data to analyse. Place the emphasis on GPU performance, NOT analytical accuracy, i.e. the time taken for each epoch. This way we can compare horrid little machines to real grunt boxes without having to wait days or weeks for analytical results. Hence a broad spectrum of performances can be readily compared and duplicated by anyone wishing to test their 'special rigs' (please pardon the gamer parlance). What do you think, Ilia, or anyone else? All suggestions welcome; no offence will be taken, I'm too old for that.
I'm flat out busy with distractions until the end of the month, then December to catch up, then into FastAI with ears pinned back.
Cheers Ilia, Peter Kelly.
As a follow-up to my setup failing to run .to_fp16()
with fastai on an RTX 2070, I found this ticket on the PyTorch GitHub where they identified the cause of the bug I have (running the script as a ".py" in the terminal generates the same error message, "Floating point exception (core dumped)").
TL;DR: it may be caused by a bug in cuDNN 7.1.4, "confirmed by Nvidia", and was fixed by either reverting to 7.1.2 or upgrading to 7.2 (7.4 is now available).
The tricky part, for me as a noob, is finding a way to get out of the 7.1.4 space.
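To confirm which versions a given environment is actually using (rather than guessing from the conda package name), this quick check from Python should work:

import torch

# Report the CUDA / cuDNN versions this PyTorch build is actually using.
print(torch.__version__)               # e.g. 1.0.0.dev20181109
print(torch.version.cuda)              # e.g. 9.2.148
print(torch.backends.cudnn.version())  # e.g. 7104, i.e. cuDNN 7.1.4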
When I check the installation procedure from Fastai (conda install -c pytorch -c fastai fastai pytorch-nightly cuda92), the current version of PyTorch 1.0 comes bundled with cuDNN 7.1.4.
My question: for those of you successfully using an RTX card with fastai v1 and mixed-precision, which versions of cuDNN & CUDA are installed on your system?
BR
PS: I'm using a brand-new installation of Ubuntu 16.04 dedicated to fastai v1.
I built it right at the start of this course, and once Ubuntu was done, I followed the procedure described on the Fastai GitHub.
That is:
conda install -c pytorch pytorch-nightly cuda92
conda install -c fastai torchvision-nightly
conda install -c fastai fastai
I have 7.2.1
Try ‘conda install cudnn’ and see if that takes care of the issue.
As an aside, I built a new LXC container from scratch today and it ran fine; it didn't even give me the usual CUDA error 11. No improvement in speed, though. Shoutout to @willismar for the excellent LXC tutorial.
Checking ‘conda list --explicit’ on the LXC container, I have
pytorch-nightly 1.0.0.dev20181109-py3.7_cuda9.2.148_cudnn7.1.4_0
but no cudnn listed individually. My main box has
pytorch-nightly 1.0.0.dev20181019-py3.7_cuda9.0.176_cudnn7.1.2_0
and cudnn 7.2.1.
As an update, I fixed the .to_fp16() crashes by installing the CUDA 9.0 build of PyTorch (with cuDNN 7.1.2) instead of the CUDA 9.2 build (with cuDNN 7.1.4).
So doing:
conda install pytorch-nightly -c pytorch
instead of:
conda install pytorch-nightly cuda92 -c pytorch
Click on the "Preview" tab to see the PyTorch 1.0 builds.
I have a question for people who got the 2080 Ti: where did you get it? It seems I missed the window for now, because it's impossible to find them online. Also, if you have one: did you get the Founders Edition or one from a different manufacturer?
It is in stock here in the UK from Nvidia, shipping in 1-3 days. There are various stock trackers online, e.g. 'nowinstock', and the Nvidia forum might have early notice.
Hi Eric,
I upgraded the drivers to 410.72. I now face exceptions at the end of training.
BR,
Julien
Hi,
I am getting an error on this line in lesson 1:
interp = ClassificationInterpretation.from_learner(learn)
Floating point exception (core dumped)
Training works fine, but the interpretation gives this error.
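One untested thought: since the crash seems tied to fp16, converting the learner back to full precision before interpretation might sidestep it. A minimal sketch, assuming your fastai v1 version already has Learner.to_fp32():

# Hypothetical workaround, untested: drop back to fp32 before interpreting.
learn = learn.to_fp32()
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9)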
I tried that a couple of years ago with a 1060, rather than a newer SLI card. I never bought the second one because the 1060 was so slow. Now I'm aiming for one great card. All I need is to rob a liquor store to afford it.
Similarly, I am now getting an error in
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
After the epoch, the process stops with the same floating point exception.
Another TensorFlow benchmark by Puget, comparing the whole RTX family (2070, 2080, 2080Ti, dual-2080Ti) vs. the 1080Ti.
The 2070 is in a really sweet spot compared to both the 2080 and the 1080Ti.
Price-wise, the 2080 is a bit of a "Meh…"
FWIW, I ran Fastai's Cifar10 notebook on a Palit RTX 2080Ti-11G in FP32 and FP16.
FP32 with batch_size=512 already shows a strong improvement over the 1080Ti-11G: 35 sec per epoch vs 49 sec.
For FP16/mixed-precision, I failed to get anywhere close to Nvidia's claim of "up to 2X faster than FP32": about 31 sec per epoch with batch_size=248.
I couldn't go beyond bs=248 without either a CUDA or cuDNN error, which is counter-intuitive vs bs=512 in FP32.
Note: I'm using the regular PyTorch 1.0 install package; I didn't install anything from source or update the CUDA/cuDNN versions.
My notebooks: https://github.com/EricPerbos/GTX-vs-RTX-Deep-Learning-benchmarks
My guess is: once PyTorch and TensorFlow release their optimised and stable versions for the latest CUDA/cuDNN in a few months, the RTX 2070-8G & 2080Ti-11G will be fantastic tools.
We're just not there yet when it comes to "Plug & Play".
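For anyone who wants to reproduce this kind of comparison, here is a rough sketch of the timing loop, assuming the fastai v1 API of the time; the names are assumed, not the exact notebook code, and `path` would point at a CIFAR-10 folder:

import time
from fastai.vision import *

# Build the data once; bs and size here are just examples.
data = ImageDataBunch.from_folder(path, valid='test', bs=512, size=32).normalize()

# FP32 baseline.
learn = create_cnn(data, models.resnet18, metrics=accuracy)
t0 = time.time(); learn.fit_one_cycle(1); print('fp32 epoch:', time.time() - t0)

# FP16 / mixed-precision run.
learn16 = create_cnn(data, models.resnet18, metrics=accuracy).to_fp16()
t0 = time.time(); learn16.fit_one_cycle(1); print('fp16 epoch:', time.time() - t0)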
Tried your notebooks on my 2080: 13% faster at FP16 than FP32 using BS 256. FP32 could do BS 400 but not 440; FP16 could do BS 800 but not 840, so it did show roughly double the capacity. I'm on the 415 driver, but otherwise just the standard install on Ubuntu 18.04.
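That kind of "works at 400, fails at 440" probing can be scripted. A hypothetical helper, where make_learner is whatever builds your learner for a given batch size:

import torch

def max_batch_size(make_learner, sizes=(256, 400, 440, 512, 800, 840)):
    """Hypothetical helper: step the batch size up until CUDA runs out of memory."""
    best = None
    for bs in sizes:
        try:
            learn = make_learner(bs)   # build a fresh Learner at this batch size
            learn.fit(1)               # one epoch is enough to trigger an OOM
            best = bs
        except RuntimeError as e:      # CUDA OOM surfaces as a RuntimeError
            if 'out of memory' not in str(e):
                raise
            torch.cuda.empty_cache()   # release cached blocks before giving up
            break
    return best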
@Ralph Thank you for the feedback and details.
It's one of those mind-boggling things with CUDA/cuDNN versions and all the DL libraries on Ubuntu; it can be surprisingly complex and obscure to get the latest technologies to deliver as promised.
(At least for me, coming from a hardcore gamer background on Windows.)
BTW, @cpbotha (he's not on the V3 course so can't access this thread, according to the forum's UI) wrote a post on running Fastai in mixed-precision with 18.04 and CUDA 10, on a Tesla V100.
Installed CUDA 10, cuDNN 7.4 and the PyTorch wheel from that link.
CUDA and cuDNN didn't change epoch times at all.
As part of the prep for installing the PyTorch wheel, the instructions install mkldnn.
That alone knocked my FP16 epochs from 1:16 down to 1:10: interrupt the kernel, conda install mkldnn in a Jupyter terminal, shift-enter the same cell I was training before the interrupt, and the time is better. Good to know.
But after the wheel install, FP32 is 10% slower, and FP16 crashes every time.
I might try this again on a clean LXC container, but for now it's back to conda PyTorch + mkldnn.
Update: after changing back to conda, but keeping mkldnn, FP32 epochs are barely better than before, but FP16 epochs are averaging 1:07 in a longer test, where 1:16 was the norm in many previous tests. So now I'm looking at a 67/89 ratio for the FP16 speedup on RN101, vs 76/90 before. The first epoch takes a few extra seconds, so I guess some compilation is going on.
https://github.com/intel/mkl-dnn - Intel® Math Kernel Library for Deep Neural Networks
Am I right that nobody has been able to get close to a double speed-up using FP16 versus FP32 on 2070/2080/2080Ti cards yet?
I also need to work out if a second GPU will physically fit. I had issues there because of thickness (5.5 cm): the first GPU took up 2.5 PCIe slots, so I could only use slot 4 for the second GPU, and a second 5.5 cm-thick GPU wouldn't fit due to case space limits. Standard GPUs that take up only 2 PCIe slots would have been fine.
Depending on the type of GPU: if the air intake is on top of the GPU (e.g. the 3-open-fan type), it's standard thickness, and you only have room to fit GPUs in slots 1 and 3 (e.g. a small case where slot 4 is close to the base), then the GPU in slot 1 may not get enough air and could run pretty hot.
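A quick way to keep an eye on that during training is a small watcher script; a sketch assuming the pynvml package is installed (running watch nvidia-smi from a shell does the same job):

import time
from pynvml import (nvmlInit, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
                    nvmlDeviceGetTemperature, NVML_TEMPERATURE_GPU)

# Poll each GPU's temperature every 5 seconds for a minute.
nvmlInit()
for _ in range(12):
    for i in range(nvmlDeviceGetCount()):
        handle = nvmlDeviceGetHandleByIndex(i)
        print(f"GPU {i}: {nvmlDeviceGetTemperature(handle, NVML_TEMPERATURE_GPU)} C")
    time.sleep(5)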
Yeah, that makes sense. In the end, I have the 2080 with the 1080Ti on top of it. I removed the Wi-Fi receiver board from the PCIe x4 slot, which was mounted right under the first GPU. Also, you're right that the GPUs are a bit close to each other, maybe 3-4 cm apart.
I just made a post regarding the issue of decreasing performance when using gaming GPUs (vs blower-style ones) in a multi-GPU setup, due to overheating, as investigated by Puget.