GTX 2080/2080Ti RTX for Deep Learning?

Ralph · November 6, 2018, 6:33pm

I was using learn = to_fp16(…) rather than (…).to_fp16. See if that works for you.

I’m on the 410 driver as well. I usually get a CUDA error 11 the first time I run ‘learn.fit_one_cycle(1)’, but it works without a hitch when I re-run the cell.
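For anyone who would rather not re-run the cell by hand, here is a minimal sketch of a retry wrapper. It assumes the failure surfaces as a Python RuntimeError, which is how PyTorch normally reports CUDA errors. `run_with_retry` and `flaky_fit` are hypothetical names for illustration, and `flaky_fit` only simulates the fail-once behaviour. This hides the symptom; it does not fix the underlying driver or library issue.

```python
def run_with_retry(fn, retries=1, exc_type=RuntimeError):
    """Call fn(); if it raises exc_type, call it again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except exc_type as err:
            if attempt == retries:
                raise  # out of retries: surface the original error
            print(f"attempt {attempt + 1} failed ({err}); retrying")


# Stand-in for learn.fit_one_cycle(1): fails on the first call, then succeeds.
calls = {"n": 0}


def flaky_fit():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("cuda runtime error (11) : invalid argument")
    return "trained"


result = run_with_retry(flaky_fit)
print(result)
```

In a notebook the call would look something like `run_with_retry(lambda: learn.fit_one_cycle(1))`.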

PeterKelly · November 6, 2018, 11:10pm

Hi Ilia, with my tail between my hind legs and head bowed in shame, I must confess that I have rendered my ‘upgrade’ and planned successive performance steps USELESS, for now. Suffice to say that I can’t even boot from my poor old (formerly) trusty tower anymore. I do have an Asus G75V with 2TB of SSD and have been doing The Master’s (Jeremy’s) course on it. Believe me, Ilia, I am working on it; I am even reading the installation instructions for the Samsung 2TB V-NAND 860 Pro and the M.2 970 Evo. The Gigabyte 2080 Ti OC looks GREAT, good for photos. When I do get it all working, AI will probably still be popular.
Is there a general thread where I can sing Jeremy’s (and Rachel’s) praises? I think this should be done from the global AI rooftops. He, virtually alone, has revived AI to a fabulous state, translated it from ‘nerd-speak’, worked like a demon in proving himself and his methods (non-conformist, and for good reason), wrestled AI from the strict realms of the academics, read heaps of papers and sorted them, cut through the inbred jargon, turned superbly understandable lessons and videos into THE premier AI course, and presents it FREE for us unclean masses in the fabulous MOOC. WHAT A GREAT GUY! And, of course, he is Australian, AND from Melbourne. Thanks also to the University of San Francisco for supporting him in this Globally Disruptive Technological thrust. This will inspire global changes to many things, you ain’t seen nothin’ yet.
I’ll try to keep you informed if there is any significant progress on my little desktop, Ilia.
Cheers for now,
Peter Kelly

EricPB · November 7, 2018, 4:34pm

I just installed the RTX 2070. It works fine in standard precision, but when I try .to_fp16(), or “learn = to_fp16(create_cnn(…))”, it crashes my kernel without an error message in Jupyter Notebook.
So it’s rather hard to debug.

I’m using Ubuntu 16.04 and nvidia 410.73.

Strangely, the .to_fp16() command works with my 1080Ti without crashing the kernel (but with no performance boost), hence my post yesterday: https://forums.fast.ai/t/gtx-2080-2080ti-rtx-for-deep-learning/26783/60?u=ericpb

gsg · November 7, 2018, 4:58pm

I am also getting

The kernel appears to have died. It will restart automatically.

when I try “learn = to_fp16(create_cnn(…))”,
using Ubuntu and nvidia 396.44 on a V100…

balnazzar · November 7, 2018, 7:06pm

It’s not, in my opinion.

Less powerful, less memory, troubles with .to_fp16()

Once they address such problems, a couple of 2070s for 1 grand will be great: effectively one card with 16GB, and more power than the 1080Ti (maybe even than the 2080Ti).

EricPB · November 7, 2018, 9:27pm

That’s very much spot on: for €550-600 one can get either a used 1080Ti 11GB (last-gen best in class) or a new RTX 2070 8GB.
The key difference is access to Tensor Cores + FP16 (or mixed precision) with the 2070, potentially doubling its effective VRAM, or at least putting it on par with the 1080Ti.

Getting an RTX card (whatever the model) to activate its Tensor Cores for either PyTorch or TensorFlow is not yet simple “Plug & Play”, as I am discovering myself, which makes even a simple benchmark such as cifar10 (thank you @sgugger) quite challenging.

BTW I’d love to hear from the recent owners of 2080/2080Ti if/how they managed to run Fastai in mixed precision, and how it compares to standard FP32 training.

julienM · November 7, 2018, 10:40pm

I just tried using to_fp16() with a 2080Ti and didn’t face any crash. BTW, I don’t see any performance improvement on this simple test (a classifier with 1k images).

Ralph · November 7, 2018, 10:44pm

I ran a bunch of benchmarks on someone’s modified version of the Pets notebook, using my 2080.

|ResNet (RN)|Precision (FP)|Batch size (BS)|Image size (SZ)|Time|Error|
|---|---|---|---|---|---|
|34|16|100|224|4_47|0.053|
|34|32|100|224|5_16|0.055|
|34|32| 48|320|7_29|0.053|
|50|32| 32|320|9_19|0.044|
|50|16| 32|320|8_40|0.043|
|50|16| 32|299|8_07|0.044|
|50|16| 64|299|7_20|0.041|

I think I have some bottleneck, as another user posted similar times on a 1060, but doubling the batch size was as simple as wrapping the learner in to_fp16(…).
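To put a number on the FP16 gain, here is a rough sketch that pairs the FP16 and FP32 rows from the table above that share the same architecture, batch size and image size. It assumes a time like "4_47" means 4 minutes 47 seconds, which is my reading of the table rather than something stated in it. The function names are made up for this example.

```python
# Rows copied from the table above:
# (resnet depth, precision, batch size, image size, time, error)
ROWS = [
    (34, 16, 100, 224, "4_47", 0.053),
    (34, 32, 100, 224, "5_16", 0.055),
    (34, 32, 48, 320, "7_29", 0.053),
    (50, 32, 32, 320, "9_19", 0.044),
    (50, 16, 32, 320, "8_40", 0.043),
    (50, 16, 32, 299, "8_07", 0.044),
    (50, 16, 64, 299, "7_20", 0.041),
]


def to_seconds(stamp):
    """Assumption: '4_47' means 4 minutes 47 seconds."""
    minutes, seconds = stamp.split("_")
    return int(minutes) * 60 + int(seconds)


def fp16_speedups(rows):
    """Map (resnet, batch size, image size) to FP32 time / FP16 time."""
    by_config = {}
    for rn, fp, bs, sz, stamp, _err in rows:
        by_config.setdefault((rn, bs, sz), {})[fp] = to_seconds(stamp)
    # Keep only configurations that were measured in both precisions.
    return {
        config: times[32] / times[16]
        for config, times in by_config.items()
        if 16 in times and 32 in times
    }


speedups = fp16_speedups(ROWS)
for (rn, bs, sz), ratio in speedups.items():
    print(f"resnet{rn} bs={bs} sz={sz}: FP16 is {ratio:.2f}x faster")
```

Only two configurations were run in both precisions, so this says little about what the card can do with the larger batch sizes that FP16 makes possible.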

balnazzar · November 7, 2018, 10:46pm

I don’t like used cards. They often come clogged with dust and dirt, and getting them clean is not an easy task. I’d say buy a new 1080Ti, or wait until prices drop and go for two 2070s.

EricPB · November 7, 2018, 11:11pm

Hej JulienM,

Thank you for replying as an RTX 2080Ti owner, that helps a lot!

So you didn’t face any crash while using to_fp16(), even though you didn’t see any performance improvement (secondary question: in metrics or in speed/duration?).

Could you share more info about your experience: was it with an official Fastai MOOC notebook, a Kaggle kernel, or something else?
That way I can try to replicate your exact code on the RTX 2070.

BR,

EPB

EricPB · November 7, 2018, 11:15pm

Hej Ralph !

Thank you for providing more info.

When you ran your benchmarks on the Pets notebook, did you activate the Tensor Cores and/or FP16/mixed precision in any specific way?
I couldn’t find any specific Tensor Cores/FP16 code in the version you linked.

BR,

EPB

Ralph · November 7, 2018, 11:20pm

All I did was switch between:

#learn = to_fp16(create_cnn(data, models.resnet34, metrics=error_rate))
learn = (create_cnn(data, models.resnet34, metrics=error_rate))

The linked notebook has more epochs than the original, which explains the slower times in the benchmarks.

Going back to the original notebook, the change above increases my max batch size from 192 to 375.

I’m running cuda 9.2 and nvidia 410.73, but 410.66 also worked.

EricPB · November 7, 2018, 11:21pm

I created a post on the PyTorch forums about a simple test to ensure the Tensor Cores are used with an RTX 2070.
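One rough way to run such a test, sketched here under some assumptions: time a large matrix multiply in FP32 and in FP16 and compare. A clearly faster FP16 run hints that Tensor Cores are being used, but it does not prove it. The matrix size of 4096 is an arbitrary choice; as far as I know, FP16 dimensions should be multiples of 8 for Tensor Cores to kick in. `time_matmul` and `speedup` are names invented for this sketch.

```python
import time

try:
    import torch
except ImportError:  # lets the helpers below load without PyTorch installed
    torch = None


def speedup(fp32_seconds, fp16_seconds):
    """How many times faster the FP16 run was than the FP32 run."""
    return fp32_seconds / fp16_seconds


def time_matmul(dtype, n=4096, repeats=20):
    """Average seconds per n x n matrix multiply on the GPU."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up so one-time setup cost is not timed
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    return (time.perf_counter() - start) / repeats


if torch is not None and torch.cuda.is_available():
    t32 = time_matmul(torch.float32)
    t16 = time_matmul(torch.float16)
    print(f"FP32: {t32 * 1e3:.2f} ms, FP16: {t16 * 1e3:.2f} ms, "
          f"speedup: {speedup(t32, t16):.2f}x")
else:
    print("No CUDA-capable PyTorch install found; skipping the timing run.")
```

The explicit `torch.cuda.synchronize()` calls matter because CUDA kernels are launched asynchronously; without them the timer would mostly measure launch overhead.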

devforfu · November 8, 2018, 3:39am

Hi Peter! I’ve installed a 2080 in my machine (Sabertooth X99, i7-6800K), so now two GPUs are on board. So far both of them seem to work.

I am going to figure out how they perform in comparison with each other using the advice collected in this thread, because so far I haven’t done many fp16 experiments to unleash the full potential of the RTX card.

Talking about benchmarks, I think it would be great if we could test our 1080Ti, 2070, 2080, and 2080Ti cards in various training settings, using various datasets and architectures. I know there are some benchmarks on the internet; however, we can make ours available not only as metrics but as code as well, using fastai’s mixed precision training capabilities, so anyone can reproduce them and test their own cards/configurations. Probably something similar is already on the fastai team’s list?

EricPB · November 8, 2018, 4:32pm

I ran the notebook as a Python script in the terminal (as suggested in the PyTorch forum).

The error generated when running to_fp16() is: Floating point exception (core dumped)

EricPB · November 8, 2018, 6:55pm

I’m encountering the same issue with .to_fp16() when running the Cifar10 notebook straight from Fastai github:

Cifar10

One of the moderators on the PyTorch forum suggested a cause, but it’s beyond my pay-grade:

could it be you are using torch.float data somewhere in torch.half layers?
Could you post your model definition?

Can anyone help answer this?

cc @sgugger
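In case it helps anyone chase the moderator’s suggestion, here is a minimal sketch that lists which tensors are not in half precision. `non_half_tensors` is a hypothetical helper, not part of fastai or PyTorch; it only reads a `.dtype` attribute, so it works on the output of `model.named_parameters()`. Note that mixed-precision setups commonly keep batchnorm layers in FP32 on purpose, so a batchnorm entry in the output is not necessarily the bug.

```python
from types import SimpleNamespace


def non_half_tensors(named_tensors, expected_dtype):
    """Return (name, dtype) for every tensor whose dtype differs from expected_dtype."""
    return [(name, t.dtype) for name, t in named_tensors if t.dtype != expected_dtype]


# Stand-in objects so the helper can be tried without PyTorch; only .dtype is read.
fake_params = [
    ("conv1.weight", SimpleNamespace(dtype="float16")),
    ("bn1.weight", SimpleNamespace(dtype="float32")),
]
mismatches = non_half_tensors(fake_params, "float16")
print(mismatches)

try:
    import torch
    from torch import nn
except ImportError:
    torch = None

if torch is not None:
    # Toy model: everything in half precision except the batchnorm layer.
    model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8)).half()
    model[1].float()
    print(non_half_tensors(model.named_parameters(), torch.float16))

    # Inputs are float32 by default; feeding them to half layers needs batch.half().
    batch = torch.randn(2, 4)
    print("input dtype:", batch.dtype)
```

On a real learner, the same check would be something like `non_half_tensors(learn.model.named_parameters(), torch.float16)`, together with a look at the dtype of one batch from the data loader.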

EricPB · November 8, 2018, 7:27pm

In the meantime, a quick benchmark of Cifar10 for the RTX 2070 vs the 1080Ti, in FP32.

The 2070 achieves almost identical speed (52sec vs 49sec per epoch) with a batch size less than half as large!?!
Am I doing something wrong?

GTX 1080Ti 11Gb

RTX 2070 8Gb

Ralph · November 8, 2018, 8:59pm

When I run your code on the 2080, learn.fit_one_cycle gives me the usual error the first time but works fine on the second go, with no kernel restart.

RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch-nightly_1539945974892/work/aten/src/THC/THCGeneral.cpp:421

I can run as high as BS = 184 with a 45 second average epoch.
Adding to_fp16 drops the time to 40 seconds.
Increasing BS to 368 drops the time to 38 seconds.

balnazzar · November 9, 2018, 2:31am

I was expecting something more than 10%.

balnazzar · November 9, 2018, 2:34am

Unfreeze the model, run some more epochs, then report back.

As the model is frozen, the GPU is not leveraged properly, no matter how powerful it is.

I can add that I get more or less the same timings, on a frozen model, with a 1070, a 1080Ti, and a Tesla V100.