GTX 2080/2080Ti RTX for Deep Learning?

I just installed the RTX 2070. It works fine in standard precision, but when I try `.to_fp16()`, or `learn = to_fp16(create_cnn(…))`, it crashes my kernel without an error message in Jupyter Notebook.
So rather hard to debug :slight_smile:

I’m using Ubuntu 16.04 and NVIDIA driver 410.73.

Strangely, the `.to_fp16()` command works with my 1080Ti without crashing the kernel (but with no performance boost), hence my post yesterday: https://forums.fast.ai/t/gtx-2080-2080ti-rtx-for-deep-learning/26783/60?u=ericpb
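For reference, the minimal code that triggers the crash is essentially the following (a sketch only: it uses the small MNIST sample instead of my actual dataset, and assumes the fastai v1 `create_cnn`/`to_fp16` API):

```python
from fastai.vision import *   # fastai v1 API

# small sample dataset, just to reproduce the behaviour
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=64)

# FP32 version: trains fine on the RTX 2070
# learn = create_cnn(data, models.resnet34, metrics=error_rate)

# FP16 / mixed-precision version: this is the call that kills my kernel
learn = to_fp16(create_cnn(data, models.resnet34, metrics=error_rate))
learn.fit_one_cycle(1)
```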

I am also getting

The kernel appears to have died. It will restart automatically.

when I try `learn = to_fp16(create_cnn(…))`,
using Ubuntu and NVIDIA driver 396.44 on a V100…

1 Like

It’s not, in my opinion.

Less powerful, less memory, and troubles with `.to_fp16()`.

Once they address such problems, a couple of 2070s for a grand will be great: effectively one card with 16GB, and more power than the 1080Ti (maybe even more than the 2080Ti).

That’s very much spot on: for €550-600 one can get either a used 1080Ti 11GB (last-gen best in class) or a new RTX 2070 8GB.
The key difference is access to Tensor Cores + FP16 (or mixed precision) with the 2070, which potentially doubles its effective VRAM, or at least brings it on par with the 1080Ti.

Getting the RTX (whatever the model) to activate its Tensor Cores for either PyTorch or TensorFlow is not yet simple “plug & play”, as I’m discovering myself, which makes even a simple benchmark such as cifar10 (thank you @sgugger :hugs:) quite challenging.
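For anyone checking their own setup, a first sanity check can be done in plain PyTorch; note this only verifies that the card is visible and that FP16 ops run, it does not prove the Tensor Cores are actually engaged:

```python
import torch

# basic visibility check
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))

# Tensor Cores need compute capability >= 7.0 (Turing RTX cards report 7.5)
print(torch.cuda.get_device_capability(0))

# a half-precision matmul should at least run without error on the GPU
a = torch.randn(1024, 1024, device='cuda').half()
b = torch.randn(1024, 1024, device='cuda').half()
print((a @ b).dtype)   # expected: torch.float16
```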

BTW I’d love to hear from the recent owners of 2080/2080Ti if/how they managed to run Fastai in mixed precision, and how it compares to standard FP32 training.

2 Likes

I just tried using `to_fp16()` with a 2080Ti and didn’t face any crash. BTW I don’t see any performance improvement on this simple test (a classifier with 1k images).

I ran a bunch of benchmarks on the Pets notebook (someone’s modified version of it) using my 2080.

|ResNet|Precision|Batch size|Image size|Time (m:ss)|Error rate|
|---|---|---|---|---|---|
|34|FP16|100|224|4:47|0.053|
|34|FP32|100|224|5:16|0.055|
|34|FP32|48|320|7:29|0.053|
|50|FP32|32|320|9:19|0.044|
|50|FP16|32|320|8:40|0.043|
|50|FP16|32|299|8:07|0.044|
|50|FP16|64|299|7:20|0.041|

I think I have some bottleneck, as someone else posted similar times on a 1060, but doubling the batch size was as simple as wrapping the learner in `to_fp16(…)`.

1 Like

I don’t like used cards. Often they come clogged with dust and dirt, and getting them clean is not an easy task. I’d say buy a new 1080Ti, or wait until prices drop and go for two 2070s.

Hej JulienM,

Thank you for replying as an RTX 2080Ti owner, that helps a lot!

So you didn’t face any crash while using `to_fp16()`, even though you didn’t see any performance improvement (secondary question: in metrics or in speed/duration?).

Could you share more info about your setup: was it with an official fastai MOOC notebook, a Kaggle kernel, or something else?
That way I can try to replicate your exact code on the RTX 2070.

BR,

EPB

Hej Ralph !

Thank you for providing more info.

When you ran your benchmarks on the Pets notebook, did you activate the Tensor Cores and/or FP16/mixed precision in any specific way?
I couldn’t find any specific Tensor Cores/FP16 code in the version you linked.

BR,

EPB

All I did was switch between:

# FP16 / mixed-precision version:
#learn = to_fp16(create_cnn(data, models.resnet34, metrics=error_rate))
# standard FP32 version:
learn = create_cnn(data, models.resnet34, metrics=error_rate)

The linked notebook has more epochs than the original, which explains the slower times in the benchmarks.


Going back to the original notebook, the change above increases my max batch size from 192 to 375.
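Putting it together, the full change is roughly the following (a sketch: the data pipeline is copied from the lesson’s Pets notebook, only the `bs` value and the `to_fp16` wrapper differ, and the batch size you can actually fit will depend on your card):

```python
from fastai.vision import *

# data pipeline straight from the lesson's Pets notebook
path = untar_data(URLs.PETS)
path_img = path/'images'
fnames = get_image_files(path_img)
pat = r'/([^/]+)_\d+.jpg$'

data = ImageDataBunch.from_name_re(
    path_img, fnames, pat,
    ds_tfms=get_transforms(), size=224,
    bs=375,   # the larger batch size that fits once the learner is in FP16
).normalize(imagenet_stats)

learn = to_fp16(create_cnn(data, models.resnet34, metrics=error_rate))
learn.fit_one_cycle(4)
```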


I’m running CUDA 9.2 with NVIDIA driver 410.73, but 410.66 also worked.

1 Like

I created a post on the PyTorch forums regarding a simple test to check that the Tensor Cores are used with an RTX 2070.
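The gist of the test is a timed comparison of a large matrix multiplication in FP32 vs FP16 (a rough sketch: it only measures speed and doesn’t prove the Tensor Cores are used, and whether they kick in depends on the CUDA/cuDNN versions and on the matrix dimensions being multiples of 8):

```python
import time
import torch

def bench(dtype, n=4096, iters=50):
    a = torch.randn(n, n, device='cuda', dtype=dtype)
    b = torch.randn(n, n, device='cuda', dtype=dtype)
    c = a @ b                      # warm-up
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        c = a @ b
    torch.cuda.synchronize()
    return time.time() - start

print('fp32:', bench(torch.float32))
print('fp16:', bench(torch.float16))   # should be clearly faster if the Tensor Cores kick in
```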

2 Likes

Hi Peter! I’ve installed a 2080 in my machine (Sabertooth X99, i7-6800K), so now two GPUs are on board. So far both of them seem to work :smile:

I am going to figure out how they perform in comparison with each other using the advice collected in this thread, because so far I haven’t done many FP16 experiments to unleash the full potential of the RTX card.


Talking about benchmarks, I think it would be great if we could test our 1080Ti, 2070, 2080, and 2080Ti cards in various training settings, i.e. across different datasets and architectures. I know there are some benchmarks on the internet; however, we could publish not only the metrics but also the code, using fastai’s mixed-precision training capabilities, so anyone can reproduce the results and test their own cards/configurations. Probably something similar is already on the list to be implemented by the fastai team?
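As a starting point, the benchmark script could be as simple as something like this (a rough sketch, assuming fastai v1 and the CIFAR-10 sample from `URLs.CIFAR`; each card would report the FP32 and FP16 epoch times):

```python
import time
from fastai.vision import *

# CIFAR-10 sample shipped with fastai: train/ and test/ folders of class subdirs
path = untar_data(URLs.CIFAR)
data = ImageDataBunch.from_folder(path, valid='test', bs=256).normalize(cifar_stats)

def one_epoch(fp16):
    learn = create_cnn(data, models.resnet34, metrics=error_rate)
    if fp16:
        learn = learn.to_fp16()
    start = time.time()
    learn.fit_one_cycle(1)
    return time.time() - start

print('FP32 epoch time:', one_epoch(False))
print('FP16 epoch time:', one_epoch(True))
```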

2 Likes

I ran the notebook as a Python script in the terminal (as suggested on the PyTorch forum).

The error generated when running `to_fp16()` is `Floating point exception (core dumped)`.

I’m encountering the same issue with `.to_fp16()` when running the Cifar10 notebook straight from the fastai GitHub:

Cifar10

One of the moderators on the PyTorch forum suggested a cause, but it’s beyond my pay grade :blush:

could it be you are using torch.float data somewhere in torch.half layers?
Could you post your model definition?

Can anyone help answer? :heart_eyes:

cc @sgugger
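To try to answer that myself, this is how I’m checking which dtypes actually end up in the model and in a batch (a sketch, with `learn` being the fp16 learner from the notebook):

```python
# which dtypes do the model parameters use after to_fp16?
print({p.dtype for p in learn.model.parameters()})
# (seeing both float16 and float32 can be expected, since batchnorm layers are usually kept in FP32)

# and what dtype does a raw batch of inputs have?
x, y = learn.data.one_batch()
print(x.dtype, y.dtype)
```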

In the meantime, a quick benchmark of Cifar10 for the RTX 2070 vs the 1080Ti, in FP32.

The 2070 achieves almost identical speed (52 sec vs 49 sec per epoch) with less than half the batch size!?!
Am I doing something wrong?

GTX 1080Ti 11GB

RTX 2070 8GB

When I run your code on the 2080, `learn.fit_one_cycle` gives me the usual error the first time but works fine on the second go - no kernel restart.

RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch-nightly_1539945974892/work/aten/src/THC/THCGeneral.cpp:421

I can run as high as BS = 184 with a 45-second average epoch.
Adding `to_fp16` drops the time to 40 seconds.
Increasing BS to 368 drops it to 38 seconds.

1 Like

I was expecting something more than 10% :thinking:

1 Like

Unfreeze the model, run some more epochs, then report back.

As long as the model is frozen, the GPU is not leveraged properly, no matter how powerful it is.

I can add that I get more or less the same timings on a frozen model with a 1070, a 1080Ti, and a Tesla V100.
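In fastai terms that’s just the following (a sketch, with `learn` being the learner from the notebook and the learning-rate slice as in the lessons):

```python
learn.fit_one_cycle(4)                             # frozen: only the head trains
learn.unfreeze()                                   # now the whole backbone trains
learn.fit_one_cycle(4, max_lr=slice(1e-5, 1e-3))   # this is where the GPU difference should show
```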

1 Like

I think this repo is dedicated to exploring mixed precision with PyTorch.

I’m running the scripts with the word language model example, and can see a slight performance boost (+15%) with `--fp16` on the 2070.

Hi Ilia, sorry for the tardy reply. That’s a great idea, although, as you suspect, I doubt it’s original. I offer a twist … assemble a few ‘small’ test sets of data to analyse. Place the emphasis on GPU performance, NOT analytical accuracy, but on the time taken for each epoch. This way we can compare horrid little machines to real grunt boxes without having to wait days or weeks for analytical results. Hence a broad spectrum of performance can be readily compared and duplicated by anyone wishing to test their ‘special rigs’ (please pardon the gamer parlance). What do you think, Ilia, or anyone else? All suggestions welcome, no offence will be taken; I’m too old for that.
I’m flat out busy with distractions until the end of the month, then December to catch up, then into fastai with ears pinned back.
Cheers Ilia, Peter Kelly.