GTX 2080/2080Ti RTX for Deep Learning?

Thanks that is great! Could it be that there are more fully connected layers there? Really not sure lol.

P.S. I went and bought the RTX 2080 Super Aorus! It just came, so I haven't tried it yet; I'll benchmark it on pix2pix since it's the model I use the most.

Hi guys, I am having trouble deciding which GPU to pick for my DL rigs. Which one would you recommend?

2080 Super $750 (new)
2070 Super $600 (new)
1080 Ti $550 (used, never used for crypto mining)

Thanks

I do not consider the 2080 a cost-effective option: it has the same amount of memory as the 2070 and delivers only ~15% more speed while costing substantially more.

That leaves us with the 2070 vs the 1080 Ti.
The latter is still the best option in terms of memory for the money. It will be faster than the 2070 when operating in fp32, but a bit slower in fp16 (note that Pascal cards are capable of operating in fp16 mode, thus effectively doubling the usable VRAM, but the performance gains are marginal compared with the RTX cards).
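For reference, here is a minimal sketch of what "operating in fp16" looks like with a recent PyTorch (torch.cuda.amp); the tiny model and the synthetic batch are just placeholders:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10)).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()          # rescales the loss to avoid fp16 underflow

x = torch.randn(64, 3, 224, 224, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    opt.zero_grad()
    with torch.cuda.amp.autocast():           # runs eligible ops in fp16
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
```

Activations and most intermediate buffers end up stored in half precision, which is where the "doubled" effective VRAM comes from.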

Since both of them cost more or less the same, we can summarize as follows:

  • the 2070 is a bit faster in fp16 and draws less power.
  • the 1080 Ti has ~40% more memory (11 GB vs 8 GB) and is faster in fp32.

It scales very well. I forget the page address, but I read a report by Puget Systems in which they benchmarked two Titan RTXs both with and without NVLink. Two cards scale rather well even without NVLink.
If you want a blower version, buy the Quadro RTX 6000: it has identical specs. But bear in mind that just one slot of separation provides enough space for two Titans to breathe, given that the case has good overall airflow.

I am really happy with my RTX 2080 Ti. Btw, you can get 2 x 2080 Ti for the price of a Titan.

Actually, you can get two and a half 2080 Tis for the price of a single Titan :slight_smile:

But there are other considerations in favour of a single Titan… For example, while it's true that two 2080 Tis have more or less the same total memory as a single Titan, not everything one might work with is parallelizable. If you are forced to work on a single card, 11 GB could be a limiting factor.
I urge you to try to train a big EfficientNet (B5 to B7) on a single 2080 Ti. You'll have an unpleasant surprise.
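If you want to see it for yourself, here is a rough probe (assuming the timm library and its efficientnet_b7 model name; the batch size and resolution are illustrative) that reports the peak memory of a single training step:

```python
import torch
import timm  # assumed installed; "efficientnet_b7" is timm's model name

model = timm.create_model("efficientnet_b7", num_classes=10).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# B7 is trained at ~600px; even a small batch is heavy at that resolution.
x = torch.randn(8, 3, 600, 600, device="cuda")
y = torch.randint(0, 10, (8,), device="cuda")

torch.cuda.reset_peak_memory_stats()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
print(f"peak memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```

On an 11 GB card this will likely run out of memory unless you shrink the batch or the resolution.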

P.S.: Your tabular data repo is great! :slight_smile:

Of course a bigger card is better, but you can parallelize training if your batch size is bigger than 1.
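For example, a hedged sketch with PyTorch's nn.DataParallel (the resnet50 and the batch of 32 are placeholders); each forward pass splits the batch across the visible GPUs:

```python
import torch
from torch import nn
from torchvision import models

model = models.resnet50(num_classes=10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)                 # replicate the model on every visible GPU
model = model.cuda()

x = torch.randn(32, 3, 224, 224, device="cuda")    # the 32 samples get split across the GPUs
y = torch.randint(0, 10, (32,), device="cuda")

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
```

With a batch size of 1 there is nothing to split, so the extra cards sit idle.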


Yes, but the big question is: can you do it always?

Probably not, it depends a lot on the type of data you work with. I almost never work with high res images, or other heavy data. So for my workflow, two cards would be better. Gosh, I would love to have a second card right now =).
@balnazzar which repo?


Not an inconsiderable piece of work. :slight_smile:

Hehe, I just wondered, because that’s my timeseries repo, and I am currently working with tabular models. Assembling Image+Tabular data.


Keep us posted :wink:

Posting here since this is somewhat the "main RTX thread", afaik.

I previously owned a couple of 1080 Tis, but sold them and replaced them with three 2060 Supers since I had numerous convergence issues when using the Pascal cards for 16-bit computation. In some cases convergence did not occur at all, and when it did, it was slower than in 32-bit computation.

Now I am at my parents' for Christmas vacation, and I'm using my father's PC (Windows and a GTX 1070). Since I was doing a toy project of mine on that machine, I noticed that fp16 now actually converges even on Pascal, so I ran a more systematic experiment.

I used Imagewoof since I wanted data on which convergence is a bit harder than on MNIST or CIFAR.

A few things to notice:

  • fp16 actually shows better convergence than fp32
  • speed is only marginally better, presumably due to Pascal's lack of Tensor Cores
  • I did a kernel reset prior to creating the fp16 learner, in order to accurately measure peak memory usage
  • peak memory usage was measured with HWMonitor. It shows almost half the usage in fp16, so we can be fairly certain the computation actually happens in half-precision mode.
  • I experimented with other datasets (text and tabular included), obtaining the same results.
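For concreteness, the comparison above can be roughly reproduced with something like this sketch (fastai v2-style API; exact names may differ between versions, and the hyperparameters are placeholders). Drop .to_fp16() for the fp32 run:

```python
from fastai.vision.all import *

path = untar_data(URLs.IMAGEWOOF)
dls = ImageDataLoaders.from_folder(path, valid='val', item_tfms=Resize(224), bs=32)

torch.cuda.reset_peak_memory_stats()
learn = cnn_learner(dls, resnet50, metrics=accuracy).to_fp16()  # mixed-precision learner
learn.fit_one_cycle(5, 3e-3)
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```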

I'd like to ask @jeremy whether he added some fancy trick to the library to make fp16 convergence possible even on Pascal, and/or whether the latest PyTorch does something like that.
Note that I was previously unable to achieve decent convergence with my 1080 Tis even with NVIDIA Apex installed.

Note: Jeremy, I’m tagging you since you participated in this thread and so it should be pertinent to you. Thanks.

I think it's important to clarify this, since many deep learning practitioners, and especially beginners, don't have deep pockets. If convergence occurs even on Pascal (in any use case), it means one can buy an 8 GB card (1070) on eBay for ~$200, or an 11 GB card (1080 Ti) for ~$450, and still effectively double the usable VRAM with fp16. In this way, one can save a lot of money compared with the expensive RTX cards.


Really it’s just due to ongoing improvements from NVIDIA with cudnn and AMP, and @sgugger has been fixing things along the way too.

Thanks Jeremy! :slight_smile:

So I think we can summarize a few points.

  • 16-bit computation now works flawlessly on Pascal, albeit without big speed gains. It does (almost) double the effective memory, though.
  • If you have capable Pascal cards, it may not be worth the hassle to upgrade to (consumer) Turing.
  • It would be interesting if @sgugger, provided he has time to spare, could summarize the improvements he made to the library to perfect half-precision training.

I don’t know about that - many models get 3x or better speed improvements!


Having a 3x speedup is a really nice thing, and even for stuff that gets more modest speed gains, it's better to have them than not. One can perform a lot more experiments in the same amount of time and avoid the frustration of waiting forever to train big models.

But allow me a couple of considerations:

First and foremost, I am rather bothered by the fact that NVIDIA is blithely exploiting its de facto monopoly. The price ranges have shifted upwards (the previous consumer top dog, the 1080 Ti, had an average price of $750, while the 2080 Ti now retails for $1300). Clearly not satisfied, they also stopped increasing the memory: previously, with each generation they (almost) doubled the memory amount, see for example Maxwell to Pascal.
But from Pascal to Turing we got the same amount of memory per segment (8 GB for midrange, 11 GB for top tier) at substantially higher prices. It's clear they want to squeeze every penny from us…
Note that, meanwhile, state-of-the-art models keep growing in size; see for example the transformers and the big EfficientNets.
"But with 16-bit computation you almost double your memory." True, but it seems you can do that with Pascal too. You'll just have to wait longer for your training to finish.

For everyone not operating in a time-critical production environment, I think the fundamental question is: is there something I can do with Turing that I cannot do with Pascal? If the answer is no, as it seems, I think we shouldn't endorse these crazy pricing policies, otherwise they'll feel forever encouraged to give us less in exchange for more money.

As a footnote, see: https://blog.exxactcorp.com/whats-the-best-gpu-for-deep-learning-rtx-2080-ti-vs-titan-rtx-vs-rtx-8000-vs-rtx-6000/

Here, the BASE model trained using the RTX 2080 Ti (based on vanilla settings for transformer model) clearly is inferior. Note: BIG model failed training on 2x 2080 Ti System with default batch size.

That was two 2080 Tis: $2600 worth of GPUs, and one cannot even train a transformer decently??

These were my two cents.

Finally, could you tell me about an example or two on which you got the 3x speedups? I am eager to compare the 2060 Super and the 1080 Ti systematically. Thanks.


My goal isn't really to send NVIDIA a message, but to get the best price/performance ratio I can. At the moment that probably means using an RTX 2070.

IIRC we got 2-3x perf increase with fastai2’s AWD LSTM. NVIDIA have examples with up to 5x difference. Using DALI is a good idea if you’re doing vision stuff BTW to ensure you’re feeding the GPU fast enough.
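For anyone who hasn't used DALI: a minimal image pipeline looks roughly like this (hedged sketch assuming the functional nvidia.dali API; the dataset path, image size, and batch size are placeholders):

```python
from nvidia.dali import pipeline_def, fn, types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

@pipeline_def
def train_pipe(data_dir):
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(jpegs, device="mixed")      # JPEG decoding on the GPU
    images = fn.resize(images, resize_x=224, resize_y=224)
    images = fn.crop_mirror_normalize(
        images, dtype=types.FLOAT,
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )
    return images, labels.gpu()

pipe = train_pipe(data_dir="/path/to/train", batch_size=64, num_threads=4, device_id=0)
pipe.build()
loader = DALIGenericIterator([pipe], ["data", "label"], reader_name="Reader")

for batch in loader:
    x, y = batch[0]["data"], batch[0]["label"]
    # forward/backward pass goes here
```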


DALI is really interesting, although I haven't had the chance to try it yet. Will do asap, though.

Same here. When I ditched my 1080 Tis, I opted for the 2060 Super… It gives you some 5-8% less performance w.r.t. the "old" 2070, but for less money. Honestly, I made the transition because the 1080 Ti is still very popular among gamers, and by selling mine I was able to get three 2060 Supers by adding just 100 EUR. Thinking about adding a fourth right now…

And still, I really hope to see fastai working with ROCm soon (https://rocm.github.io/pytorch.html). Think about the Radeon VII: almost half the cost of a 2080 Ti, and 16 GB of VRAM. Capable of 16-bit computation, too.

Note: I have no interest in AMD stocks.

Allow me to bother you with one last question: have you ever encountered a deep learning task you could not parallelize with DataParallel? (In other words, something for which a single Titan RTX would have been required because you couldn't use three 8 GB-class cards.)