Comparing the RTX 2060 vs the GTX 1080Ti, using Fastai for Computer Vision

My 1080Ti is a blower version, so it expels the heat directly out of the case through a vent below the DisplayPort/HDMI connectors. It’s noisier than a regular GPU with large open-air fans, because of the wind-tunnel effect.

As I tried to explain in my article, I tested both cards in the non-display slot (port “1”, IIRC) while the other card handled my dual monitors in port “0”. You can see this being checked at the start of every Jupyter notebook in my GitHub repo.
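The check itself is just a couple of lines of PyTorch, something like this (a minimal sketch, not the exact cell from my notebooks):

    import torch

    torch.cuda.set_device(1)                  # pin training to the non-display card in port "1"
    print(torch.cuda.current_device())        # should print 1
    print(torch.cuda.get_device_name(1))      # e.g. "GeForce GTX 1080 Ti"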

I also did a quick comparison running the Cifar-10 test with the 1080Ti in port “0”, as shown in the “Bonus” charts at the end of the article.
Basically, I lost 5 to 10% of computing power on the 1080Ti when it was also handling my dual display. The hit might be larger for a smaller GPU.

According to the manufacturer’s specs, the motherboard’s PCIe slots can run a single GPU at x16, or two GPUs at x8/x8. I’m not sure dropping from x16 to x8 has a big impact on performance, but since I tested both cards in the same setup, the impact should hopefully cancel out.

1 Like

@crayoneater I think you would experience a large performance hit due to the x4 PCIe lanes.

Can you double-check how many lanes your card is getting from the terminal? (Use lspci and grep to pick out the GPU entry.)

I think the thermals might be slightly nasty (I might be wrong, please excuse me if I am). What I do know is that you need to make sure temperatures stay stable over long training runs.

Another suggestion: if you want to use both cards together, you might want to increase your RAM, as I’m not sure 16GB can feed both cards in parallel. A workaround is to allocate more swap space.

Can you also try the GTX 1080Ti at the maximum batch size (to benefit from the extra VRAM)? I’m pretty sure it can fit bs=512, or even 1024.
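Something along these lines should do it (a rough fastai v1 sketch, assuming CIFAR-10 from untar_data; the constructor is cnn_learner in recent 1.0.x releases, create_cnn in older ones):

    # Train CIFAR-10 at a large batch size in mixed precision to use the 1080Ti's 11GB.
    from fastai.vision import *

    path = untar_data(URLs.CIFAR)
    data = ImageDataBunch.from_folder(path, valid='test', bs=512)
    learn = cnn_learner(data, models.resnet50, metrics=accuracy).to_fp16()
    learn.fit_one_cycle(30)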

Hi Sanyam, thanks for the input. My board is an SLI board, so the cards run at x8/x8. By “x4” I was referring to the extra card slot on Sritanu’s Z370A board, which does not appear to be an SLI model.

Long term, I do worry about the temperatures. I’ll try experimenting with fan-speed settings, extra fans, and possibly a different case or additional cooling. So far I have not had an issue with RAM plus a 16GB swap file. I am wondering whether FP16 lets you get around the usual “RAM >= 2x VRAM” recommendation?

I had done some tests, and the speed increase from FP16 on the 1080Ti wasn’t as dramatic, if there was any, as on the 2060. But that might be due to some bottlenecks on my PC too.

Interesting post, @EricPB, and very good article.

Did you check the difference in memory occupation both on the 2060 and on the 1080ti?

Now, I have some considerations about Pascal vs. Volta/Turing in FP16, but it’s better to split them in two: speed and memory.

  1. Speed

First of all, please note that a Pascal card should, on paper, deliver 1/32 of its fp32 throughput in fp16. In practice that slowdown doesn’t happen, and I’d be curious to know why.
That said, let’s talk about what actually happens. Interestingly enough, different people find very different results:

  • You found a slowdown of ~5-15%.
  • I found a slight speedup (see below).
  • Other people found a substantial speedup: https://hackernoon.com/rtx-2080ti-vs-gtx-1080ti-fastai-mixed-precision-training-comparisons-on-cifar-100-761d8f615d7f

Again, I’d be very glad to know why that happens.

  2. Memory

One advantage of Volta/Turing is that you can almost double your effective memory thanks to fp16, so a 2060 appears to be on par with a 1080Ti even when it comes to memory.
But this holds for Pascal too: memory occupation is almost halved on my 1080Ti when I train in fp16.

I ran numerous benchmarks in the past, but after reading your article I decided to run some additional ones, just to have fresh results with fastai 1.0.45 and NVIDIA Apex, which I installed both on my machine at work (Tesla V100) and at home (1080Ti).
Mind that I ran the tests on different datasets since I was in a hurry, but what counts in the end is the net difference between fp16 and fp32 on each card.

1080ti:

Note that:

  • between fp32 and fp16 I restarted the kernel and reinstantiated imports, data, etc.
  • memory occupation was 9127 MB in fp32, and 5081 MB in fp16
  • 224px images, batch size=256

Tesla V100-DGXS-32GB:

Note that:

  • again, between fp32 and fp16 I restarted the kernel and reinstantiated imports, data, etc.
  • memory occupation was 14883 MB in fp32, and 7751 MB in fp16.
  • we do not observe substantial speedups in fp16 on the Nvidia flagship (incredibly… Did I mess with something?).
  • this is not a cloud instance. I have direct access to the machine.
  • 700px images, bs=48
2 Likes

Note that you don’t need SLI or NVLink to train in parallel. Even in an x8/x8 Gen3 setup, you should enjoy good performance. Try using your two cards with DataParallel.
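The usual pattern is just a one-line wrap (a sketch, assuming you already have a learn object built on your DataBunch):

    import torch.nn as nn

    # Split each batch across both cards; no SLI/NVLink needed, only the PCIe bus.
    learn.model = nn.DataParallel(learn.model, device_ids=[0, 1])
    learn.fit_one_cycle(30)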

Thank you!

I don’t have the RTX 2060 anymore to run additional tests.

Back then, I tried to run:

  • Cifar-10 in three setups (1080Ti as a main card handling dual-display as well, 1080Ti on 100% training/no display overload, 2060 on 100% training),
  • Cifar-100 in two setups (1080Ti and 2060 on 100% training, no display).

Overall, each notebook ran all the ResNet models for 30 epochs (that’s what I recorded as “Time to complete 30 epochs”), and running all 5 notebooks took about 90 hours, over 5-6 days in multiple sessions.

FWIW, my PC uses an AMD Ryzen 1700X CPU, while @init_27 and @Ekami have Intel i7-7700/8700K CPUs.
Could that explain the difference in FP32/FP16 performance with the 1080Ti?

On the KaggleNoobs Slack, there have been some rather technical discussions on AMD vs Intel. Check the threads where @laurae (who runs the “LauraePedia” channel) weighs in; he has amazing knowledge of hardware and low-level libraries.

BTW, how do you check “GPU memory occupation” while training?
I wanted to check it but didn’t know the command :sunglasses:

1 Like

@balnazzar Someone from Nvidia recently reached out to inform me that I had used un-optimised libraries; they’ve since pushed out even more optimizations, which means the speedup should be even larger now (I’ll report my experiments soon).

@EricPB Just set the nvidia-smi tool to loop by:

nvidia-smi -l 1

This will show the memory usage of all running processes.
I kept increasing the batch size until I got OOM errors.
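The probe looks roughly like this (illustrative names, not the actual benchmark code):

    # Rebuild the DataBunch and learner with a doubled batch size until CUDA
    # runs out of memory; the last size that completes is the usable maximum.
    import torch
    from fastai.vision import *

    path = untar_data(URLs.CIFAR)
    bs = 64
    while True:
        try:
            data = ImageDataBunch.from_folder(path, valid='test', bs=bs)
            learn = cnn_learner(data, models.resnet50, metrics=accuracy).to_fp16()
            learn.fit(1)
            print(f"bs={bs} fits")
            del learn, data
            torch.cuda.empty_cache()       # release cached blocks before the next attempt
            bs *= 2
        except RuntimeError:               # CUDA OOM surfaces as a RuntimeError
            print(f"OOM at bs={bs}")
            break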

2 Likes

@EricPB

I think not. Definitely not: an eight-core Zen has plenty of power to keep any GPU fed.
For the record, I have a Xeon E5-2680 v2 (10C/20T) in the machine with the 1080Ti.

I’ll check the Kaggle thread you suggested, however.

watch nvidia-smi will do, but I suggest gpustat (pip install gpustat), which is much more compact.

If you have time to spare, run some tests for memory occupation on your 1080Ti in fp16 vs. fp32.
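If it’s easier, you can also read the allocator stats straight from PyTorch inside the notebook (a sketch; note the numbers come out a bit lower than nvidia-smi’s, which also counts the CUDA context):

    import torch

    # Current and peak memory held by PyTorch's caching allocator on the default GPU
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.0f} MB")
    torch.cuda.reset_max_memory_allocated()    # reset the peak counter between fp32/fp16 runs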

@init_27

Thanks, that would be awesome. Can you anticipate anything?

Apart from that, what really surprised me was the substantial speedup (~15%) recorded even on a Pascal card. I recorded some 8% myself, but the point is that Pascal should slow down by 32 times when using fp16!

And still, it is clear that the Pascal card was working in fp16, since memory occupation was almost halved.
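A crude way to isolate raw arithmetic from the rest of the training loop is to time a bare matmul at both precisions (a sketch of mine: it measures a GEMM only, not a full network):

    import time
    import torch

    def bench(dtype, n=4096, iters=50):
        # Time `iters` square matrix multiplications at the given precision.
        a = torch.randn(n, n, device='cuda', dtype=dtype)
        b = torch.randn(n, n, device='cuda', dtype=dtype)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return time.time() - t0

    print('fp32:', bench(torch.float32))
    print('fp16:', bench(torch.float16))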

1 Like

If you are into a hardcore understanding of CPU/GPU mechanics, you should deffo join https://kagglenoobs.slack.com/ and look for the channel “Lauraedia”; it’s seriously serious :yum:
He can be a bit blunt if you don’t master the subject (like me), so you are warned :rofl:

1 Like

Doing it right now…

Awful, I’m quite a sensitive guy! :face_with_raised_eyebrow::stuck_out_tongue_winking_eye:

2 Likes

@balnazzar I think RTX would see even further speed improvements. Not sure about the GTX cards.

Would you be interested in running another set of benchmarks that the Nvidia peeps have suggested to me?
My original experiments were with Tuatini, but it would be rude to bother him again (the original test itself took too long).

I just have a 2080Ti card, so I can’t do comparisons.

Please let me know.

Yes, of course! I do have Pascal/Volta, but don’t have Turing. Together, we can do significant comparisons.

Another thing: another set of benchmarks, from 2017, seems to confirm what you found for Pascal: the speedup in fp16 is not dramatic, but it’s still there.

Specifically:

1 Like

I ran a couple of experiments, this time a bit more systematic, using ipyexperiments by @stas.

You’ll find the nbs here: https://github.com/terribilissimo/otherstuff

Other than memory and timings, pay attention to the losses: if one has to run more epochs to reach the same loss, the speedup from fp16 is of little use.

Note for @stas: yours is an awesome tool, but it does not seem to work with the Teslas. Maybe the ones I use (DGX Station) are a bit different from the ones usually found in cloud instances?

I only tested it with my GTX card. In order to sort out any issues please post the details of what’s not working in this thread: IPyExperiments: Getting the most out of your GPU RAM in jupyter notebook Thank you.

And thank you for your kind words, @balnazzar - I’m glad you find it useful. I think it is still a bit clunky and evolving so any feedback for improvement is welcome.

2 Likes

It would be interesting to compare the new 2060 Super with 8GB of RAM. Not sure if the 2060 Super not having NVLink would be an issue for anyone.

I got three of them (blower version), which replaced my previous two 1080Tis, since Pascal shows convergence issues when you use it in FP16.

Essentially, they are equivalent to the 2070 (non-Super), at a lower price point and TDP. Thus, the 2060S has an amazing price/performance ratio, the best among the cards with 8GB.
I paid ~1000EUR for the three of them (but sold the 1080Tis for the same price). With a TDP of 175W, they don’t tax the power supply much, contrary to their more power-hungry siblings.

Note that in any task which can be parallelized with DataParallel, you get 24GB of VRAM (= Titan RTX) for just 1000EUR/$, which, together with 16-bit training, allows you to train even big transformers (except for the few biggest). If your motherboard allows you to stack four of them together, that’s even better.

NVLink: the NVLink on Nvidia’s consumer cards is essentially a toy, very different from the NVLink you would find on Titan/Quadro/Tesla. Forget it; you’ll be fine with the PCIe bus, as long as you get at least 8 lanes per card.

2 Likes

That’s good news. I’m considering two non-blower versions with a 2.7-slot width (XC Ultra), due to the higher demand and resale value of the non-blower cards. I’ve got the airflow, and EVGA thinks it’d be a good setup.

If you’ve got the airflow, two non-blowers will be OK, and you can resell them easily, particularly if they are EVGAs.

1 Like