The manufacturers have finally understood what people buy these cards for. Their fan hysteresis is extremely aggressive, unlike the Pascal/Turing blower cards.
Really learned a lot from this discussion and hopefully I can make a contribution. I’m in the process of putting together a 2 x RTX 3090 build. Here are the key components of the setup I’m considering:
Processor Cooler: Corsair H150i RGB PRO XT Hydro Series High Performance CPU Cooler
RAM: 128GB Corsair Vengeance DDR4 3000MHz (4 x 32GB)
Graphics Card: 2 x 24GB Nvidia Geforce RTX 3090
M.2 SSD Drive: 2TB Samsung 970 EVO PLUS M.2, PCIe NVMe
Power Supply: Corsair 1600W Pro Series Titanium AX1600i Digital Modular
Motherboard selection has been influenced by other posters such as balnazzar. Bear in mind I’ve only “built” one PC over a decade ago, so my PC-building knowledge is rudimentary. However, looking at the motherboard layout, it’s not clear to me whether fitting the two 3090 graphics cards in the PCIe x16 slots (shown in red below) would rule out using the NVMe M.2 slots (shown in blue below). Would there be enough room to fit the Samsung 970 EVO PLUS in the NVMe M.2 slot, and even if physically possible, would it likely get too hot? It looks like the graphics card would sit right on top of my SSD drive.
Thanks balnazzar, appreciate the feedback. Will report back with photos and temperature info when everything is up and running. Might be helpful to others.
What you’re saying re open-air cards seems consistent with what I’ve read here (https://www.pugetsystems.com/parts/Video-Card/NVIDIA-GeForce-RTX-3090-24GB-Open-Air-13804): This variant uses a multi-fan cooling layout. This is great for keeping a single card running cool and quiet, but results in most of the heat the card generates being pumped back into the computer. That makes these a poor choice for use in multi-GPU systems. In those cases, a video card with a blower-style fan and rear heat exhaust would be much better.
That is something I will check again with the supplier. Thanks for the heads up.
Yes please. This could be a useful reference thread for a lot of people trying to build their DL rig.
Mind that there are both the Rocket 4.0 and the Rocket 4.0 Plus. The latter has a much better controller.
Other alternatives: WD SN850, Samsung 980 Pro.
Unfortunately, the endurance of these new consumer drives is not that great… If you need better endurance, buy an enterprise Gen4 PCIe add-in-card SSD, for example the Samsung PM1735 (rated at 3 DWPD). Furthermore, these SSDs are inserted into PCIe slots and have a giant heatsink, so the thermal throttling problem goes away.
Look on eBay; you’ll find the 1.6 TB PM1735 at honest prices.
I’ll tell you what… Open-air OC cards are made for kids wanting to squeeze every possible FPS out of synthetic benchmarks. As a consequence, some of these cards have a power limiter adjustable to crazy levels (up to 480W).
We should tread in the opposite direction, that is (as Puget did in its review), optimizing power vs. performance. But even if you leave your cards at 350W: I have already ascertained that, for example, the Gigabyte Turbo never goes over a real 350W. So with two cards you are at 700W. My 3960X without any GPU draws just over 300W from the wall outlet. That leaves you with a total of 1000-1050W.
The bottom line is that with 1200W you will be fine and save a lot of money. Just buy a quality unit, so that it won’t lose efficiency working at ~85% of its rated power.
AND, while a 1200W unit will be more than enough to power all of this, I really urge you to run the CPU at 180-200W and the GPUs at 300W maximum. That way you save on heat, noise, and electricity, and you still get >90% of the rated performance (both CPU and GPUs).
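The arithmetic above can be sketched as a quick sanity check. All figures are the ones quoted in this post (350W per GPU, ~300W wall draw for the 3960X, a 1200W unit); substitute your own parts:

```python
# PSU sizing sketch using the numbers discussed above.
# These are this thread's figures, not measurements of your system.
gpus = [350, 350]   # power limit per GPU, in watts
cpu = 300           # 3960X wall draw without GPUs, per this post
psu = 1200          # rated PSU capacity, in watts

total = sum(gpus) + cpu
print(f"{total} W total, {total / psu:.0%} of a {psu} W unit")
```

This lands around 83% of the unit's rating, consistent with the "~85% of rated power" figure above; drives, fans, and VRM losses add a bit more, which is why the post quotes 1000-1050W.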
Some interesting benchmarks to compare different single and multi-GPU systems on different tasks.
What stands out to me comparing 2x3080 and 2x3090 is how the gap between the two shrinks from an average 100% uplift to just 50% when using fp16. I suspect this is because, for many models in fp32, 10GB of VRAM is simply too little. For some, performance improves dramatically with fp16, but for others the limited VRAM remains an issue even in fp16 ― e.g., “transformerxlbase.”
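The VRAM argument above can be made concrete with some back-of-the-envelope arithmetic. This is purely illustrative (the 250M parameter count is a made-up example, and activations, which are often the bulk of training memory, are excluded); it just shows how fp16 halves the per-parameter footprint:

```python
# Rough VRAM estimate for model weights, fp32 vs. fp16.
# Illustrative only: parameter count is hypothetical, and activations,
# gradients, and optimizer state (usually much larger) are not counted.
def param_gib(n_params, bytes_per_param):
    """Memory for the raw weights, in GiB."""
    return n_params * bytes_per_param / 2**30

n = 250_000_000  # hypothetical model size
print(f"fp32 weights: {param_gib(n, 4):.2f} GiB, "
      f"fp16 weights: {param_gib(n, 2):.2f} GiB")
```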
Has anyone experienced thermal slowdown on the 3090?
A few days ago I noticed that, after 2 days of training, the training time per epoch had increased by about 5-10%. During the investigation I saw that thermal throttling had kicked in. It seems the whole system gets too hot after about 24 hours of training. I ordered some more case fans… hope that helps.
Has anyone had the same issue? You won’t get any warnings (dmesg, etc.), and the GPU temps can be quite low (~60°C), so you’ll have to keep an eye on nvidia-smi -q -d PERFORMANCE
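If you'd rather not eyeball that output, a small parser can flag active throttle reasons. This is a sketch that reads the "Clocks Throttle Reasons" section of `nvidia-smi -q -d PERFORMANCE`; the field names below match nvidia-smi's plain-text output, but check it against a dump from your own driver version:

```python
# Minimal parser for the "Clocks Throttle Reasons" section of
# `nvidia-smi -q -d PERFORMANCE`. Pipe the real output in; the sample
# below is a hand-written excerpt for illustration.
def throttle_reasons(report: str) -> dict:
    """Map each throttle reason to True (Active) / False (Not Active)."""
    reasons = {}
    for line in report.splitlines():
        if ":" in line and ("Slowdown" in line or "Cap" in line or "Idle" in line):
            name, _, state = line.partition(":")
            reasons[name.strip()] = state.strip() == "Active"
    return reasons

sample = """
    Clocks Throttle Reasons
        Idle                              : Not Active
        SW Power Cap                      : Not Active
        HW Thermal Slowdown               : Active
        SW Thermal Slowdown               : Not Active
"""
active = [name for name, on in throttle_reasons(sample).items() if on]
print(active)  # ['HW Thermal Slowdown']
```

To use it live, feed it `subprocess.run(["nvidia-smi", "-q", "-d", "PERFORMANCE"], capture_output=True, text=True).stdout` instead of the sample string.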
It’s strange that you got no limits reported. Anyhow, Ampere cards have 93°C as their slowdown temp (not to be confused with the thermal throttling temp, which is 86°C).
May I ask what your hardware config is, and what kind of 3090(s) you are using?
Another thing: GPUs (as well as CPUs and other chips, like NICs, etc.) are sensitive to temperature. While they can work flawlessly near their maximum operating temp, be aware that they will deliver less performance there.
I have one GPU that got better binning (as almost always happens). It runs cooler and delivers slightly better performance. This GPU occupies the topmost position (i.e., it gets more heat), yet it still runs cooler. Note also that its fan spins slower.
At the beginning the GPUs deliver 17 and 17.3 Teraflops respectively;
in the middle they slip to 16.5 and 16.7;
at the end they do 16.4 and 16.7.
The good thing is that they don’t do worse than that (I tried a 15-minute run). They level off after roughly 3 minutes.
Case airflow has little to no impact. I have a lot of Silverstone FHP141 fans, the most powerful 140mm fans in existence (171 CFM each). Running them at minimum speed (500 rpm) versus maximum speed (2000 rpm) changes GPU temps by only 2°C.
Bottom line: it is normal for the GPUs to run the final epochs a little slower.
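For scale, the figures quoted above work out to a drop of only about 3.5% from cold start to steady state, which backs up the "a little slower" conclusion:

```python
# Drift computed from the gpu_burn figures quoted in this post:
# starting vs. steady-state TFLOPS for the two cards.
start = [17.0, 17.3]
steady = [16.4, 16.7]
for s, e in zip(start, steady):
    print(f"{s} -> {e} TFLOPS: {(s - e) / s:.1%} slowdown")
```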
Adding some more info to the topic, see below: that’s a screenshot from Sanyam’s awesome setup (he gave me explicit permission to share it publicly).
He has a 3090, a Titan RTX, and a 2080 Ti. The 3090 is (as you may see a few posts above) the MSI 3090 Ventus, a gaming model which I think is factory-overclocked.
Note that gpu_burn reports less than 14 Teraflops, yet the card is at 76°C, far from any throttling temperature. That’s even worse than @shreeyak’s card, which delivers a minimum of 15.3 Teraflops despite being stuck at its throttling temp.
This is definitely strange. Maybe it’s because Sanyam’s card is plugged in via a riser cable. @init_27, if you have time, it would be great to plug the 3090 directly into a PCIe slot and run gpu_burn for 4-5 minutes, to check whether it delivers more consistent performance. Thanks
What I’ve read is that thermal throttling despite a low GPU temp could be related to memory temperatures (you’ll find a lot of information about memory temperature issues in the crypto-mining community).
I couldn’t find a way to get the actual memory temps, but the slowdown is definitely a temperature issue. If I leave the case open, the GPU temp doesn’t really go down (as you said before, maybe 1-2°C), but the card goes back to full speed in about 30 seconds. I received my new case fans today, so I hope that helps with the case closed.
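For keeping an eye on core temps over a long run, a sketch like this can log them periodically. The `--query-gpu` flags are standard nvidia-smi options (core temperature only; as noted above, memory temps aren't exposed this way on these cards), and the parsing is factored out so it can be checked without a GPU:

```python
# Sketch of a GPU temperature logger around
# `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader`.
import subprocess
import time

def parse_temps(csv_text: str) -> dict:
    """Parse 'index, temperature' CSV rows into {gpu_index: temp_C}."""
    temps = {}
    for row in csv_text.strip().splitlines():
        idx, temp = row.split(",")
        temps[int(idx)] = int(temp)
    return temps

def log_temps(interval=30):
    """Print a timestamped temperature reading every `interval` seconds."""
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout
        print(time.strftime("%H:%M:%S"), parse_temps(out))
        time.sleep(interval)

# The parsing step, shown on a hand-written sample line:
print(parse_temps("0, 61\n1, 76"))  # {0: 61, 1: 76}
```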