The manufacturers have finally understood what people buy these cards for. Their fan hysteresis is extremely aggressive, unlike the Pascal/Turing blower cards.
Really learned a lot from this discussion and hopefully I can make a contribution. I’m in the process of putting together a 2 x RTX 3090 build. Here are the key components of the setup I’m considering:
Processor Cooler: Corsair H150i RGB PRO XT Hydro Series High Performance CPU Cooler
RAM: 128GB Corsair Vengeance DDR4 3000MHz (4 x 32GB)
Graphics Card: 2 x 24GB Nvidia Geforce RTX 3090
M.2 SSD Drive: 2TB Samsung 970 EVO PLUS M.2, PCIe NVMe
Power Supply: Corsair 1600W Pro Series Titanium AX1600i Digital Modular
Motherboard selection has been influenced by other posters such as balnazzar. Bear in mind I’ve only “built” one PC over a decade ago, so my PC-building knowledge is rudimentary. However, looking at the motherboard layout, it’s not clear to me whether fitting the two 3090 graphics cards in the PCIe x16 slots (shown in red below) would rule out using the NVMe M.2 slots (shown in blue below). Would there be enough room to fit the Samsung 970 EVO PLUS in the NVMe M.2 slot, and even if physically possible, would it likely get too hot? It looks like the graphics card would sit right on top of my SSD drive.
Thanks balnazzar, appreciate the feedback. Will report back with photos and temperature info when everything is up and running. Might be helpful to others.
What you’re saying re open-air cards seems consistent with what I’ve read here (https://www.pugetsystems.com/parts/Video-Card/NVIDIA-GeForce-RTX-3090-24GB-Open-Air-13804): This variant uses a multi-fan cooling layout. This is great for keeping a single card running cool and quiet, but results in most of the heat the card generates being pumped back into the computer. That makes these a poor choice for use in multi-GPU systems. In those cases, a video card with a blower-style fan and rear heat exhaust would be much better.
That is something I will check again with the supplier. Thanks for the heads up.
Yes please. This could be a useful reference thread for a lot of people trying to build their DL rig.
Mind that there are both the Rocket 4.0 and the Rocket 4.0 Plus. The latter has a much better controller.
Other alternatives: WD SN850, Samsung 980 Pro.
Unfortunately, the endurance of these new consumer drives is not that great… If you need better endurance, buy an enterprise Gen4 PCIe add-in-card SSD, for example the Samsung PM1735 (rated at 3 DWPD). Furthermore, these SSDs are inserted into PCIe slots and have a giant heatsink, so the thermal throttling problem goes away.
Look on eBay; you’ll find the 1.6 TB PM1735 at honest prices.
I’ll tell you what… Open-air OC cards are made for kids wanting to squeeze every possible FPS out of synthetic benchmarks. As a consequence, some of these cards have a power limiter adjustable to crazy levels (up to 480W).
We should tread in the opposite direction, that is (as Puget did in its review), optimizing power vs. performance. But even if you leave your cards at 350W: I have already ascertained that, for example, the Gigabyte Turbo never goes over a real 350W. So with two cards you are at 700W. My 3960X without any GPU draws just over 300W from the wall outlet. That leaves you with a total of 1000-1050W.
The bottom line is that with 1200W you will be fine and save a lot of money. Just buy a quality unit, so that it won’t lose efficiency working at ~85% of its rated power.
AND, while a 1200W unit will be more than enough to power all of this, I really urge you to run the CPU at 180-200W and the GPUs at 300W maximum. That way you save on heat, noise, and electricity, and you still get >90% of the rated performance (both CPU and GPUs).
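The arithmetic above can be sketched as a quick sanity check. All figures are the ones quoted in this post (350W per GPU, ~300W wall draw for the 3960X, a 1200W unit); substitute your own parts:

```python
# PSU sizing sketch using the numbers discussed above.
# These are this thread's figures, not measurements of your system.
gpus = [350, 350]   # power limit per GPU, in watts
cpu = 300           # 3960X wall draw without GPUs, per this post
psu = 1200          # rated PSU capacity, in watts

total = sum(gpus) + cpu
print(f"{total} W total, {total / psu:.0%} of a {psu} W unit")
```

This lands around 83% of the unit's rating, consistent with the "~85% of rated power" figure above; drives, fans, and VRM losses add a bit more, which is why the post quotes 1000-1050W.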
Some interesting benchmarks to compare different single and multi-GPU systems on different tasks.
What stands out to me comparing 2x3080 and 2x3090 is how the gap between the two shrinks from an average 100% uplift to just 50% when using fp16. I suspect this is because, for many models in fp32, 10GB of VRAM is simply too little. For some, performance improves dramatically with fp16, but for others the limited VRAM remains an issue even in fp16 ― e.g., “transformerxlbase.”
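The VRAM argument above can be made concrete with some back-of-the-envelope arithmetic. This is purely illustrative (the 250M parameter count is a made-up example, and activations, which are often the bulk of training memory, are excluded); it just shows how fp16 halves the per-parameter footprint:

```python
# Rough VRAM estimate for model weights, fp32 vs. fp16.
# Illustrative only: parameter count is hypothetical, and activations,
# gradients, and optimizer state (usually much larger) are not counted.
def param_gib(n_params, bytes_per_param):
    """Memory for the raw weights, in GiB."""
    return n_params * bytes_per_param / 2**30

n = 250_000_000  # hypothetical model size
print(f"fp32 weights: {param_gib(n, 4):.2f} GiB, "
      f"fp16 weights: {param_gib(n, 2):.2f} GiB")
```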
Has anyone experienced thermal slowdown on the 3090?
A few days ago I noticed that, after 2 days of training, the training time per epoch had increased by about 5-10%. During the investigation I saw that thermal throttling had kicked in. It seems the whole system gets too hot after about 24 hours of training. I ordered some more case fans… hope that helps.
Has anyone had the same issue? You won’t get any warnings (dmesg, etc.), and the GPU temps can be quite low (~60°C), so you’ll have to keep an eye on nvidia-smi -q -d PERFORMANCE
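If you'd rather not eyeball that output, a small parser can flag active throttle reasons. This is a sketch that reads the "Clocks Throttle Reasons" section of `nvidia-smi -q -d PERFORMANCE`; the field names below match nvidia-smi's plain-text output, but check it against a dump from your own driver version:

```python
# Minimal parser for the "Clocks Throttle Reasons" section of
# `nvidia-smi -q -d PERFORMANCE`. Pipe the real output in; the sample
# below is a hand-written excerpt for illustration.
def throttle_reasons(report: str) -> dict:
    """Map each throttle reason to True (Active) / False (Not Active)."""
    reasons = {}
    for line in report.splitlines():
        if ":" in line and ("Slowdown" in line or "Cap" in line or "Idle" in line):
            name, _, state = line.partition(":")
            reasons[name.strip()] = state.strip() == "Active"
    return reasons

sample = """
    Clocks Throttle Reasons
        Idle                              : Not Active
        SW Power Cap                      : Not Active
        HW Thermal Slowdown               : Active
        SW Thermal Slowdown               : Not Active
"""
active = [name for name, on in throttle_reasons(sample).items() if on]
print(active)  # ['HW Thermal Slowdown']
```

To use it live, feed it `subprocess.run(["nvidia-smi", "-q", "-d", "PERFORMANCE"], capture_output=True, text=True).stdout` instead of the sample string.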
It’s strange that you got no limits reported. Anyhow, Ampere cards have 93°C as their slowdown temp (not to be confused with the thermal throttling temp, which is 86°C).
May I ask what your hardware config is, and what kind of 3090(s) you are using?
Another thing: GPUs (as well as CPUs and other chips, like NICs, etc.) are sensitive to temperature. While they can work flawlessly near their maximum operating temp, be aware that they will deliver less performance there.
I have one GPU that got better binning (as almost always happens). It runs cooler and delivers slightly better performance. This GPU occupies the topmost position (i.e., it gets more heat), yet it still runs cooler. Note also that its fan spins slower.
At the beginning the GPUs deliver 17 and 17.3 Teraflops respectively;
in the middle they slip to 16.5 and 16.7;
at the end they do 16.4 and 16.7.
The good thing is that they don’t do worse than that (I tried a 15-minute run). They level off after roughly 3 minutes.
Case airflow has little to no impact. I have a lot of Silverstone FHP141 fans, the most powerful 140mm fans in existence (171 CFM each). Running them at minimum speed (500 rpm) versus maximum speed (2000 rpm) changes GPU temps by only 2°C.
Bottom line: it is normal for the GPUs to run the final epochs a little slower.
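For scale, the figures quoted above work out to a drop of only about 3.5% from cold start to steady state, which backs up the "a little slower" conclusion:

```python
# Drift computed from the gpu_burn figures quoted in this post:
# starting vs. steady-state TFLOPS for the two cards.
start = [17.0, 17.3]
steady = [16.4, 16.7]
for s, e in zip(start, steady):
    print(f"{s} -> {e} TFLOPS: {(s - e) / s:.1%} slowdown")
```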
Adding some more info to the topic, see below: that’s a screenshot from Sanyam’s awesome setup (he gave me explicit permission to share it publicly).
He has a 3090, a Titan RTX, and a 2080 Ti. The 3090 is (as you may see a few posts above) the MSI 3090 Ventus, a gaming model which I think is factory-overclocked.
Note that gpu_burn reports less than 14 Teraflops, yet the card is at 76°C, far from any throttling temperature. That’s even worse than @shreeyak’s card, which delivers a minimum of 15.3 Teraflops despite being stuck at its throttling temp.
This is definitely strange. Maybe it’s because Sanyam’s card is plugged in via a riser cable. @init_27, if you have time, it would be great to plug the 3090 directly into a PCIe slot and run gpu_burn for 4-5 minutes, to check whether it delivers more consistent performance. Thanks
What I’ve read is that thermal throttling despite a low GPU temp could be related to memory temperatures (you’ll find a lot of information about memory temperature issues in the crypto-mining community).
I couldn’t find a way to get the actual memory temps, but the slowdown is definitely a temperature issue. If I leave the case open, the GPU temp doesn’t really go down (as you said before, maybe 1-2°C), but the card goes back to full speed in about 30 seconds. I received my new case fans today, so I hope that helps with the case closed.
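For keeping an eye on core temps over a long run, a sketch like this can log them periodically. The `--query-gpu` flags are standard nvidia-smi options (core temperature only; as noted above, memory temps aren't exposed this way on these cards), and the parsing is factored out so it can be checked without a GPU:

```python
# Sketch of a GPU temperature logger around
# `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader`.
import subprocess
import time

def parse_temps(csv_text: str) -> dict:
    """Parse 'index, temperature' CSV rows into {gpu_index: temp_C}."""
    temps = {}
    for row in csv_text.strip().splitlines():
        idx, temp = row.split(",")
        temps[int(idx)] = int(temp)
    return temps

def log_temps(interval=30):
    """Print a timestamped temperature reading every `interval` seconds."""
    while True:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout
        print(time.strftime("%H:%M:%S"), parse_temps(out))
        time.sleep(interval)

# The parsing step, shown on a hand-written sample line:
print(parse_temps("0, 61\n1, 76"))  # {0: 61, 1: 76}
```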