Recommendations on new 2 x RTX 3090 setup

Hey Andrea, sorry to hear about the high electricity bills in Europe. I hope you enjoy the A6000. It looks like a good compromise.

How was your experience with distributed training on the 2x RTX 3090? Just a curiosity for now. I was wondering whether it works with all architectures straight out of the box, or whether it can be painful to get it to work.

1 Like

I recently built a machine w/ the 16-core TR Pro and this MB w/ 2x 3090 w/ NVLink. I’m pretty happy with it so far. I have not tested distributed training with vs. without NVLink, so no comment on whether that makes a difference, but it was fairly cheap to add.

I’ve done a little experimentation with 2x 3090 distributed training and was seeing near-linear scaling (5:58 per epoch on 1 GPU vs 3:12 on 2 GPUs) on a dataset of maybe several thousand images, training a unet. I did not run into any issues getting it to work.

I’m happy to run a benchmark if someone wants to provide a script.
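For concreteness, here is a sketch of the sort of script that could serve as a benchmark, assuming fastai’s distributed API (`distrib_ctx`); the tiny CamVid subset, batch size, and launch command are just placeholders, not what I actually trained on:

```python
# Rough benchmark sketch for 1 vs 2 GPUs with fastai's distributed training.
# Launch with e.g.: python -m fastai.launch bench.py   (or run plainly for 1 GPU)
import time
import numpy as np
from fastai.vision.all import *
from fastai.distributed import *

path = untar_data(URLs.CAMVID_TINY)          # tiny segmentation set as a stand-in
codes = np.loadtxt(path/'codes.txt', dtype=str)
fnames = get_image_files(path/'images')

def label_func(fn):                          # CamVid label naming convention
    return path/'labels'/f'{fn.stem}_P{fn.suffix}'

dls = SegmentationDataLoaders.from_label_func(
    path, fnames=fnames, label_func=label_func, codes=codes, bs=8)

learn = unet_learner(dls, resnet34)
start = time.time()
with learn.distrib_ctx():                    # no-op on 1 GPU, DDP when launched on 2
    learn.fine_tune(1)
if rank_distrib() == 0:                      # only print once
    print(f'epoch wall time: {time.time() - start:.1f}s')
```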

16c TR Pro, 2x 3090 EVGA Hybrid w/ NVLink, 128GB RAM (4x32), 980 Pro NVMe - very similar to the Lambda Labs build, except everything is hybrid/water cooled

1 Like

Thank you for sharing your experience, Mat. I’ll give it 3-4 more months before deciding whether it is worth adding a second 3090 to my setup — I would definitely enjoy the linear increase in speed. Prices are finally expected to go down a little.

1 Like

The main thing that I noticed is that it’s difficult to properly cool two 3090s. While it’s relatively easy to make sure the core stays under the throttling temp, it’s almost impossible to do the same for the GDDR6X chips on the back… I also collaborated on a project with two liquid-cooled 3090s… No effect. The VRAM on the back overheats all the same. Maybe with an actively liquid-cooled backplate it would be different, but at the price of a much more complicated loop.

Other things to consider:

  1. You can easily take advantage of two (or more) GPUs by using data parallelism, where each GPU processes its own slice of the batch and the gradients are then combined, which is basically a form of gradient accumulation (see the sketch after this list).
    If you want to make use of ‘true’ parallelism, that is, model parallelism, things get a lot more difficult. Note that all the advantages of NVLink, memory pooling, etc., are exploitable only with model parallelism. Also, with data parallelism, if the network itself is big, you will end up with unequal memory occupation on the GPUs: the first GPU will have to hold the network and its half of the minibatch.
  2. Going with two 3090s on air, in my experience, requires 4-slot spacing, that is, you have to leave at least two free slots in between the GPUs. Doable, but it severely restricts your choice of CPUs and motherboards.
  3. Liquid-cooled setups, in my opinion, are not to be left alone, powered on and unsupervised, for long periods of time. Basically, you multiply the potential failure points, and a leak or a pump failure is quite probable. Also, a high-quality liquid-cooled configuration (which mitigates the overheating problem but doesn’t solve it) costs a lot.
  4. We observed that cooling two 3090s with two 360mm rads was doable, but the machine was not exactly silent. In the end, the only way to have two 3090s running at full steam in silence was using a MoRa3 external rad (more headaches).
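To make the data-parallel case concrete, here is a minimal PyTorch DistributedDataParallel sketch (the toy model, batch size, and launch command are placeholders, not a recipe):

```python
# Minimal data-parallel (DDP) sketch: each GPU keeps a full copy of the model,
# processes its own shard of the batch, and gradients are averaged across GPUs
# on every backward pass.
# Launch with: torchrun --nproc_per_node=2 ddp_min.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = torch.nn.Linear(1024, 1024).to(device)    # stand-in for a real network
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(100):
        x = torch.randn(32, 1024, device=device)      # this rank's shard of the batch
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()                               # gradient all-reduce happens here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```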
1 Like

Have you monitored VRAM temps under stress? It’s not doable under Linux (API limitation), but it can be done on Windows. Just install a miner, start mining, and monitor the VRAM temp with hwinfo64.
A miner puts a load on the cards that is very similar to typical DL training.

Perhaps there is hope to finally monitor VRAM temps in Linux. I’m not holding my breath but it would be nice to see this eventually make its way to nvtop.

Regarding VRAM temps, my 3090 is also watercooled and, so far, it has only crashed once (watercooling pump failure :cold_face: ). Limiting it to 280W and adding an active fan on top of the backplate, sucking air away, seems to do the trick for me. This, of course, is less than ideal, but at least it works.
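For reference, the power cap itself is just an nvidia-smi call; a small sketch that applies it (280W is the value mentioned above; needs root, and the GPU index list is an assumption):

```python
# Sketch: cap board power via nvidia-smi. Persistence mode keeps the driver
# loaded so the limit isn't dropped when no process is using the GPU.
import subprocess

POWER_LIMIT_W = 280          # watts; pick whatever your cooling can handle
GPU_INDEXES = [0]            # add more indexes on multi-GPU boxes

subprocess.run(["nvidia-smi", "-pm", "1"], check=True)
for idx in GPU_INDEXES:
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pl", str(POWER_LIMIT_W)], check=True)
```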

The interesting thing is that, at least in my case, I don’t see decreases in performance even after days of training 24/7 at full load. The time to complete an epoch is consistent, epoch after epoch.

That said, next time I upgrade the card I’ll definitely consider the professional series. Twice the VRAM and no need for custom water cooling loops are certainly convenient!

1 Like

That was my point… I am often accessing the machine remotely, even when I travel for weeks. I definitely need air.

Mh, consider that the VRAM reaches the throttling temp within just 2 minutes, or even less than that. So either you don’t see any performance decrease because the VRAM has throttled and stays that way, or the trick with the active fan is working properly (but it’s much more difficult to implement with 2 cards… impossible with one free slot of separation).

… and the blower fan is substantially quieter than the one on the 3090 Turbo…

1 Like

Yeah. I’m doing the same. I was really hoping the water-cooling solution would be more build-and-forget, but it clearly is not. Fans can also fail, don’t get me wrong, but with a custom water-cooling loop there is more that can go wrong.

To be fair, I don’t think the Eisstation I bought is a professional-grade product. The issue I had was more related to the reservoir, which all of a sudden started leaking. Unfortunately, I’m not the first person with this issue.

This is really good to (not) hear! :smile:

1 Like

I have not monitored them, and I only have Linux installed on the machine. I built this for a project that is still ongoing and I can’t bring it offline for a Windows install/test right now. I do have a large case, and there are 2 full slots between the 3090s, with quite a few fans on the case, so hopefully that will help. I replaced the front panel of the case, which was glass, with a 3D-printed panel where I have one of the radiators mounted, to make enough room for all of the radiators and hopefully also improve airflow to some extent.

Top 2 cards are 3090 hybrids and the bottom one is an extra 1080ti I had.

Have you seen factory ‘hybrid’ water cooled cards leak, or just custom water cooling loops? I have been using hybrid coolers for years with no leaking issues, but I know I’m just one small data point. I have not personally heard of factory AIO hybrid card leaks.

With AIOs the probability of a leak is smaller (they are kinda ‘sealed’), but still nonzero. AIO GPUs are a niche, but there are plenty of leak reports for the much more widespread CPU AIOs. For example, the premium, expensive EK AIOs are infamous for that reason. Another thing to take into account is the pump. For obvious reasons, the pumps installed in AIOs are smaller and punier. While you could get a reasonable degree of reliability with dual genuine D5s in a custom loop (e.g. EK revo dual), that’s not the case with AIOs.

Now, not to spoil the party, but regarding what you said above, I bet my arse that your VRAM is constantly throttling. Plenty of investigation has been done about that (various subreddits are full of details), but there is no way that in a 3090, be it liquid, air, or hybrid cooled, the vram doesn’t throttle, unless one takes specific steps (heatsinks on the backplate with active fans, repadding…) and even so, the trick works only at reduced power.
The only way to prevent it completely is by adopting a liquid-cooled active backplate.

Now that pissed me off a lot, but… Don’t forget that Nvidia sells the 3090 as a gaming card. And indeed in gaming the card behaves as expected.

1 Like

That’s another point… The hardware used in computer liquid cooling is basically recycled stuff from aquariums.
Even the best components (D5 pumps, etc…) come from that industry.

And they think that’s fine, since the people buying them are just a bunch of gaming-addicted kids with flashing LED strips.

The only example of a liquid-cooled professional rig is the DGX Station, for which Nvidia had parts specifically custom-built, and it offers rock-solid on-site warranty.

How much performance degradation (%) have you noticed after the card starts throttling?

A good 30% on average, but it varied depending on how much throttling was required (winter… summer…), the power level, and the task at hand. Also remember that operating VRAM at the throttling temp should only be done for limited timespans. For GDDR6X, Micron declares that the chips start taking damage at ~120C; the throttling temp is 110C.

1 Like

I was able to test out one of my 3090s on another PC. I used NiceHash; not sure if that’s the best for testing or not. It looks like my memory-junction temps were bouncing between 102 and 104C after ~30 min of mining. I know that temp isn’t good, but it doesn’t seem to be at throttling temps. This is a stock EVGA 3090 Hybrid 3988. Not sure if I’m interpreting this correctly, but it does seem like it’s not throttling. I might try adding another fan on top of the backplate to see if that lowers it even more. I’m happy to run another test if you have any suggestions.

1 Like

They are not throttling, and that’s good, but note that you are operating them at a maximum of 106C, which is very close. Note also that the VRAM occupation is modest, some 5 GB. Try to fiddle with the NiceHash settings to see if you can achieve near-maximum VRAM occupation. Another way could be installing fastai/pytorch on Windows and training a big transformer.
An active fan on the backplate could help. Try to keep some 10C between the throttling temp and your long-term operating temp, that is, 100C or less.
Also, reducing the operating power (which is very high, >400W) can help a lot.
Anyway, given that the power level is so high, the card is behaving better than your typical 3090… Probably EVGA uses good-quality thermal pads, and a backplate with good thermal capacity.
But please test with near-maximum VRAM occupation. VRAM chips are ugly beasts: if a chip is not addressed by the controller, it doesn’t produce heat.
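To get near-maximum VRAM occupation without fiddling with miner settings, a crude PyTorch “VRAM heater” along these lines would do (the chunk size and the 90% target are arbitrary; watch the temps with hwinfo64 as described above):

```python
# Sketch: fill most of the 24 GB with large tensors and keep reading/writing
# all of them, so every memory chip is actually addressed (idle chips don't heat up).
import torch

assert torch.cuda.is_available()
dev = torch.device("cuda:0")

chunk_elems = 256 * 1024 * 1024          # 256M float32 elements ≈ 1 GiB per chunk
target_bytes = int(0.90 * torch.cuda.get_device_properties(dev).total_memory)

chunks = []
while torch.cuda.memory_allocated(dev) + 4 * chunk_elems < target_bytes:
    chunks.append(torch.randn(chunk_elems, device=dev))

print(f"allocated ~{torch.cuda.memory_allocated(dev) / 2**30:.1f} GiB, stressing... Ctrl-C to stop")
while True:
    for c in chunks:
        c.mul_(1.0001).add_(1e-6)        # touch every chunk so all chips stay busy
    torch.cuda.synchronize()
```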

1 Like

I thought I’d post here about my 2x 3090 rig, since I’ve seen very few 2x 3090 builds documented.

Case: Lian-Li O11 XL
MB: ASUS X299 Sage
CPU: i9 10940
Memory:
GPUS: 2x3090-FE
PSU: Corsair AX1600i
CPU cooler: LianLi AIO

I selected this motherboard so that I could get 4-slot spacing for the 3090s, allowing a larger air gap between them, per Tim Dettmers’ blog.

The case, I guess, is a splurge, but it’s also great to have the space for cabling and air circulation.

I had started this build using a 1200W Corsair. This did NOT work for 2-GPU training workloads. As soon as I started certain types of training workloads, the machine would immediately power cycle. I could prevent the power cycling by locking the GPU clocks (nvidia-smi -lgc 1600) to something below peak clocks; 1600 MHz worked well for me, YMMV. I didn’t have success with wattage limiting using nvidia-smi, and I think the reason is that the 3090 has transient power spikes even when wattage-limited, whereas capping the peak frequency reduces the spikes. On many training pipelines this would result in a pretty minimal reduction in overall speed, but I wanted to be able to run the 3090s flat out, so I migrated to the 1600W PSU, which has solved the power problems.
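If it helps anyone reproduce the workaround, the clock lock is a one-liner per GPU; a small sketch applying it to both cards (1600 MHz is the value quoted above; needs root; revert with nvidia-smi -rgc):

```python
# Sketch: lock the graphics clock on each 3090 to tame transient power spikes.
# 1600 MHz is the value from the post above; YMMV.
import subprocess

MAX_CLOCK_MHZ = 1600

for gpu_index in (0, 1):     # the two 3090s
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-lgc", str(MAX_CLOCK_MHZ)],
        check=True,
    )
```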

After resolving the power problems, the system’s been a dream.

Running flat out, the lower GPU stays around 63C and the upper GPU around 75C. I will probably invest in one more set of fans to the right of the GPUs to provide more cool intake. The lower fan set is pull, the upper AIO is push, and the rear fan is push.
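For anyone who wants to log those numbers over a long run, here is a small sketch using pynvml (core temperature and board power only; as discussed above, the memory-junction temp isn’t exposed through NVML on Linux):

```python
# Sketch: log per-GPU core temperature and board power once a minute.
import time
import pynvml  # pip install nvidia-ml-py (provides the pynvml module)

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            power = pynvml.nvmlDeviceGetPowerUsage(h) / 1000  # milliwatts -> watts
            print(f"gpu{i}: {temp} C, {power:.0f} W")
        time.sleep(60)
finally:
    pynvml.nvmlShutdown()
```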

5 Likes

This is great. How much did it cost? Would love to see some performance benchmarks if you get some time to share. Cheers!

If you have the time and the inclination, check the VRAM temp on the upper one.