The main thing that I noticed is that it’s difficult to properly cool two 3090s. While it’s relatively easy to make sure the core stays under the throttling temp, it’s almost impossible to do the same for the GDDR6X chips on the back… I also collaborated on a project with two liquid cooled 3090s… No effect. The vram on the back overheats all the same. Maybe with an active liquid cooled backplate it would be different, but at the price of a much more complicated loop.
Other things to consider:
You can easily take advantage of two (or more) gpus by using data parallelism, which is basically a form of gradient accumulation.
If you want to make use of ‘true’ parallelism, that is model parallelism, things get a lot more difficult. Note that all the advantages of NVlink, memory pooling, etc, are exploitable only with model parallelism. Also with data parallelism, if the network itself is big, you will end up with unequal memory occupation on the gpus. The first gpu will have to hold the network and its half of the minibatch.
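To make the "data parallelism is basically gradient accumulation" point concrete, here is a minimal toy sketch in plain Python (no GPU or framework involved). The model (y = w * x), the data, and the function names are made up for illustration; real frameworks do the averaging with an all-reduce across devices.

```python
# Toy model: y = w * x, loss = mean squared error over the batch.
def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over `batch`."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_grad(w, batch, n_gpus=2):
    """Split the batch into equal shards, compute one gradient per shard
    (one per simulated GPU), then average the shard gradients.
    Assumes len(batch) is divisible by n_gpus."""
    shard = len(batch) // n_gpus
    shards = [batch[i * shard:(i + 1) * shard] for i in range(n_gpus)]
    grads = [grad_mse(w, s) for s in shards]
    return sum(grads) / n_gpus


batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]
w = 0.5
full = grad_mse(w, batch)             # one GPU, whole minibatch
split = data_parallel_grad(w, batch)  # two simulated GPUs, half a minibatch each
print(full, split)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why data parallelism gives you a bigger effective batch rather than a bigger model.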
Going with two 3090s on air, in my experience, requires 4-slot spacing, that is, you have to leave at least 2 free slots between the gpus. Doable, but it severely restricts your choice of cpus and motherboards.
Liquid cooled setups, in my opinion, are not to be left alone, powered on and unsupervised, for long periods of time. Basically, you multiply the potential failure points, and a leak or a pump failure is quite probable. Also, a high quality liquid cooled configuration (which mitigates the overheat problem but doesn’t solve it) costs a lot.
We observed that cooling two 3090s with two 360mm rads was doable, but the machine was not exactly silent. In the end, the only way to have two 3090s operating on full steam and silence was by using a MoRa3 external rad (more headaches).
Have you monitored VRAM temps under stress? It’s not doable under Linux (API limitation) but it can be done on Windows. Just install a miner, start mining, and monitor the VRAM temp with hwinfo64.
A miner puts on the cards a kind of load very similar to typical DL training.
Perhaps there is hope to finally monitor VRAM temps in Linux. I’m not holding my breath but it would be nice to see this eventually make its way to nvtop.
Regarding VRAM temps, my 3090 is also watercooled and, so far, it only crashed once (watercooling pump failure). Limiting it to 280W and adding an active fan on top of the backplate sucking air away seems to do the trick for me. This, of course, is less than ideal, but at least it works.
The interesting thing is that, at least in my case, I don’t see decreases in performance even after days of training 24/7 at full load. Times to complete an epoch are consistent, epoch after epoch.
That said, next time I upgrade the card I’ll definitely consider the professional series. Twice the VRAM and no need for custom water cooling loops are certainly convenient!
That was my point… I am often accessing the machine remotely, even when I travel for weeks. I definitely need air.
Mh, consider that the VRAM reaches the throttling temp within just 2 minutes, or even less than that. So either you don’t see any performance decrease because the vram has throttled and stays so, or the trick with the active fan is working properly (but much more difficult to implement with 2 cards… Impossible with one free slot separation).
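Since throttling that kicks in within the first couple of minutes would be invisible in per-epoch timings, one rough way to tell the two cases apart is to log per-step times and compare the first two minutes against the rest. This is only a sketch: the function name and the 120-second window are my own assumptions (the window is taken from the "within 2 minutes" estimate above), and other things such as data loading can also change step times.

```python
from statistics import mean


def throttle_slowdown(step_times, warmup_seconds=120.0):
    """Compare mean step time before and after `warmup_seconds` of
    cumulative run time.

    `step_times` is a list of per-step durations in seconds, in order.
    Returns late_mean / early_mean, or None if either window is empty.
    A ratio clearly above 1.0 suggests the card slowed down after warming up.
    """
    early, late, elapsed = [], [], 0.0
    for t in step_times:
        # A step belongs to the window in which it started.
        (early if elapsed < warmup_seconds else late).append(t)
        elapsed += t
    if not early or not late:
        return None
    return mean(late) / mean(early)


# Hypothetical log: 120 steps at 1.0 s each, then 60 steps at 1.3 s each.
example = [1.0] * 120 + [1.3] * 60
ratio = throttle_slowdown(example)
print(ratio)
```

A ratio close to 1.0 on a cold start would support the "active fan is working" explanation; a ratio well above 1.0 would point to throttling that the epoch-level numbers hide.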
… and the blower fan is substantially more silent than the one on the 3090 turbo…
Yeah. I’m doing the same. I was really hoping the water-cooling solution would have been more build-and-forget but it clearly is not. Fans can also have failures, don’t get me wrong, but with a custom water-cooling loop, there is more that can go wrong.
To be fair, I don’t think the Eisstation I bought is a professional-grade product. The issue I had was more related to the reservoir, which all of a sudden started leaking. Unfortunately, I’m not the first person with this issue.
I have not monitored them and I only have Linux installed on the machine. I built this for a project that is still ongoing and I can’t bring it offline for a Windows install/test right now. I do have a large case, and there are 2 full slots between the 3090’s, with quite a few fans on the case, so hopefully that will help. I replaced the front panel of the case, which was glass, with a 3D printed panel where I have one of the radiators mounted, to make enough room for all of the radiators and hopefully also improve airflow to some extent.
Top 2 cards are 3090 hybrids and the bottom one is an extra 1080ti I had.
Have you seen factory ‘hybrid’ water cooled cards leak or just custom water cooling loops? I have been using hybrid coolers for years with no leaking issues, but I know i’m just one small data point. I have not personally heard of factory AIO hybrid card leaks.
With AIOs the probability of a leak is smaller (they are kinda ‘sealed’), but still nonzero. AIO gpus are a niche, but there are plenty of leak reports for the much more widespread cpu AIOs. For example, the premium, expensive EK AIOs are infamous for that reason. Another thing to take into account is the pump. For obvious reasons, the pumps installed in AIOs are smaller and punier. While you could have a reasonable degree of reliability with a dual genuine D5 in a custom loop (e.g. EK revo dual), that’s not the case with AIOs.
Now, not to spoil the party, but regarding what you said above, I bet my arse that your VRAM is constantly throttling. Plenty of investigation has been done about that (various subreddits are full of details), but there is no way that in a 3090, be it liquid, air, or hybrid cooled, the vram doesn’t throttle, unless one takes specific steps (heatsinks on the backplate with active fans, repadding…) and even so, the trick works only at reduced power.
The only way to prevent it completely is by adopting a liquid-cooled active backplate.
Now that pissed me off a lot, but… Don’t forget that Nvidia sells the 3090 as a gaming card. And indeed in gaming the card behaves as expected.
Some good 30% on average, but it changed depending on how much throttling it required (winter… summer…) the power level, and the task at hand. Also remember that operating vrams at throttling temp should be done for limited timespans. For gddr6x, Micron declares they start taking damage at ~120C. Throttling temp is 110C.
I was able to test out one of my 3090’s on another PC. I used nicehash, not sure if that’s the best for testing or not. It looks like my memory junction temps were bouncing between 102-104 after ~30 min of mining. I know that temp isn’t good, but it doesn’t seem to be at throttling temps. This is a stock EVGA 3090 hybrid 3988. Not sure if I’m interpreting this correctly but it does seem like it’s not throttling. I might try adding another fan on top of the backplate to see if that lowers it even more. I’m happy to run another test if you have any suggestions.
They are not throttling, and that’s good, but note that you are operating them at a maximum of 106C, which is very close. Note also that vram occupation is modest, some 5 GB. Try to fiddle with nicehash settings to see if you can achieve near-maximum vram occupation. Another way could be installing fastai/pytorch on Windows and training a big transformer.
An active fan on the backplate could help. Try to put some 10C between throttling temp and your long-time operating temp. That is, less than or equal to 100C.
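To put the numbers from this thread in one place (throttling at 110C, damage at roughly 120C, and the suggested 10C of headroom), here is a small sketch that classifies a memory junction reading. The function name and labels are just illustrative; the thresholds are the ones quoted above, not values read from any driver API.

```python
THROTTLE_C = 110  # throttling temperature quoted in this thread
DAMAGE_C = 120    # approximate damage threshold quoted in this thread
MARGIN_C = 10     # suggested headroom below the throttling temperature


def classify_vram_temp(temp_c):
    """Label a memory junction temperature (in Celsius) against the thresholds above."""
    if temp_c >= DAMAGE_C:
        return "damage risk"
    if temp_c >= THROTTLE_C:
        return "throttling"
    if temp_c > THROTTLE_C - MARGIN_C:
        return "too close for long runs"
    return "ok"


for reading in (95, 100, 104, 106, 110, 121):
    print(reading, classify_vram_temp(reading))
```

By this rule the 102-104C readings reported above are not throttling, but they sit inside the 10C margin.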
Also reducing the operating power (which is very high, >400W) can help a lot.
Anyway, given that the power level is so high, the card is behaving better than your typical 3090… Probably evga employs good quality thermal pads, and a backplate with good thermal capacity.
But please test with near maximum vram occupation. Vram chips are ugly beasts. If a chip is not addressed by the controller, it doesn’t produce heat.
I selected this motherboard so that I could get 4-slot spacing for the 3090s, allowing a larger air gap between them, per Tim Dettmers’ blog.
The case, I guess, is a splurge, but it was also great to have the space for cabling and circulation.
I had started this build using a 1200W Corsair. This did NOT work for 2 GPU training workloads. As soon as I would start certain types of training workloads, the machine would immediately power cycle. I could prevent power cycling by locking gpu clocks (nvidia-smi -lgc 1600) to something below peak clocks. 1600 MHz worked well for me, YMMV. I didn’t have success with wattage limiting using nvidia-smi, and I think the reason is that the 3090 has power spikes even if you wattage limit, but by limiting the peak frequency, the spikes are reduced. On many training pipelines, this would result in a pretty minimal reduction in overall speed, but I wanted to be able to run the 3090’s flat out, so I migrated to the 1600W PSU, which has solved the power problems.
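For anyone who wants to script the clock lock instead of typing it after each reboot, a minimal sketch follows. It only wraps the nvidia-smi command quoted above, plus the -i (GPU index) and -rgc (reset GPU clocks) options; the helper names are mine, the 1600 MHz value is just what worked on this build, and changing clocks normally needs root/administrator rights.

```python
import shutil
import subprocess


def lock_gpu_clocks_cmd(max_mhz, gpu_index=None):
    """Build the nvidia-smi command that locks GPU clocks to `max_mhz`."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]  # target one GPU instead of all
    cmd += ["-lgc", str(max_mhz)]
    return cmd


def reset_gpu_clocks_cmd(gpu_index=None):
    """Build the nvidia-smi command that removes the clock lock."""
    cmd = ["nvidia-smi"]
    if gpu_index is not None:
        cmd += ["-i", str(gpu_index)]
    cmd += ["-rgc"]
    return cmd


def run(cmd):
    """Run the command if nvidia-smi is on PATH; return True on success."""
    if shutil.which(cmd[0]) is None:
        print("nvidia-smi not found, skipping:", " ".join(cmd))
        return False
    return subprocess.run(cmd, check=False).returncode == 0


# Example: print the commands without executing anything.
print(" ".join(lock_gpu_clocks_cmd(1600)))
print(" ".join(reset_gpu_clocks_cmd(gpu_index=0)))
```

Call run(lock_gpu_clocks_cmd(1600)) before starting training and run(reset_gpu_clocks_cmd()) afterwards if you want the stock behaviour back.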
After resolving the power problems, the system’s been a dream.
At flat out, the lower GPU maintains around 63C and upper GPU around 75C. I will probably invest in one more set of fans to the right of the GPUs to provide more cool intake. The lower fan set is pull and the upper AIO is push, the rear fan is push.
I was not able to successfully get fastai working on Windows. When I installed it following the fastai instructions (with some tweaks) I was getting a warning that the installed cuda version does not work with the 3090. I have made several different attempts at this with different techniques with no luck. Below are the steps I followed in the latest attempt; I recorded everything I did this time.
If you have any other suggestions for benchmarking I’m happy to give them a shot. I was not able to find any settings in NiceHash to increase the Ram utilization from the last temperature test I ran. I have not tried installing WSL2 and going down that rabbit hole, and I’m trying to avoid that to keep my windows install lightweight as it’s not what I use for real AI work anyways.
Sorry for the late response, I didn’t get any notification for this post.
Trust me, WSL2 won’t mess with your Windows installation, apart from occupying some disk space. I’d encourage you to install it, as it’s quite straightforward to work with (no rabbit hole should happen).
When you don’t need it, just leave it shut off.
I just wanted to drop in here and let people know that tons of miners are either adding Copper Plates or Re-Pasting/Re-Padding their GPUs (or both).
The reason is that there are tons of cards with crap thermal paste/pads. coolmygpu.com is a solid choice for copper plates, which can give a significant reduction in heat (up to 34% heat reduction on VRAM).
As for creating a system that can handle multiple GPUs, I have to say that it is definitely a challenge, one that I run into myself. (With 4x 3080TI FEs, it’s hard to find something that is a turnkey solution.)
Also, i see a lot of people mentioning the performance difference between the 3080 and the 3090. And you might want to dig into some nvidia spec sheets. NVIDIA RTX 30-series
The 3080 10GB model has only a 320-bit memory interface and 272 Tensor cores, meaning it can only handle about 760GB/s of memory bandwidth.
Vs the 3080 12GB and 3080TI 12GB models, which both have a 384-bit memory interface (roughly 912GB/s).
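The bandwidth figures follow directly from the bus width: bandwidth in GB/s is bus width in bits times the per-pin data rate in Gbps, divided by 8. A quick sketch of that arithmetic is below. The data rates (19 Gbps for the 3080 and 3080 Ti, 19.5 Gbps for the 3090) are not from this thread; they are the published GDDR6X speeds as I remember them, so double-check them against the spec sheets.

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bits per transfer * Gbps per pin / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


# (bus width in bits, assumed per-pin data rate in Gbps)
cards = {
    "3080 10GB": (320, 19.0),
    "3080 12GB": (384, 19.0),
    "3080 Ti": (384, 19.0),
    "3090": (384, 19.5),
}

for name, (bits, rate) in cards.items():
    print(f"{name}: {memory_bandwidth_gbs(bits, rate):.0f} GB/s")
```

So the wider bus alone accounts for the 20% bandwidth gap between the 10GB 3080 and the 12GB cards.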
For reference:
3080 has 8704 CUDA + 272 Tensor cores and 68 SMs (the 12GB model has 8960 CUDA + 280 Tensor cores and 70 SMs)
3080Ti has 10240 CUDA + 320 Tensor cores and 80 SMs
3090 has 10496 CUDA + 328 Tensor cores and 82 SMs
3090Ti has 10752 CUDA + 336 Tensor cores and 84 SMs
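A quick sanity check on those numbers: every card in the list works out to the same 128 CUDA cores and 4 Tensor cores per SM, so the SM count is really the only thing that differs on the compute side. The sketch below just divides the figures listed above (using the 10GB 3080 numbers); the dictionary layout is my own.

```python
# (CUDA cores, Tensor cores, SMs) as listed above
specs = {
    "3080": (8704, 272, 68),
    "3080Ti": (10240, 320, 80),
    "3090": (10496, 328, 82),
    "3090Ti": (10752, 336, 84),
}

per_sm = {
    name: (cuda // sms, tensor // sms)
    for name, (cuda, tensor, sms) in specs.items()
}

for name, (cuda_per_sm, tensor_per_sm) in per_sm.items():
    print(f"{name}: {cuda_per_sm} CUDA/SM, {tensor_per_sm} Tensor/SM")

# Relative SM count of the 3090 versus the 3080, as a rough compute ratio.
sm_ratio = specs["3090"][2] / specs["3080"][2]
print(f"3090 has {sm_ratio:.2f}x the SMs of the 3080")
```

That ratio ignores clocks, memory bandwidth and VRAM size, so treat it as an upper-level hint rather than a benchmark.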