GPU Cooling Solution

Hi all,

I have my own deep learning box set up with Ubuntu 16.04 / Win 10 dual boot and a single 1080 Ti.

I noticed that my GPU fans do not automatically spin faster even as my GPU temperature approaches 85 C and higher. From what I read, keeping the temperature around 70 C will help preserve the longevity of the card. Furthermore, the GPU will automatically throttle its speed when temperatures get too high.
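
For anyone who wants to watch for this on their own machine, here is a minimal sketch that flags hot GPUs from `nvidia-smi` query output. The function name `flag_hot_gpus` and the 80 C default threshold are my own illustrative choices, not something from the guide.

```shell
#!/usr/bin/env bash
# Print a line for each GPU whose temperature is at or above a threshold.
# Input (stdin): lines of "index, temperature, fan%" as produced by
#   nvidia-smi --query-gpu=index,temperature.gpu,fan.speed --format=csv,noheader,nounits
# flag_hot_gpus is a hypothetical helper; 80 C is an arbitrary default.
flag_hot_gpus() {
    local threshold="${1:-80}"
    awk -F', *' -v t="$threshold" \
        '$2 + 0 >= t { printf "GPU %s: %s C, fan %s%%\n", $1, $2, $3 }'
}

# Usage on a machine with the NVIDIA driver installed (checks every 5 s):
#   while sleep 5; do
#       nvidia-smi --query-gpu=index,temperature.gpu,fan.speed \
#                  --format=csv,noheader,nounits | flag_hot_gpus 80
#   done
```

If a hot GPU keeps showing a low fan percentage, the fan is probably not ramping up on its own.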

After some research, I came across this guide, which helped me manually control my GPU fan speed. I typically set it to 70-80% when running iterations.

Hope it helps someone else too!

P.S. I used the solution for the setup with a monitor, but it also provides a solution for headless GPUs.

Ian

If you’re still in the market for a GPU, one option is to buy a GPU that is water-cooled such as the EVGA Hybrid series. I’ve got a 1080 Ti and it hardly ever goes over ~40 C while training (and I’ve never even seen the fan spin up).

GPU pricing has gone bonkers in the last month as the crypto-miners are buying them all up. Newegg is asking $1600 for a hybrid EVGA 1080ti. I think they were around $900 when I was looking at them in December. The 1060 6GB card I bought as a temporary solution at $289 is now $549. I thought prices might drop after Christmas, but apparently my pricing forecast model was missing a feature or two. :face_with_raised_eyebrow:

Yes, GPU prices have gone crazy since October.
But a water-cooled GPU is very convenient: no noise, no throttling. Previously it cost about the same to put a water block on the GPU yourself as to buy the EVGA Hybrid (the price difference was around $120). Now it seems cheaper to buy a water block separately. But you have to install it precisely; otherwise, with that much heat output, local overheating will cause throttling and may damage the card anyway.

On the other hand, you get much more control over temperature with water cooling.

My record so far is 9 C at 100% utilization on a 1080 Ti (funny that below 0 C nvidia-smi shows a temperature error; they probably never expected a GPU to be cooled below 0 C).

Wow, 9 C?? That’s crazy :slight_smile:

Is this with an EVGA or did you install your own water cooling? My EVGA is 17C when it’s idle, which is about room temperature. When running at 100% it tops out at 38C.

No, that’s fully custom, except for the EVGA water block.
It is a water-based (with a little propylene glycol) open loop with a 20 liter exchange reservoir and a heat sink placed outside the house. Freezing temperatures outside help cool it close to 0 C.

BTW: what is a good temperature for the GPU under full utilization?

Mine (1070) stays around 71-72 C at >90% utilization… this should be okay, I guess?

I believe that is fine. From what I have read, the target max temperature for your GPU is approx. 70-75 C in order to preserve the longevity of the card.

So I’ve been having some serious thermal throttling issues with my 1080 Tis.
Note this was under 99% utilization, with driver persistence enabled.
After some googling and troubleshooting, here is what I’ve read:

  • Thermal shutdown: I’ve heard both 96 C and 105 C
  • Thermal throttling: 94 C
  • Fans turn on at 60 C, which is the stock default. Supposedly this can be changed using nvidia-settings

What I’ve observed…

  • Around 80-83 C I see the power consumption cycle between 115 and 235 watts
  • Time per epoch goes from 27sec to 31sec and then oscillates between 27 and 32 as the power consumption varies.
  • The default fan curve for the card is pretty tame and does little to prevent a thermal runaway.
  • Even with all the gpus dumping heat into the case my cpu with an AIO gets to at most 47 C.
  • The temps reported above were taken after about 12.5 minutes into a 2.25 hour run.

My setup is 4 x 1080 ti 11gb Gigabyte gaming oc on an Asus zenith extreme with a Threadripper 1950x. The high temps are just unacceptable. My plan is to water cool the cpu and the gpus to try to get the temps down to 70-75 max.

I don’t understand how lambda labs can sell a quad gpu system in a corsair air 540 case and expect the gpus to not have the same problems that I’m having. Given their reputation I suspect I’m missing something major.

“Around 80-83 C I see the power consumption cycle between 115 and 235 watts”

That’s expected behavior. When the card is cool enough (<83 C) and has more power available to use (under its power limit), the card will clock up and go faster. Once it hits a thermal limit or power limit, it will slow back down for a bit before repeating this cycle.
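
If you want to confirm that cycle on your own card, a rough sketch is to log temperature, power draw, and SM clock with `nvidia-smi` and then summarize the log. The helper name `summarize_gpu_log` and the file name `gpu0.log` are made up for this example.

```shell
#!/usr/bin/env bash
# Summarize a log of "temperature, power_draw, sm_clock" samples, one per line.
# Such a log can be collected (sampling GPU 0 every 5 s) with:
#   nvidia-smi -i 0 --query-gpu=temperature.gpu,power.draw,clocks.sm \
#              --format=csv,noheader,nounits -l 5 > gpu0.log
# summarize_gpu_log and gpu0.log are hypothetical names for this sketch.
summarize_gpu_log() {
    awk -F', *' '
        NR == 1 { tmax = $1; pmin = $2; pmax = $2 }
        {
            if ($1 + 0 > tmax + 0) tmax = $1   # highest temperature seen
            if ($2 + 0 < pmin + 0) pmin = $2   # lowest power draw seen
            if ($2 + 0 > pmax + 0) pmax = $2   # highest power draw seen
        }
        END { if (NR) printf "max temp %d C, power %.0f-%.0f W\n", tmax, pmin, pmax }'
}

# Usage: summarize_gpu_log < gpu0.log
# For the throttle reasons the driver itself reports, see:
#   nvidia-smi -q -d PERFORMANCE
```

A wide power range together with a max temperature sitting at the limit is consistent with the clock-up / slow-down cycle described above.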

“The default fan curve for the card is pretty tame and does little to prevent a thermal runaway”

Yes, the default curve is great for gaming but if you’re maxing the card out it’s not so good. And personally I prefer longevity over noise output.

I solve this by using a tool called “NVIDIA Profile Inspector”.
I use it to:

  1. Manually set the fan speed from auto to a constant 85%
  2. Lower my power and temperature targets from 100% / 83c to 80% / 75c
  3. Up my base clock and memory clock rates a bit (which is totally optional and depends on your card)

These cards are clocked at the factory to be stable across all the different batches of chips, so your card will likely still run just as quickly with a bit less power. Tutorials on cryptomining can help you with this too.
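
On Linux, the power-target part of step 2 can be approximated with the power limit option of `nvidia-smi`. The sketch below assumes the card lets you change its limit and that you have root; the helper names `scaled_power_limit` and `set_power_target` are made up for illustration, and the 80% figure simply mirrors the list above.

```shell
#!/usr/bin/env bash
# Compute a power limit in watts as a percentage of a default limit.
# scaled_power_limit is a hypothetical helper for this sketch.
scaled_power_limit() {
    local default_watts="$1" percent="$2"
    awk -v w="$default_watts" -v p="$percent" 'BEGIN { printf "%d\n", w * p / 100 }'
}

# Apply the scaled limit to one GPU. Requires root; the setting does not
# survive a reboot. set_power_target is also a hypothetical helper.
set_power_target() {
    local gpu="$1" percent="$2" default_watts
    default_watts=$(nvidia-smi -i "$gpu" --query-gpu=power.default_limit \
                               --format=csv,noheader,nounits)
    sudo nvidia-smi -i "$gpu" -pl "$(scaled_power_limit "$default_watts" "$percent")"
}

# Example: set_power_target 0 80
```

Rerun your benchmark afterwards and compare epoch times to see whether the lower limit costs any speed on your card.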

Are you setting the fan speed using nvidia-settings in Linux? If not, this is the first problem to address. See https://devtalk.nvidia.com/default/topic/1003810/linux/adjust-nvidia-gpu-fan-speed-multiple-gpus-one-monitor-/

The following commands will enable fan speed adjustment (restart X afterwards so the new configuration is picked up):
sudo nvidia-xconfig --enable-all-gpus
sudo nvidia-xconfig --cool-bits=4

You then need to manually adjust the fan speed. Unfortunately, as far as I know, there is no “NVIDIA Profile Inspector” for Linux.

Here’s an example for your .bashrc that will 1) increase fan speed to 85% or 2) reduce fan speed to default. As always, test and use at your own risk:

alias nvidia-fanup='export DISPLAY=:0; nvidia-settings -a GPUFanControlState=1; nvidia-settings -a GPUTargetFanSpeed=85'
alias nvidia-fandown='export DISPLAY=:0; nvidia-settings -a GPUFanControlState=0'

If you are running a 4 GPU system, I would also suggest removing the backplates of the 3 lower cards. The Pascal FE cards have a two-part backplate specifically for this purpose. This gives the fans a little more room to breathe and can lead to a few degrees C of improvement.

Unfortunately they’re not FE cards, no backplate. I will try the commands in your comment.

Do your cards have blower fans (the ones that exhaust hot air out the back)? This is the kind you want for a multigpu setup with air cooling.

Edit to add link for comparison of blower vs nonblower in multigpu setups:

They aren’t. I’ve gotten the parts for a water cooling setup. I’ll update with new thermals once I’ve had time to run the same benchmarks.

In retrospect I should’ve purchased FE cards or some other blower-type card. I’m not entirely clear as to why FE cards are considered less desirable; from what I’ve read online, folks have suggested not getting them.

Wow, 4x open-air GTX 1080 Ti GPUs in one box. You must have some serious case fans to even keep them in the mid 80s. I was using 2 x Gigabyte 1080 Tis (one AORUS Xtreme, one Gaming OC) in one machine, and even with ~6cm between PCIe slots the upper GPU was hitting the mid 90s.

I went down the long and difficult road (for a first timer) of watercooling as I wrote up here: https://medium.com/@bingobee01/watercooling-a-deep-learning-machine-46608f6acfee

As your cards are all the same model, and there is an EK waterblock that will fit these (see uppermost GPU in build pics in link above), you should find it easier than I did routing the watercooling fittings for the middle two GPUs.