Is my GPU being used?

How do I tell if my GPU is being used? I have a hunch that it isn’t currently for a few reasons:

  1. The fan on it isn’t spinning up.
  2. The model is going really slow.

I can’t find a way to verify my assumption though. I looked at nvidia-smi and I don’t see python in the process list, which I think I would see if the GPU were being used?

Thanks in advance for any advice!

Check this thread


Exactly what I was looking for. Thanks for pointing me to it.

1 Like

I am still having trouble with this: I don’t appear to have the torch.cuda.get_device_name(0) command. I tried importing it specifically, but I get the following error when I try to use it.

AttributeError                            Traceback (most recent call last)
<ipython-input-16-35e676d96af5> in <module>()
----> 1 torch.cuda.get_device_name(0)

AttributeError: module 'torch.cuda' has no attribute 'get_device_name'

And the other thing I tried was the following line:

from torch.cuda import get_device_name

This also did not work.

All of the other torch.cuda commands seem to be working, though, which is strange.

Do you have CUDA installed?

e.g. run nvcc --version in your terminal
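You can also check from Python directly. Here is a minimal sketch that degrades gracefully: it reports whether PyTorch is importable at all, and if so whether it can see a CUDA device (the function name cuda_status is just mine for illustration):

```python
import importlib.util

def cuda_status():
    """Report whether PyTorch is installed and whether it can see a CUDA GPU."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    if not torch.cuda.is_available():
        return "torch is installed but CUDA is not available"
    # Device 0 is the first GPU that nvidia-smi lists
    return "CUDA available on: " + torch.cuda.get_device_name(0)

print(cuda_status())
```

If this prints the "not available" line while nvcc works, your PyTorch build itself is probably CPU-only.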

1 Like

I am pretty confident that I do. I can run the nvidia-smi command with no issues, and now that I’ve rerun nvidia-smi I actually do see python in the list. So I’m wondering if it is using the GPU, but the fans just aren’t running because of the colder environment?

I get this from nvcc --version, so I am just going to call it good unless things still seem to be running slowly.

Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

Hmm, you could just keep polling with watch nvidia-smi -q -g 0 -d UTILIZATION -l, and then run a long-running task in your notebook.

Also depending on your GPU, fans only spin up to keep cool. If your temps are low, the fans will be set to a lower RPM.
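If you’d rather log a number from the notebook itself, something like this works. It’s a sketch that shells out to nvidia-smi with the --query-gpu flags and returns None when the tool isn’t on PATH or the query fails:

```python
import shutil
import subprocess

def gpu_utilization(index=0):
    """Return GPU utilization in percent via nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if out.returncode != 0:
        return None
    # With noheader/nounits the output is just a bare integer like "42"
    return int(out.stdout.strip())

print(gpu_utilization())
```

Call it before and during a training loop; a jump from ~0 to a high percentage means the GPU really is doing the work, fans or not.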

Yeah, it has gotten colder here recently so I’m wondering if the colder weather is keeping the fans off for a longer period of time.

Try this tool?

Works great on the AWS instances

I like the command: watch --color -n1.0 gpustat --color

1 Like

looks like @KevinB is using the same setup as I am, Ubuntu + 1080 :wink:

1080 Ti, it has been great so far. The best thing about it is that I don’t have any excuse for poor performance on account of my machine, so I know it’s my fault if there is an issue.


I’d advise you to check the following:

import torch.cuda
if torch.cuda.is_available():
    print('PyTorch found cuda')
else:
    print('PyTorch could not find cuda')

What do you see when you run the above?

Did you follow all instructions while installing CUDA?

Did you set your CUDA_HOME env variable properly? (I am not sure whether PyTorch relies on this, but I believe it’s safe to set.) On my system, for example:

(fastai) as@neuron:~/store/git$ echo $CUDA_HOME
(fastai) as@neuron:~/store/git$
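For reference, a common way to set it on Linux, assuming the default install prefix /usr/local/cuda (adjust the path to wherever your CUDA toolkit actually lives):

```shell
# Assumed default CUDA install prefix; change this if yours differs.
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
echo "$CUDA_HOME"
```

Put those lines in your ~/.bashrc if you want them to persist across shells.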

Have you ever seen it ON?

Personal story:

For a couple of months after purchasing my 1080 Ti (which I had shipped from the US), I was unaware that the fan was in a non-working condition out of the box. I too was under the impression that the fan would kick in only if the GPU felt the need, so I never checked it.

On testing after a couple of months, I found that the fan would not spin under any condition, and the GPU would trip if the temperature exceeded 90C. Getting it replaced would have cost me 1/6th to 1/5th the price of the GPU due to shipping and customs duty (I had to bear the cost of sending it to them and paying the octroi on return - this is EVGA Taiwan).

As a hack, I opened up the top acrylic cover to expose the internal heat sinks. Then set up an arrangement of 3 fans (2 blowers and 1 exhaust) for cooling. Looks badass now.



Considering your GPU is not really being used, I find your temperatures really high. For instance, here is a capture of nvidia-smi from my GTX 1080 Ti SCII from EVGA:

With my ambient temperature around 15C.
I don’t know which model of GTX 1080 Ti you have, but keep in mind that the higher the voltage set on your card (either by default or manually), the higher your temps will inevitably be.
I don’t know how to retrieve the GPU core voltage on Linux, but here are the power-draw results in watts from the nvidia-smi -i 0 -q -d POWER command:

==============NVSMI LOG==============

Timestamp                           : Wed Nov  8 16:38:18 2017
Driver Version                      : 384.59

Attached GPUs                       : 1
GPU 00000000:01:00.0
    Power Readings
        Power Management            : Supported
        Power Draw                  : 10.46 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Power Samples
        Duration                    : 116.01 sec
        Number of Samples           : 119
        Max                         : 11.23 W
        Min                         : 9.20 W
        Avg                         : 10.14 W

Hope it helps… somehow :slight_smile:

1 Like

Also, when the card reaches a threshold temperature (which you can get with nvidia-smi -i 0 -q -d TEMPERATURE), your performance will inevitably decrease, as the GPU automatically downclocks itself so as not to reach the “Shutdown temperature”. Here are my results:

~ ➜ nvidia-smi -i 0 -q -d TEMPERATURE

==============NVSMI LOG==============

Timestamp                           : Wed Nov  8 16:41:47 2017
Driver Version                      : 384.59

Attached GPUs                       : 1
GPU 00000000:01:00.0
        GPU Current Temp            : 26 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        Memory Temp                 : N/A

I already stress tested my card (which is the first thing I do when I receive a new one) and it never went above 80C, even during summer when ambient temperatures were around 30C.
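If you want to read those thresholds programmatically, a small parser over the -q -d TEMPERATURE output is enough. This is just a sketch over the "key : value" lines, run here against a pasted sample rather than a live query (parse_temps is my own helper name):

```python
# Sample lines in the shape nvidia-smi -i 0 -q -d TEMPERATURE prints.
SAMPLE = """\
        GPU Current Temp            : 26 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        Memory Temp                 : N/A
"""

def parse_temps(text):
    """Extract integer Celsius readings from nvidia-smi temperature output."""
    temps = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        if value.endswith(" C"):  # skips N/A and non-temperature lines
            temps[key.strip()] = int(value[:-2])
    return temps

print(parse_temps(SAMPLE))
```

To use it live, feed it the stdout of the nvidia-smi command instead of SAMPLE, then compare Current against Slowdown to see how much headroom you have.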

1 Like

What you can do instead is replace the cooling block on your card with a watercooling one. Like this one. Very clever of you to find this hacky solution, btw, but I don’t think it will hold up if you use your GPU at 100% for days (well, you tell me aha :smiley: ).

Haha how I wish I could fit a water cooling block :smile:

The deterrent was that I saw videos of how these are fitted. It involves opening up the GPU (50 to 80 screws: big, small, and tiny) and exposing the chip. Given that there are no local service centers in my country, this was a very risky proposition. Very high chance of making mistakes :slightly_frowning_face:

This solution holds at 85C most of the time, though I am making use of my AWS credits first :slight_smile:

Oh okay, sad to hear that :frowning: But 50 to 80 screws? That sounds insane. Usually when you take a GPU apart you have 3 parts:

  • The board
  • The “radiators” (the metal grid)
  • The plastic mount (the thing on top of the metal grid)

You only need to unmount the part between the board and the metal grid to put a new one on. See here (I’ve set the right timer). But well, if your solution works for you then kudos for that :smiley: . I’ll keep it in mind for when my own card dies.

Btw, just a random thought: maybe your fans aren’t spinning because the power connector between your GPU board and your plastic mount (where the fans are) isn’t properly connected? (You can see what I’m talking about in the video linked above.) Did you check that you used the right power cables from your power supply on your card? Having 1 fan not working is ok, but having both not working looks like a power-delivery issue.

Yeah lots of screws :frowning:

Yes, the connection can be a problem. Unfortunately the one I have (EVGA 1080 Ti FE) has a blower-style fan, and the connection goes inside the GPU. I would have to undo all 50 screws to check this, unfortunately.

So I will probably take a chance once the card gets a bit older :slight_smile: