How it started:
Thanks to the help in this forum, I was able to setup an old GTX 1070
How it is going:
I managed to fine tuned a ‘vit_large_patch16_224’ using this GPU with only 8GB of memory
As you can see in the image I had to use a batch size of 4! and GradientAccumulation (thanks to Jeremy’s lesson 7 and live coding sessions).
Some additional observations
Overclocking your GPU is potentially risky
Power, Temperature and Time
vit_large_patch16_224
This large model took about 16:30 min per epoch.
Epoch 10 got 33:10. That happened because some models, at some times, wont train at full GPU capacity. But then it re-started using the full GPU again.
I would like to make some more (and more consistent) experiments comparing temperature, times and power consumption. by different limiting values to the power of the GPU. Limiting the power was (or is, I don’t know) a must for crypto mining. An effect can be observed in this model, but for what I have seen, for small models you can get even better drops in temperature (and consumption). There large models are eager in power.
The temperature on the GPU for this model is relatively high: about 62° / 64° C.
I limited the power to 90% at epoch 12/13 and you can see that after epoch 13, each epoch took about 30 seconds more to train. But it also limited the heat a little, which is good also for reducing fan noise.
In this plot it is clear that somehow a memory usage peak is what caused the GPU not being able to use all its power. At that point I stopped the measurement (what I regret), then, when the GPU started at full power again, I also started measuring again. It is clear the drops in temperature and power consumption.
You could count the epochs in those plots by looking at the patterns.
swinv2_base_window12_192_22k
Only to compare, here is a cooler model with power limited to 80%. I made another plot measure for this same model without limiting the power and it trained at about 2° C above the 55.74° observed in this.
resnet34
A resnet34 trains with little power consumption and very cool. I’m not even sure if the 80% power limit made a difference or not.
GPU Clock and Memory Clock
For crypto mining these values (in addition to limiting the power) are crucial to getting the most out of the GPU. Back in 2017 I was hesitant to overclock but it turned out that the GPU worked more efficient, cooler and with less noise by finding the appropriate tweak for each coin. Each coin has its own algorithm, and each one of them has its own GPU optimizations. Some algorithms were more power intensive or more unstable.
For what I have seen so far, with deep learning models is different, I haven’t found difference between different GPU or memory clocks. Like the effect I have seen with limiting the power.
Command for quering the data
The loop parameter set to 1 implies that the query is made at a 1 second frequency…
nvidia-smi --query-gpu=timestamp,power.draw,memory.total,utilization.memory,temperature.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.memory --loop=1 --format=csv --filename=filename.csv