For those who run their own AI box, or want to set one up.

How it started:

Thanks to the help in this forum, I was able to set up an old GTX 1070.

How it’s going:

I managed to fine-tune a ‘vit_large_patch16_224’ on this GPU with only 8 GB of memory.

As you can see in the image, I had to use a batch size of 4 (!) and GradientAccumulation (thanks to Jeremy’s lesson 7 and the live coding sessions).
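For anyone who wants to try the same thing, here is a minimal sketch of that setup in fastai. Everything except bs=4 and the GradientAccumulation callback is a stand-in (the Pets dataset, the label function, the accumulation target of 64, the epoch count), so adapt it to your own data:

from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(f): return f[0].isupper()  # stand-in label function from the course

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224),
    bs=4)  # tiny batch so the large ViT fits in 8 GB of VRAM

# accumulate gradients until 64 samples have been seen (64/4 = 16 batches),
# so the optimizer steps as if the batch size were 64
learn = vision_learner(dls, 'vit_large_patch16_224', metrics=error_rate,
                       cbs=GradientAccumulation(64)).to_fp16()
learn.fine_tune(3)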

Some additional observations

[image: per-epoch training output]

Overclocking your GPU is potentially risky

Power, Temperature and Time

vit_large_patch16_224

This large model took about 16:30 (mm:ss) per epoch.

Epoch 10 took 33:10. That happened because some models, at times, won’t train at full GPU capacity; after a while it went back to using the full GPU.

I would like to run some more (and more consistent) experiments comparing temperature, training time, and power consumption at different GPU power limits. Limiting the power was (or is, I don’t know) a must for crypto mining. The effect can be observed on this model, but from what I have seen, small models can get even bigger drops in temperature (and consumption); large models like this one are hungry for power.

The GPU temperature for this model is relatively high: about 62–64 °C.

I limited the power to 90% around epoch 12/13, and you can see that after epoch 13 each epoch took about 30 seconds longer to train. But it also limited the heat a little, which is also good for reducing fan noise.
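For reference, this is the kind of power limiting I mean. A sketch, assuming a GTX 1070 whose default board power is 150 W (so a 90% cap is roughly 135 W); query your own card’s limits first, and note that setting the limit needs root:

nvidia-smi --query-gpu=power.default_limit,power.min_limit,power.max_limit --format=csv
sudo nvidia-smi -pl 135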

[image: GPU power and temperature plot]

In this plot it is clear that a memory usage peak is somehow what kept the GPU from using all its power. At that point I stopped the measurement (which I regret); when the GPU started running at full power again, I started measuring again. The drops in temperature and power consumption are clear.

You could count the epochs in those plots by looking at the patterns.

swinv2_base_window12_192_22k

Just for comparison, here is a cooler model with the power limited to 80%. I made another measurement for this same model without limiting the power, and it trained at about 2 °C above the 55.74 °C observed here.

resnet34

A resnet34 trains with little power consumption and runs very cool. I’m not even sure whether the 80% power limit made any difference.

GPU Clock and Memory Clock

For crypto mining these values (in addition to the power limit) are crucial to getting the most out of the GPU. Back in 2017 I was hesitant to overclock, but it turned out that the GPU worked more efficiently, cooler, and with less noise once I found the appropriate tweak for each coin. Each coin has its own algorithm, and each algorithm has its own GPU optimizations; some were more power-intensive or more unstable than others.

From what I have seen so far, deep learning models behave differently: I haven’t found any difference between GPU or memory clock settings, unlike the clear effect of limiting the power.

Command for querying the data

Setting the loop parameter to 1 makes the query run once per second:

nvidia-smi --query-gpu=timestamp,power.draw,memory.total,utilization.memory,temperature.gpu,clocks.current.graphics,clocks.current.sm,clocks.current.memory --loop=1 --format=csv --filename=filename.csv
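If anyone wants to reproduce the plots from that CSV, here is a rough pandas/matplotlib sketch. It assumes the columns produced by the command above; nvidia-smi writes a space after each comma and embeds units in the values, so a little cleanup is needed (and the exact column names can vary between driver versions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('filename.csv', skipinitialspace=True)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['power_w'] = df['power.draw [W]'].str.rstrip(' W').astype(float)  # "123.45 W" -> 123.45
df['temp_c'] = df['temperature.gpu'].astype(float)

fig, ax1 = plt.subplots()
ax1.plot(df['timestamp'], df['temp_c'], color='tab:red')
ax1.set_xlabel('time')
ax1.set_ylabel('temperature (°C)', color='tab:red')
ax2 = ax1.twinx()  # second y-axis so power and temperature share the time axis
ax2.plot(df['timestamp'], df['power_w'], color='tab:blue')
ax2.set_ylabel('power draw (W)', color='tab:blue')
plt.show()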

A 1070 Ti is the only GPU I have, and will have for some time, so sticking with it seems reasonable. When epoch timings get annoyingly long I try to run stuff on Jarvislabs etc., but for most quick explorations it’s still pretty decent. I don’t know enough about overclocking/undervolting to mess with it, and the stock setup seems to work for me. These cards are still usable, and I get unlimited time on my home machine compared to the free GPUs on Kaggle (they seem comparable performance-wise).


In May I installed fastai 2.6.3 this way.

I did search, but I’d prefer to ask here: what’s the best way to update that fastai installation? Thanks!

$ mamba --help 

shows…

update - Updates conda packages to the latest compatible version.

so I presume the following would do it…

mamba update -c fastchan fastai nbdev

I’m not sure whether the “-c fastchan” is required, or whether the package manager remembers where a package was originally installed from and defaults to updating from there. I imagine the latter is true, but someone else will need to confirm that.
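For what it’s worth, you can at least see which channel each package originally came from; the last column of the list output shows the channel:

mamba list fastai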


It doesn’t, unfortunately.


You can use the same commands to update as you used to install.
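For example, assuming the original install used fastchan, re-running the same line should pull the newest compatible versions:

mamba install -c fastchan fastai nbdev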


I’m on Ubuntu 22.04.1 LTS.
Calling search_images_ddg() gives URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:1129)>

I read it’s something to do with SSL on Ubuntu, but I have no idea how to fix it.

mamba list shows openssl 1.1.1q and pyopenssl 22.0.0.

EDIT: Hmmm, after disabling my VPN I now get into the loop inside search_images_ddg(), but I get 403: Forbidden every time on data = urljson(requestUrl, data=params).
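In case it helps: one workaround others have used when the raw DDG endpoint starts returning 403s is to query DuckDuckGo through the duckduckgo_search package instead of fastbook’s search_images_ddg. A sketch, assuming pip install duckduckgo_search and the ddg_images API from the versions around that time (it may have changed since):

from duckduckgo_search import ddg_images

# each result is a dict; the 'image' key holds the full-size image URL
results = ddg_images('grizzly bear', max_results=100)
urls = [r['image'] for r in results]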