Local Server GPU Benchmarks

I want to stress that Jeremy recommends that students of the course, especially new students, use one of the cloud options for running the notebooks. The main reason is that they work out of the box with minimal troubleshooting, so by following his advice you will spend more of your time actually learning deep learning. However, if you are up for more troubleshooting and have a GPU on hand, you may want to try a local server.

I have used a local server for several years of doing this course. Five years ago, I built a server with an Nvidia 1080ti. At that time, it took a lot of configuring to get things working correctly. The software has evolved, and the setup is pretty straightforward nowadays if you plan on using Ubuntu: install the OS, install the Nvidia drivers, install fastai, clone the repo, and you are up and running.

So how does a five-year-old 1080ti perform? It can hold its own. For instance, in the first lesson's notebook, it takes about 25 seconds per epoch doing image classification on PETS. Once you get to text processing on IMDB, it takes almost 5 minutes per epoch (4:52) of fine-tuning once you reduce the batch size. Not too shabby, if you ask me.

Earlier this year, I built a new machine with a 3080ti. So how does the recent generation stack up? Well, first let me say that I tried running this in native Windows. With the problems Jupyter/IPython has with multiprocessing there, this test was over before it started. I would NOT recommend trying to run fastai in a native Windows environment. So what can you try on the Windows front? WSL2!

I set up WSL2 much as described in my earlier post on the subject, though the linked instructions have since been updated and are now a bit more straightforward. Basically, if your Windows install is fairly recent, you can just follow Nvidia's steps. Once set up, I ran the same notebooks and found that the 3080ti would run the tests listed above in about half the time: PETS took 11-12 seconds per epoch, and IMDB 2:23 per epoch.

Now, I do not know how long the various cloud options take to run the same tests. I am only making this post so that those tempted to build and run a local server can get a rough sense of GPU performance and training time. With GPU prices coming down from their mining peak, some may want to go this route; you can then work out your break-even point against services that charge after a certain amount of usage time.

I hope this helps and I can’t wait to get back into DL!


Thank you for sharing this comparison. Is GPU performance on WSL2 on the level of native Ubuntu these days? Some time ago I read that it was faster on native Ubuntu. Either way, it sounds like a great option.


I am sure there is some performance degradation in WSL2, as with all hypervisor solutions. Since I virtualize everything except Windows these days, I have no real way to test Ubuntu on bare metal. However, the performance was pretty close in WSL2 when I tested it (two years ago?), and I can only assume it has gotten better since then.


Hi, your post made me curious :thinking: about how some of the GPUs available on the cloud platforms perform, so I quickly benchmarked the PETS image classification and IMDB text classifier models on these GPUs, with and without mixed precision.

Image Classification

from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

## Using mixed-precision

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn = learn.to_fp16()
learn.fine_tune(1)

| GPU Type | Fine Tuning | Full Model | FP16 - Fine Tuning | FP16 - Full Model |
|---|---|---|---|---|
| RTX5000 | 00:24 | 00:30 | 00:14 | 00:16 |
| RTX6000 | 00:18 | 00:21 | 00:12 | 00:14 |
| A5000 | 00:14 | 00:15 | 00:10 | 00:12 |
| A6000 | 00:14 | 00:13 | 00:10 | 00:12 |
| A100 | 00:12 | 00:10 | 00:10 | 00:10 |

Text classifier

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(1, 1e-2)

## Using mixed-precision
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn = learn.to_fp16()
learn.fine_tune(1, 1e-2)

| GPU Type | Fine Tuning | Full Model | FP16 - Fine Tuning | FP16 - Full Model |
|---|---|---|---|---|
| RTX5000 | 01:48 | 03:23 | 00:49 | 01:26 |
| RTX6000 | 01:17 | 02:25 | 00:41 | 01:08 |
| A5000 | 00:53 | 01:32 | 00:36 | 01:02 |
| A6000 | 00:45 | 01:17 | 00:34 | 00:58 |
| A100 | 00:36 | 00:56 | 00:25 | 00:40 |

Increasing the Batch size

All the above experiments were conducted using the default arguments. A simple tweak of increasing the batch size to 256 improves the results further :smile: by up to 100% on the A100 (a sketch of the tweak follows the results table).

| GPU Type | Fine Tuning | Full Model | FP16 - Fine Tuning | FP16 - Full Model |
|---|---|---|---|---|
| A100 | 00:15 | 00:28 | 00:12 | 00:19 |

Note: durations are in minutes:seconds.
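
For reference, here is a minimal sketch of that batch-size tweak for the PETS run, reusing path and is_cat from the snippet above; the assumed change is simply passing bs=256 to the DataLoaders factory and training in fp16.

# Same PETS pipeline as above, only with a larger batch size (bs=256) and fp16.
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), bs=256)

learn = cnn_learner(dls, resnet34, metrics=error_rate).to_fp16()
learn.fine_tune(1)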


Thanks @VishnuSubramanian. I just quickly ran IMDB on the 3080ti WSL2 setup and was able to get a drop from 2:23/epoch to 0:40/epoch with fp16 and bs=128. I ran out of memory with bs=256.

The 1080ti will do fp16 as well, and I will run that here in a bit, likely editing this post.
EDIT:

It seems that using fp16 and bs=128, you can get performance out of a 1080ti close to that of a 3080ti running non-fp16 with bs=32. Interesting results.

If anyone has code that can quickly use multiple GPUs (Hugging Face Accelerate) that they can share, I will try that as well.


I guess that with a 1080ti you may not find much of a performance gain with fp16, as it does not have the specialized cores required for acceleration.


I think the 1080ti was the only 10-series card that could do to_fp16(); I don't think the likes of the 1070 or 1060 can. While the 1080ti lacks the tensor cores of newer card generations, I am unsure whether the fastai code is written to actually use those tensor cores. Can @jeremy, @sgugger, or @muellerzr chime in on that?

Looking at the source code here, fastai mixed precision uses PyTorch's autocast and GradScaler; that is why we see the improvement gains with fp16.
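
For anyone curious what that looks like outside of fastai, here is a minimal, hand-written sketch of the standard autocast/GradScaler training-loop pattern from the PyTorch docs; the toy model and random data are just placeholders to make it runnable, not anything taken from fastai's internals.

import torch
from torch import nn

# Toy setup so the pattern is runnable (assumes a CUDA device is available).
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    inputs = torch.randn(32, 10, device='cuda')
    targets = torch.randint(0, 2, (32,), device='cuda')
    optimizer.zero_grad()
    # The forward pass runs in mixed precision under autocast.
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(inputs), targets)
    # Scale the loss to avoid fp16 gradient underflow, then step and update the scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()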


And autocast and GradScaler come from the amp package: Automatic Mixed Precision package - torch.cuda.amp — PyTorch 1.11.0 documentation. Thanks for doing the digging, @VishnuSubramanian!

For HF accelerate you can use this code here, which I wrote for a blog.
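
(That blog's code isn't reproduced here, but as a rough idea, the core Accelerate pattern looks something like the sketch below, with a toy model and data standing in for real training code; launch it with accelerate launch script.py to spread it across the available GPUs.)

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy model and data; real training code goes in their place.
accelerator = Accelerator()
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
dl = DataLoader(TensorDataset(torch.randn(256, 10),
                              torch.randint(0, 2, (256,))), batch_size=32)

# prepare() wraps the model, optimizer and dataloader for whatever devices are available.
model, optimizer, dl = accelerator.prepare(model, optimizer, dl)

for inputs, targets in dl:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()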


I did some comparisons of Colab vs a local 3090 in another thread. Here's a link (Colab model learning speed - #2 by matdmiller) if you're interested. Colab performance varied greatly depending on whether you got assigned a K80 or a T4; the OP was trying to figure out why his Colab performance varied significantly from day to day. I am running locally on Ubuntu 18.04 with Docker for fast.ai on a machine I built about 5 years ago and recently upgraded with a 3090.

Thanks for this, @FourMoBro and @VishnuSubramanian. I have been meaning to get around to benchmarking the 3090 system I built a while back for DL. I will run this soon and update with my results! :slight_smile:


Please don’t at-mention the forum admins except for things that can only be addressed by those specific people (e.g. where some administrative issue needs to be addressed).


And just in case anyone’s interested, here are the numbers for a Xeon E5-2665 8C/16t @ 2.40GHz + 64GB DDR3 + 1070ti (Dell T3600). :sweat_smile:

BTW, it did not complain about to_fp16() even though it's a 1070ti.

EDIT: fixed the earlier version to use fine_tune (thanks @FourMoBro!) and added numbers for IMDB.


Yes, it will not complain. We will not be able to observe much of a boost in performance, but it's a handy trick if you want to increase the batch size.


Thanks for that tip! It actually improved the times quite a bit, and I was able to double the batch size from 32 to 64 without blowing the VRAM on the 1070ti.

IMDB classifier, fp16 (bs=64)

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.464539 | 0.408723 | 0.813080 | 02:39 |

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.272241 | 0.224532 | 0.910240 | 05:28 |

Speaking of performance: I've built a Tensor class that supports RAW files, but now every training run takes about 60 times longer... Given that the Learner gets the same tensor batches and shapes (just with more digits in each value), I can only point to the difference in file size. A 960x960 JPG weighs about 900KB, while a ~3000x~5000 RAW file weighs 16MB.

Could this really be the reason why the Learner class takes longer to finish epochs?
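
For scale, comparing pixel counts rather than just file sizes (using the assumed dimensions above), each RAW image carries roughly 16x the pixels of the JPG, so the dataloader has that much more decoding and resizing to do per item before the GPU even sees a batch:

# Rough comparison using the sizes quoted above (assumed, not measured).
jpg_pixels = 960 * 960          # ~0.9 MP per JPG
raw_pixels = 3000 * 5000        # ~15 MP per RAW
print(raw_pixels / jpg_pixels)  # ~16x more pixels to decode/resize per item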

They can. I had a 1070 and always used it in fp16.
There should be an old post of mine (2017 or 2018) where I posted some benchmarks.

I have an EVGA 1070ti, and it definitely did not complain; I was able to double the batch size to 64 and get 30-40% better performance. I'm a little surprised a 3090 is only 4-5x faster, but maybe it shines on bigger datasets. All those CUDA cores don't do much unless they're fed properly, I suppose :slight_smile:


Only? :smiley:

Consider that even a 2x (100%) speedup has dramatic consequences when you have to train something substantial, e.g. the unsupervised phase for NLP on a big corpus, or vision with a big network and high-res images. Years ago I trained an EfficientNet-B7 on high-res images for a medical application... it took days on a DGX Station. Time is money, and 4-5x faster is a very big deal.

But the other main point of these newer GPUs is the amount of VRAM. If your GPU is slow, you just wait; if your model doesn't fit in VRAM, there is nothing you can do (except perhaps play with gradient accumulation). There is a rough threshold for the batch size below which you fail to attain decent accuracy.
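
(As an aside, here is a minimal sketch of that gradient accumulation workaround, assuming fastai's GradientAccumulation callback: train with small per-step batches but accumulate gradients until roughly 128 items have been seen, approximating a much larger effective batch on a VRAM-limited GPU.)

from fastai.vision.all import *

# Sketch only: small physical batches (bs=16) with gradients accumulated over ~128
# items per optimizer step, for GPUs whose VRAM cannot hold the larger batch directly.
path = untar_data(URLs.PETS)/'images'
def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224), bs=16)

learn = cnn_learner(dls, resnet34, metrics=error_rate,
                    cbs=GradientAccumulation(n_acc=128))
learn.fine_tune(1)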
