Local Server GPU Benchmarks

And autocast and GradScaler come from the AMP package: Automatic Mixed Precision package - torch.cuda.amp — PyTorch 1.11.0 documentation. Thanks for doing the digging @VishnuSubramanian
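
For reference, a minimal sketch of where those two fit in a plain PyTorch loop (toy model and dummy data, just to show the pattern; not copied from the docs):

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Linear(128, 10).cuda()                        # toy stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()                                    # manages dynamic loss scaling

xb = torch.randn(32, 128, device='cuda')                 # dummy batch
yb = torch.randint(0, 10, (32,), device='cuda')

optimizer.zero_grad()
with autocast():                                         # forward pass in mixed precision
    loss = loss_fn(model(xb), yb)
scaler.scale(loss).backward()                            # scale loss to avoid fp16 underflow
scaler.step(optimizer)                                   # unscales grads, then optimizer step
scaler.update()                                          # adapts the scale factor for next step
```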

For HF accelerate you can use this code here, which I wrote for a blog.
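
To give the idea in case the link goes stale, here's a minimal sketch of the Accelerate pattern (this is not the blog code; note that older accelerate releases used Accelerator(fp16=True) instead of the mixed_precision argument):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

model = nn.Linear(128, 10)                               # toy model and data for the sketch
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
dl = DataLoader(TensorDataset(torch.randn(128, 128),
                              torch.randint(0, 10, (128,))), batch_size=32)

accelerator = Accelerator(mixed_precision='fp16')        # enables AMP under the hood
model, optimizer, dl = accelerator.prepare(model, optimizer, dl)  # moves everything to device

for xb, yb in dl:
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    accelerator.backward(loss)                           # applies loss scaling when needed
    optimizer.step()
```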

I did some comparisons on Colab vs a local 3090 in another thread. Here’s a link (Colab model learning speed - #2 by matdmiller) if you’re interested. Colab performance varied greatly depending on whether you got assigned a K80 or a T4. The OP was trying to figure out why his Colab performance varied significantly from day to day. I am running locally on Ubuntu 18.04 w/ Docker for fast.ai on a machine I built about 5 years ago and recently upgraded w/ a 3090.

Thanks for this @FourMoBro and @VishnuSubramanian. I have been meaning to get around to benchmarking the system I built around a 3090 a while back for DL. Will run this soon and post my results! :slight_smile:

Please don’t at-mention the forum admins except for things that can only be addressed by those specific people (e.g. where some administrative issue needs to be addressed).

And just in case anyone’s interested, here are the numbers for a Xeon E5-2665 8C/16t @ 2.40GHz + 64GB DDR3 + 1070ti (Dell T3600). :sweat_smile:

BTW, it did not complain about to_fp16() even though it’s a 1070ti.

EDIT: fixed the earlier version to use fine_tune (thanks @FourMoBro!) and added numbers for IMDB.

Yes, it will not complain. You won’t observe much of a performance boost, but it’s a handy trick if you want to increase the batch size.
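
In fastai it’s just a one-liner on the Learner. A minimal sketch (PETS/resnet34 are only illustrative, not the benchmark code from this thread):

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/'images'), pat=r'(.+)_\d+.jpg',
    item_tfms=Resize(224), bs=64)
learn = cnn_learner(dls, resnet34, metrics=accuracy).to_fp16()  # train in mixed precision
learn.fine_tune(1)
```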

Thanks for that tip! It actually improved the times quite a bit, and I was able to double the batch size from 32 to 64 without blowing the VRAM on the 1070 ti.

IMDB classifier fp16 - (bs=64)

epoch  train_loss  valid_loss  accuracy  time
0      0.464539    0.408723    0.813080  02:39

epoch  train_loss  valid_loss  accuracy  time
0      0.272241    0.224532    0.910240  05:28

Speaking of performance,
I’ve built a Tensor class that supports RAW files, but now every training epoch takes about 60x longer… Given that the learner gets the same tensor batches and shapes (just with higher-precision values in them), the only difference I can point to is file size: a 960x960 JPG weighs about 900KB, while a RAW file of ~3000x~5000 weighs 16MB.

Could this really be the reason why the learner class takes longer to finish epochs?

They can. I had a 1070 and always used it in fp16.
There should be an old post of mine (2017 or 2018) where I posted some benchmarks.

I have an EVGA 1070ti and it definitely did not complain, and I was able to double the batch size to 64 and get 30-40% better performance. I’m a little surprised a 3090 is only 4-5x faster, but maybe it shines on bigger datasets. All those CUDA cores don’t do much unless they’re fed properly, I suppose :slight_smile:

1 Like

Only? :smiley:

Consider that even a 2X (100%) speedup has dramatic consequences when you have to train something substantial, e.g. the unsupervised phase for NLP on a big corpus, or even vision with a big network and hi-res images. Years ago I trained an efficientnet b7 on hi-res images for medical applications… It took days on a DGX Station. Time is money, and 4-5X faster is a very big deal.

But the other main point of those newer GPUs is the amount of VRAM. If your GPU is slow, you just wait. If your model doesn’t fit in the VRAM, there is nothing you can do (except perhaps playing with gradient accumulation). There is a kind of threshold for the batch size, below which one fails to attain decent accuracies.
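
For completeness, fastai ships a GradientAccumulation callback for exactly that case. A rough sketch (dataset, arch and numbers purely illustrative):

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/'images'), pat=r'(.+)_\d+.jpg',
    item_tfms=Resize(224), bs=16)            # small physical batch that fits in VRAM
learn = cnn_learner(dls, resnet34, metrics=accuracy)
# the optimizer steps only after gradients from ~64 samples have accumulated,
# mimicking a larger effective batch size without the VRAM cost
learn.fine_tune(1, cbs=GradientAccumulation(n_acc=64))
```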

I think what I was trying to say was that I was surprised that a 1070ti was only 4-5x slower. :sweat_smile:

I guess I had some (rather naïve) notion in my head that a 3090 with “so many cores” would be ‘exponentially’ better … so the thing that surprised me a bit was that the relationship is almost linear (~5x perf for ~5x cores … and without ~5x the power draw, actually).

But you are absolutely right, 4-5x is quite dramatic, especially for the bigger jobs, and I think that’s where cards like the 3090 (and soon the 4090) shine, with their faster, larger VRAM and the ability to move more data off main RAM and disk storage in less time.

I’m getting the following error running the text classifier code from above. I have updated to the latest version of fastai locally (2.5.6), so I’m not sure what the issue is… Any thoughts on what this could be?

It works for me when I do “from fastai.text.all import *” before creating the dataloader. Just importing fastai doesn’t seem to do it for me either.
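
i.e., a minimal end-to-end sketch of what works for me (assuming the IMDB classifier from earlier in the thread; hyperparameters illustrative):

```python
from fastai.text.all import *   # plain `import fastai` won't bring in the text API

path = untar_data(URLs.IMDB)
dls = TextDataLoaders.from_folder(path, valid='test', bs=64)
learn = text_classifier_learner(dls, AWD_LSTM, metrics=accuracy).to_fp16()
learn.fine_tune(1)
```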

Thanks @mike.moloch. That worked :slight_smile:

Hey folks,
would you please run Jeremy’s NLP starter notebook (Kaggle) on your local server?
The training cell takes 57 seconds to complete on the A6000 (250W, Linux) and 7:44 minutes on the 2060 Super (WSL2).
I’m curious about your results.

I’m trying to run it locally: I set my creds and installed kaggle, but I get an error. I noticed the notebook doesn’t import anything from the kaggle package. Do I need to import api or something?

      4 if not iskaggle and not path.exists():
----> 5     api.competition_download_cli(str(path))
      6     ZipFile(f'{path}.zip').extractall(path)

NameError: name 'api' is not defined

EDIT:

OK, so I got it to work. The code in the notebook wasn’t creating the kaggle.json file properly (it was empty), and then I had to import kaggle in the cell where it downloads the data (where I got the error previously).
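
i.e., something like this at the top of the download cell (a sketch; `path` is whatever the notebook’s earlier setup cells defined):

```python
from kaggle import api                    # importing runs authentication against kaggle.json
api.competition_download_cli(str(path))   # `path` comes from the notebook's setup cell
```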

BUT

Now I’m getting
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

AttributeError: 'SentencePieceProcessor' object has no attribute 'encode'

Check your installed version of sentencepiece; I have 0.1.96 and it works (a quick way to check is sketched after the list below).
Other package versions I have that may be related:

datasets                              1.18.4
huggingface-hub                       0.4.0
transformers                          4.16.2
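
A quick check, for what it’s worth:

```python
import sentencepiece
print(sentencepiece.__version__)   # e.g. 0.1.96
```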

Install the transformers package with pip and, if required, sentencepiece too :wink:
