Comparison between .to_fp16() and .to_fp32() with MNIST_SAMPLE on RTX 2070


(hogan) #1

Comparison between .to_fp16() and .to_fp32() with MNIST_SAMPLE

Test environment:

First, run the following to collect the software and hardware info:

```python
from fastai.utils.collect_env import *
show_install(1)
```

```text
=== Software === 
python        : 3.6.6
fastai        : 1.0.39
fastprogress  : 0.1.18
torch         : 1.0.0
nvidia driver : 410.93
torch cuda    : 10.0.130 / is available
torch cudnn   : 7401 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7951MB | GeForce RTX 2070

=== Environment === 
platform      : Linux-4.15.0-43-generic-x86_64-with-debian-stretch-sid
distro        : Ubuntu 16.04 Xenial Xerus
conda env     : step
python        : /home/hogan/anaconda3/envs/step/bin/python
sys.path      : /home/hogan/anaconda3/envs/step/lib/python36.zip
/home/hogan/anaconda3/envs/step/lib/python3.6
/home/hogan/anaconda3/envs/step/lib/python3.6/lib-dynload

/home/hogan/anaconda3/envs/step/lib/python3.6/site-packages
/home/hogan/anaconda3/envs/step/lib/python3.6/site-packages/IPython/extensions
/home/hogan/.ipython

Thu Jan 24 20:55:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:05:00.0  On |                  N/A |
| 31%   30C    P8    16W / 185W |    394MiB /  7951MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1596      G   /usr/lib/xorg/Xorg                           167MiB |
|    0      4521      G                                                 13MiB |
|    0      4925      G   /usr/bin/gnome-shell                         125MiB |
|    0      8110      G   ...AAAAAAAAAIAAAAAAAAAAgAAAAAAAAA --servic    42MiB |
|    0     16638      G   ...-token=543973B623488F4E2BC9F280625EB8CA    33MiB |
+-----------------------------------------------------------------------------+

```


Train MNIST_SAMPLE with .to_fp16()

  1. Run the following code in a Jupyter notebook:

```python
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy]).to_fp16()
learn.fit_one_cycle(5)

# confirm the parameters are stored as half-precision (fp16) tensors
for p in model.parameters():
    print(p.type())
```

Total time: 00:18

| epoch | train_loss | valid_loss | accuracy |
|-------|------------|------------|----------|
| 1     | 0.202413   | 0.136330   | 0.949460 |
| 2     | 0.102659   | 0.092405   | 0.970559 |
| 3     | 0.077279   | 0.069442   | 0.974975 |
| 4     | 0.064258   | 0.059099   | 0.979392 |
| 5     | 0.062370   | 0.058341   | 0.978901 |

```text
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
```

The GPU memory and system memory usage are as follows:

  • GPU Memory 725M
  • Memory 2862M
  • Total time: 00:18
  2. Run Kernel -> Restart, then test with .to_fp32().

Train MNIST_SAMPLE with .to_fp32()

```python
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy])
learn.fit_one_cycle(5)

# confirm the parameters are stored as single-precision (fp32) tensors
for p in model.parameters():
    print(p.type())
```

Total time: 00:18

| epoch | train_loss | valid_loss | accuracy |
|-------|------------|------------|----------|
| 1     | 0.205327   | 0.134014   | 0.951423 |
| 2     | 0.094773   | 0.072822   | 0.974975 |
| 3     | 0.066074   | 0.059597   | 0.978410 |
| 4     | 0.049261   | 0.045921   | 0.984789 |
| 5     | 0.045165   | 0.045733   | 0.985280 |

```text
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
```

The GPU memory and system memory usage are as follows:

  • GPU Memory 723M
  • Memory 2868M
  • Total time: 00:18

Conclusion

Dataset: MNIST_SAMPLE

| method     | GPU memory | Memory | Total time |
|------------|------------|--------|------------|
| to_fp16()  | 725M       | 2862M  | 00:18      |
| to_fp32()  | 723M       | 2868M  | 00:18      |

So, is there anything wrong? I cannot see any improvement in either GPU memory usage or training time…
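
As a rough cross-check, one could also time a run and read peak GPU memory from PyTorch itself rather than from nvidia-smi. The sketch below is only illustrative and assumes the same fastai v1 / MNIST_SAMPLE setup as above:

```python
# Hedged sketch: measure wall time and peak GPU memory allocated by PyTorch
# for the fp16 run. Drop .to_fp16() to repeat the measurement for fp32.
import time
import torch
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
learn = Learner(data, simple_cnn((3,16,16,2)), metrics=[accuracy]).to_fp16()

start = time.time()
learn.fit_one_cycle(5)
print(f"wall time: {time.time() - start:.1f}s")
# peak memory allocated for tensors (excludes the CUDA context and cache overhead)
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MiB")
```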


(Willismar Medeiros) #2

In my understanding, you would need to compare at the pure software level before trying to compare a whole chain of software like fastai, which sits on top of PyTorch, which sits on top of cuDNN, and so on.

Also, in my understanding, the change from fp32 to fp16 effectively doubles the memory available for your computations, since each value takes half the space, so less bandwidth is needed to finish a given job on the GPU.
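
As a quick sanity check of that point, a tiny plain-PyTorch snippet (illustrative only, not tied to any particular training setup) confirms that each fp16 value occupies half the bytes of an fp32 value:

```python
# Check the per-element storage of fp32 vs fp16 tensors.
import torch

print(torch.zeros(1, dtype=torch.float32).element_size())  # 4 bytes per value
print(torch.zeros(1, dtype=torch.float16).element_size())  # 2 bytes per value
```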

What version of PyTorch are you using: the nightly build (1.1) or the release (1.0)? Another thing: is your setup using LAPACK or not?


(Fabrizio) #3

to_fp16 works fine! Try again, but between the two runs restart the kernel, or set the learner to None and call the garbage collector. Most likely, though, the dataset is too small for you to see any difference, IMO. Better to try a bigger dataset and see what happens.
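
A minimal sketch of that cleanup between runs might look like the following (the torch.cuda.empty_cache() call is an extra step beyond what the post mentions, so that the freed memory also shows up in nvidia-smi):

```python
# Free the GPU memory held by the previous run before starting the next one.
import gc
import torch

learn = None               # drop the reference to the previous Learner
gc.collect()               # let Python reclaim the now-unreferenced objects
torch.cuda.empty_cache()   # return cached GPU memory to the driver
```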


(mmm wax) #4

Jeremy just retweeted a link (credit: Sanyam Bhutani) that shows the differences between FP16/32 on various Resnet models. Indeed, the smaller ones show no benefit at all using MPT, while the larger ones can shave off up to a third of the training time. Of course, I wish that number were closer to half, but at least it’s something. Here’s the link:

Edit: I realize the article is more about the differences between the 1080ti and 2080ti, but for my purposes the 16/32 differences are much more interesting. Noteworthy is that even the 1080ti shows about a 20% improvement on FP16, despite the 1/64 crippling.


(Sanyam Bhutani) #5

Thanks for the mention @crayoneater
I had intended it both as a comparison and as a highlight of MPT.

I had noticed a consistent 1.8x batch_size increase for all of the resnets; however, the speedups showed up only in the later/“deeper” ones.

If you have any questions that I might be able to answer, I’ll try my best :slight_smile:


(Andrea de Luca) #6

@init_27, look at this:

I wonder why this happens. The 1080ti should be heavily crippled (1/32) when it comes to FP16. Still, it achieves a speedup of around 20%. It’s far from the 2080ti’s 35%, but it is still substantial.

Furthermore, what about memory occupation on 1080ti running in mixed precision?

Thanks!


(Sanyam Bhutani) #7

GTX cards do have FP16 support; however, they aren’t optimized for FP16. Source: a Kaggle Noobs Slack discussion.

The memory consumption and GPU utilisation were absolutely similar.

Please let me know if you have any other questions.
Thanks!


(Andrea de Luca) #8

Thanks, Sanyam.

As far as I understand, fp16 performance on Pascal cards is 1/32 of their fp32 performance (at least, this is what Pascal’s specs declare). As soon as they hit the fp16 part of mixed precision training, their throughput should drop dramatically, and consequently so should their overall performance with respect to pure fp32.

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

And:

https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5
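
For anyone who wants to check those arithmetic rates empirically, a rough micro-benchmark sketch like the one below (illustrative only; the matrix size and iteration count are arbitrary) times large fp16 and fp32 matrix multiplications on whatever GPU is present, so the ratio should look very different on a Pascal card than on a Turing card:

```python
# Rough sketch: compare raw fp16 vs fp32 matmul throughput on the current GPU.
import time
import torch

def bench(dtype, size=4096, iters=50):
    x = torch.randn(size, size, device='cuda', dtype=dtype)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        x @ x
    torch.cuda.synchronize()
    return time.time() - start

print(f"fp32: {bench(torch.float32):.2f}s")
print(f"fp16: {bench(torch.float16):.2f}s")
```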

Moreover, the 1080ti’s memory consumption should not be the same in fp16 and fp32, because regardless of performance, everything occupies less memory in fp16. (Or maybe you meant that memory consumption was absolutely similar between the two cards? But then how come the 1080ti didn’t manage to run r152?)

Thanks in advance!
A.


(Sanyam Bhutani) #9

Thanks for sharing.

I’m surprised too now :open_mouth:

It was quite a rudimentary experiment, but I increased the batch_size until the memory was filled to the brim; even an increase of the batch_size by 4 would then cause an OOM. The same was the case for the 1080Ti, and no increase in bs was possible.
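
A rough sketch of that kind of probing, assuming the MNIST_SAMPLE setup from post #1 and with placeholder batch sizes (not the values actually used in the experiment), might look like this:

```python
# Hedged sketch: probe increasing batch sizes until CUDA runs out of memory.
import torch
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
for bs in (256, 512, 1024, 2048):           # placeholder values, not from the thread
    try:
        data = ImageDataBunch.from_folder(path, bs=bs)
        learn = Learner(data, simple_cnn((3,16,16,2)), metrics=[accuracy]).to_fp16()
        learn.fit_one_cycle(1)
        print(f"bs={bs}: OK")
    except RuntimeError:                    # CUDA OOM surfaces as a RuntimeError
        print(f"bs={bs}: out of memory")
        break
    finally:
        learn = None                        # release the Learner before the next size
        torch.cuda.empty_cache()
```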

The 152 test wasn’t done because my friend couldn’t run it due to time constraints.

However, I’m interested in running more tests given what you’ve shared. If there are any suggestions/ideas you’d like me to run, I’ll try to check them as well.

Regards,
Sanyam.


(Thomas) #10

It would be really nice to compare against the Titan cards. Nvidia is claiming that the RTX 2080Ti is crippled at 0.5x performance in mixed precision training; apparently it is not true.


I would really like to see a mixed precision training comparison between RTX cards; for the price of one Titan you actually get two 2080Tis.


(mmm wax) #11

Hi Thomas, the RTX 20xx half-speed crippling in MPT is only on the 32-bit accumulate step. The actual FP16 part runs at full speed. Also, the RTX Titan does not have this crippling, only the 20xx cards.


(Thomas) #12

Yes, I know; I have an RTX 2080Ti, and it is fast! But I have not tested it against the RTX Titan.