Comparison between .to_fp16() and .to_fp32() with MNIST_SAMPLE on RTX 2070

Test environment:

First, run the following commands to get the software and hardware info.

```python
from fastai.utils.collect_env import *
show_install(1)
```

```text
=== Software === 
python        : 3.6.6
fastai        : 1.0.39
fastprogress  : 0.1.18
torch         : 1.0.0
nvidia driver : 410.93
torch cuda    : 10.0.130 / is available
torch cudnn   : 7401 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7951MB | GeForce RTX 2070

=== Environment === 
platform      : Linux-4.15.0-43-generic-x86_64-with-debian-stretch-sid
distro        : Ubuntu 16.04 Xenial Xerus
conda env     : step
python        : /home/hogan/anaconda3/envs/step/bin/python
sys.path      : /home/hogan/anaconda3/envs/step/lib/python36.zip
/home/hogan/anaconda3/envs/step/lib/python3.6
/home/hogan/anaconda3/envs/step/lib/python3.6/lib-dynload

/home/hogan/anaconda3/envs/step/lib/python3.6/site-packages
/home/hogan/anaconda3/envs/step/lib/python3.6/site-packages/IPython/extensions
/home/hogan/.ipython

Thu Jan 24 20:55:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:05:00.0  On |                  N/A |
| 31%   30C    P8    16W / 185W |    394MiB /  7951MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1596      G   /usr/lib/xorg/Xorg                           167MiB |
|    0      4521      G                                                 13MiB |
|    0      4925      G   /usr/bin/gnome-shell                         125MiB |
|    0      8110      G   ...AAAAAAAAAIAAAAAAAAAAgAAAAAAAAA --servic    42MiB |
|    0     16638      G   ...-token=543973B623488F4E2BC9F280625EB8CA    33MiB |
+-----------------------------------------------------------------------------+

```

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Train MNIST_SAMPLE with .to_fp16()

  1. Run the following code in a Jupyter notebook.

```python
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy]).to_fp16()
learn.fit_one_cycle(5)
for p in model.parameters():
    print(p.type())
```

```text
Total time: 00:18

epoch  train_loss  valid_loss  accuracy
1      0.202413    0.136330    0.949460
2      0.102659    0.092405    0.970559
3      0.077279    0.069442    0.974975
4      0.064258    0.059099    0.979392
5      0.062370    0.058341    0.978901

torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
```

The GPU memory and RAM usage are as follows:

  • GPU memory: 725M
  • RAM: 2862M
  • Total time: 00:18

  2. Restart the kernel (Kernel -> Restart), then test .to_fp32().

Train MNIST_SAMPLE with .to_fp32()

```python
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy])
learn.fit_one_cycle(5)
for p in model.parameters():
    print(p.type())
```

```text
Total time: 00:18

epoch  train_loss  valid_loss  accuracy
1      0.205327    0.134014    0.951423
2      0.094773    0.072822    0.974975
3      0.066074    0.059597    0.978410
4      0.049261    0.045921    0.984789
5      0.045165    0.045733    0.985280

torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
```

The GPU memory and RAM usage are as follows:

  • GPU memory: 723M
  • RAM: 2868M
  • Total time: 00:18

Conclusion

Dataset: MNIST_SAMPLE

| method     | GPU memory | RAM   | Total time |
|------------|------------|-------|------------|
| to_fp16()  | 725M       | 2862M | 00:18      |
| to_fp32()  | 723M       | 2868M | 00:18      |

So, is there anything wrong? I cannot see any improvement in either GPU memory usage or training time…

In my understanding, you would need to compare at the pure-software level first, before trying to compare a whole chain like fastai, which sits on top of PyTorch, which sits on top of cuDNN, and so on.

Also, in my understanding, moving from fp32 to fp16 halves the size of each value, effectively doubling the memory available for your computations and lowering the bandwidth needed to finish a given job on the GPU.

Which version of PyTorch are you using: the nightly build (1.1) or the release (1.0)? Another thing: is your setup using LAPACK or not?

to_fp16 works fine! Try again, but between the two runs restart the kernel, or set the learner to None and call the garbage collector. Most likely the dataset is just too small for you to see any difference, IMO. Better to try a bigger dataset and see what happens.
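A minimal sketch of that reset, assuming `learn` and `model` are the variables from the training cell above:

```python
import gc
import torch

# drop the Python references to the learner and model, then force a GC pass
learn = None
model = None
gc.collect()

# return PyTorch's cached GPU memory to the driver so nvidia-smi reflects the release
torch.cuda.empty_cache()
```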

Jeremy just retweeted a link (credit: Sanyam Bhutani) that shows the differences between FP16/32 on various Resnet models. Indeed, the smaller ones show no benefit at all using MPT, while the larger ones can shave off up to a third of the training time. Of course, I wish that number were closer to half, but at least it’s something. Here’s the link:

Edit: I realize the article is more about the differences between the 1080ti and 2080ti, but for my purposes the 16/32 differences are much more interesting. Noteworthy is that even the 1080ti shows about a 20% improvement on FP16, despite the 1/64 crippling.

Thanks for the mention @crayoneater
I had intended to keep it both as a comparison as well as a highlight of MPT.

I had noticed a consistent 1.8x batch_size for all of the ResNets; however, the speedups only showed up in the later/"deeper" ones.

If you have any questions that I might be able to answer, I’ll try my best :slight_smile:

@init_27, look at this:

I wonder why this happens. The 1080ti should be heavily crippled (1/32) when it comes to FP16. Still, it achieves a speedup of around 20%. It's far from the 2080ti's 35%, but still substantial.

Furthermore, what about memory occupation on 1080ti running in mixed precision?

Thanks!

GTX cards do have FP16 support; however, they aren't optimized for FP16. Source: a Kaggle Noobs Slack discussion.

The memory consumption and GPU utilisation were absolutely similar.

Please let me know if you have any other questions.
Thanks!

Thanks, Sanyam.

As far as I understand, fp16 performance on Pascal cards is 1/32 of their fp32 performance (at least, that is what Pascal's specs declare). As soon as they hit the fp16 part of mixed precision training, their performance should drop dramatically, and so should their overall throughput relative to pure fp32.

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

And:

https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5

Moreover, the 1080ti's memory consumption should not be the same in fp16 and fp32, because regardless of performance, everything occupies less memory (or maybe you meant that memory consumption was absolutely similar between the two cards? But then how come the 1080ti didn't manage to run r152?)

Thanks in advance!
A.

Thanks for sharing.

I’m surprised too now :open_mouth:

It was quite a rudimentary experiment, but I increased the batch_size until the memory was full to the brim; even increasing the batch_size by 4 would cause an OOM. The same was the case for the 1080Ti, and no increase in bs was possible.
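A rough sketch of that kind of probing, assuming a hypothetical `make_learner(bs)` helper that rebuilds the DataBunch and Learner for a given batch size:

```python
import gc
import torch

def max_batch_size(make_learner, start_bs=16, step=4):
    """Grow bs until a CUDA OOM occurs and return the last size that trained."""
    bs, best = start_bs, None
    while True:
        learn = None
        try:
            learn = make_learner(bs)
            learn.fit(1)                  # one epoch is enough to reach peak memory
            best = bs
            bs += step
        except RuntimeError as e:         # PyTorch reports a CUDA OOM as a RuntimeError
            if 'out of memory' not in str(e):
                raise
            return best
        finally:
            learn = None
            gc.collect()
            torch.cuda.empty_cache()      # release cached GPU memory before the next attempt
```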

The 152 test wasn’t done because my friend couldn’t run it due to time constraints.

However, I'm interested in running more tests given what you've shared. If there are any suggestions/ideas you want run, I'll try checking them as well.

Regards,
Sanyam.

It would be really nice to compare against the Titan cards. Nvidia is claiming that the RTX 2080 Ti is crippled to 0.5x performance in mixed precision training; apparently it is not true.

I really would like to see a mixed precision training comparison between the RTX cards; actually, for the price of one Titan you get two 2080 Tis.

Hi Thomas, the RTX 20xx half-speed crippling in MPT is only on the 32-bit accumulate step. The actual FP16 part runs at full speed. Also, the RTX Titan does not have this crippling, only the 20xx cards.
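For context on what the software side of mixed precision training does (separate from that hardware detail), here is a minimal sketch of one training step with fp32 master weights and loss scaling. It is only an illustration of the general scheme that fastai's to_fp16() automates, not fastai's actual implementation; `master_params` is assumed to hold fp32 copies of the model parameters and `optimizer` is assumed to be built over them.

```python
import torch

def mixed_precision_step(model_fp16, master_params, optimizer, xb, yb, loss_func, loss_scale=512.0):
    """One sketched MPT step: fp16 forward/backward, fp32 weight update."""
    # forward and backward run in fp16 (the part the FP16 / tensor-core units handle)
    out = model_fp16(xb.half())
    loss = loss_func(out.float(), yb) * loss_scale   # scale up so small gradients don't underflow in fp16
    model_fp16.zero_grad()
    loss.backward()

    # copy the fp16 gradients onto the fp32 master weights and undo the scaling
    for p16, p32 in zip(model_fp16.parameters(), master_params):
        if p16.grad is not None:
            p32.grad = p16.grad.detach().float() / loss_scale

    optimizer.step()                                  # the optimizer only ever sees fp32 weights

    # copy the updated fp32 master weights back into the fp16 model for the next step
    for p16, p32 in zip(model_fp16.parameters(), master_params):
        p16.data.copy_(p32.data)
```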

Yes, I know. I have an RTX 2080Ti, and it is fast! But I have not tested it against the RTX Titan.

Hi,

I am new to the fast.ai library. I ran some experiments with .to_fp16() on a resnet50. I have a GeForce RTX 2080. I am only seeing an improvement of 6.5% in training time. I was hoping for something more substantial. I am wondering if I did something wrong.

My code looks like this:

```python
from fastai.vision import *
from fastai.metrics import error_rate
from fastai.callbacks import *
from fastai.utils.collect_env import *
from fastai.utils.mem import *
import time

show_install(0)

tfms = get_transforms(do_flip = False)
# `path` points to my image dataset folder (defined elsewhere, not shown here)
data = ImageDataBunch.from_folder(path, valid_pct = 0.20, ds_tfms = tfms, size = (176, 320), bs = 64)
data.normalize()

learn = cnn_learner(data, models.resnet50, metrics = error_rate).to_fp16()

t0 = time.perf_counter()

n = 5
learn.fit_one_cycle(n, max_lr = 0.001)
learn.unfreeze()
learn.fit_one_cycle(n, max_lr = slice(1e-6, 1e-4))

t1 = time.perf_counter()
print('Done in {:.2f} seconds.'.format(t1 - t0))
```
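As a sanity check that mixed precision is actually active (the same check done earlier in this thread with p.type()), you can print a parameter dtype:

```python
# should print torch.float16 when .to_fp16() is in effect, torch.float32 otherwise
print(next(learn.model.parameters()).dtype)
```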

My fp32 output is:

```text
=== Software === 
python        : 3.7.2
fastai        : 1.0.50.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7949MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.701275 0.268657 0.082833 00:50
1 0.338086 0.236618 0.069628 00:49
2 0.244279 0.185638 0.057023 00:50
3 0.193370 0.175866 0.055822 00:50
4 0.165301 0.179734 0.055222 00:50
Total time: 04:10
epoch train_loss valid_loss error_rate time
0 0.164791 0.169567 0.051621 00:55
1 0.145533 0.162993 0.051621 00:54
2 0.140403 0.162715 0.052221 00:55
3 0.117760 0.156770 0.048019 00:54
4 0.108653 0.154306 0.045618 00:55
Total time: 04:35
Done in 526.51 seconds.
```

My fp16 output is:

```text
Fri Apr 5 03:06:01 UTC 2019

=== Software === 
python        : 3.7.2
fastai        : 1.0.50.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7949MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.704658 0.298031 0.089436 00:49
1 0.344443 0.238623 0.072629 00:48
2 0.242520 0.206433 0.063625 00:48
3 0.193340 0.179305 0.055222 00:48
4 0.163160 0.184396 0.054022 00:48
Total time: 04:04
epoch train_loss valid_loss error_rate time
0 0.156445 0.173964 0.051020 00:49
1 0.151170 0.173405 0.052221 00:49
2 0.132254 0.168956 0.049820 00:49
3 0.110813 0.164140 0.048019 00:49
4 0.111009 0.167894 0.049220 00:49
Total time: 04:07
Done in 492.46 seconds.
```

Interesting. Note that the speedup is rather marginal (under 10%), while convergence lags a bit (0.154 vs 0.167 valid loss).

I'm still not fully convinced about fp16. I don't know whether the culprit is an immature Nvidia stack (driver/CUDA/cuDNN) or the libraries (PyTorch or fastai). Or both.

By the way, I am using nvidia-smi to track the memory usage. With .to_fp16, it’s 31% lower than with fp32.

```text
nvidia-smi --query-gpu=timestamp,utilization.memory,memory.free,memory.used --format=csv -l 5

fp32: 7905 (peak) - 589 (baseline) = 7316 MB
fp16: 5618 (peak) - 600 (baseline) = 5018 MB
```
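An alternative to polling nvidia-smi is to ask PyTorch for its own peak usage. A small sketch (it only counts memory managed by PyTorch's caching allocator, so the numbers come out somewhat lower than nvidia-smi's):

```python
import torch

# run training first (e.g. learn.fit_one_cycle(...)), then read PyTorch's own peak
peak_mb = torch.cuda.max_memory_allocated() / 2**20
print('peak GPU memory allocated: {:.0f} MiB'.format(peak_mb))

# newer PyTorch versions also have torch.cuda.reset_max_memory_allocated(),
# which resets the counter and makes back-to-back comparisons in one session easier
```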

One additional data point. I increased the number of pixels per image by 4X by doubling the height and width. I also reduced the batch size by 4X, from 64 to 16, to avoid running out of GPU memory.

With the larger images, fp16 is 31% faster.

FP32:

```text
=== Software === 
python        : 3.7.3
fastai        : 1.0.51
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7918MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.419564 0.327143 0.104442 02:07
1 0.288966 0.208411 0.063625 02:07
2 0.163462 0.169172 0.049220 02:08
3 0.154311 0.142385 0.041417 02:09
4 0.142003 0.134394 0.042017 02:09
Total time: 10:43
epoch train_loss valid_loss error_rate time
0 0.103146 0.137136 0.041417 02:54
1 0.108219 0.140815 0.040216 02:53
2 0.099841 0.127497 0.034814 02:54
3 0.071965 0.130131 0.037215 02:54
4 0.074660 0.126940 0.035414 02:54
Total time: 14:31
Done in 1515.05 seconds.
```

FP16:

```text
=== Software === 
python        : 3.7.3
fastai        : 1.0.51
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7918MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.422790 0.324509 0.080432 01:33
1 0.265328 0.188858 0.056423 01:32
2 0.179923 0.173154 0.049820 01:32
3 0.139162 0.139924 0.040816 01:32
4 0.139520 0.135217 0.038415 01:32
Total time: 07:42
epoch train_loss valid_loss error_rate time
0 0.124400 0.127679 0.040816 01:56
1 0.116763 0.128562 0.040816 01:56
2 0.105893 0.126331 0.043217 01:56
3 0.060227 0.123574 0.039616 01:56
4 0.048516 0.122354 0.039616 01:56
Total time: 09:42
Done in 1044.84 seconds.
```

But this is modest compared with the 10x advertised by Nvidia, or the 3x reported by Jeremy.

Does anyone know if the TITAN Xp GPU supports to_fp16() optimizations? Thanks!

No, the Titan Xp does not have tensor cores.
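If you want to check a given card from code: tensor cores require compute capability 7.0 or higher (Volta/Turing), which PyTorch can report. A quick sketch:

```python
import torch

# tensor cores (needed for a real fp16 speedup) require compute capability >= 7.0
major, minor = torch.cuda.get_device_capability(0)
print('compute capability: {}.{}'.format(major, minor))
print('tensor cores available' if major >= 7 else 'no tensor cores (e.g. Pascal cards like the Titan Xp)')
```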

Thanks! Looking to get another GPU in the near future. Will try it out then.