Comparison between .to_fp16() and .to_fp32() with MNIST_SAMPLE on RTX 2070

Test environment:

First, run the following commands to get the software and hardware info.

```python
from fastai.utils.collect_env import *
show_install(1)
```

```text
=== Software === 
python        : 3.6.6
fastai        : 1.0.39
fastprogress  : 0.1.18
torch         : 1.0.0
nvidia driver : 410.93
torch cuda    : 10.0.130 / is available
torch cudnn   : 7401 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7951MB | GeForce RTX 2070

=== Environment === 
platform      : Linux-4.15.0-43-generic-x86_64-with-debian-stretch-sid
distro        : Ubuntu 16.04 Xenial Xerus
conda env     : step
python        : /home/hogan/anaconda3/envs/step/bin/python
sys.path      : /home/hogan/anaconda3/envs/step/lib/python36.zip
/home/hogan/anaconda3/envs/step/lib/python3.6
/home/hogan/anaconda3/envs/step/lib/python3.6/lib-dynload

/home/hogan/anaconda3/envs/step/lib/python3.6/site-packages
/home/hogan/anaconda3/envs/step/lib/python3.6/site-packages/IPython/extensions
/home/hogan/.ipython

Thu Jan 24 20:55:14 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:05:00.0  On |                  N/A |
| 31%   30C    P8    16W / 185W |    394MiB /  7951MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1596      G   /usr/lib/xorg/Xorg                           167MiB |
|    0      4521      G                                                 13MiB |
|    0      4925      G   /usr/bin/gnome-shell                         125MiB |
|    0      8110      G   ...AAAAAAAAAIAAAAAAAAAAgAAAAAAAAA --servic    42MiB |
|    0     16638      G   ...-token=543973B623488F4E2BC9F280625EB8CA    33MiB |
+-----------------------------------------------------------------------------+

```

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Train MNIST_SAMPLE with .to_fp16()

  1. Run the following code in a Jupyter notebook.

```python
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy]).to_fp16()
learn.fit_one_cycle(5)
for p in model.parameters():
    print(p.type())
```

```text
Total time: 00:18

epoch  train_loss  valid_loss  accuracy
1      0.202413    0.136330    0.949460
2      0.102659    0.092405    0.970559
3      0.077279    0.069442    0.974975
4      0.064258    0.059099    0.979392
5      0.062370    0.058341    0.978901

torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
torch.cuda.HalfTensor
```

The GPU memory and RAM usage are as follows:

  • GPU memory: 725M
  • RAM: 2862M
  • Total time: 00:18

  2. Restart the kernel (Kernel -> Restart), then test .to_fp32().

Train MNIST_SAMPLE with .to_fp32()

```python
from fastai import *
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy])
learn.fit_one_cycle(5)
for p in model.parameters():
    print(p.type())
```

```text
Total time: 00:18

epoch  train_loss  valid_loss  accuracy
1      0.205327    0.134014    0.951423
2      0.094773    0.072822    0.974975
3      0.066074    0.059597    0.978410
4      0.049261    0.045921    0.984789
5      0.045165    0.045733    0.985280

torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
torch.cuda.FloatTensor
```

The GPU memory and RAM usage are as follows:

  • GPU memory: 723M
  • RAM: 2868M
  • Total time: 00:18

Conclusion

Dataset: MNIST_SAMPLE

| method     | GPU memory | RAM   | Total time |
|------------|------------|-------|------------|
| to_fp16()  | 725M       | 2862M | 00:18      |
| to_fp32()  | 723M       | 2868M | 00:18      |

So, is there anything wrong? I cannot see any improvement in either GPU memory usage or training time…

In my understanding, you would need to compare at the pure-software level first, before trying to compare a whole chain like fastai, which sits on top of PyTorch, which sits on top of cuDNN, and so on.

Also, in my understanding, moving from fp32 to fp16 halves the size of each value, effectively doubling the memory available for your computations and lowering the bandwidth needed to finish a given job on the GPU.

Which version of PyTorch are you using: the nightly build (1.1) or the release (1.0)? Another thing: is your setup using LAPACK or not?

to_fp16 works fine! Try again, but between the two runs restart the kernel, or set the learner to None and call the garbage collector. Most likely the dataset is just too small for you to see any difference, IMO. Better to try a bigger dataset and see what happens.
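A minimal sketch of that reset, assuming `learn` and `model` are the variables from the training cell above:

```python
import gc
import torch

# drop the Python references to the learner and model, then force a GC pass
learn = None
model = None
gc.collect()

# return PyTorch's cached GPU memory to the driver so nvidia-smi reflects the release
torch.cuda.empty_cache()
```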

Jeremy just retweeted a link (credit: Sanyam Bhutani) that shows the differences between FP16/32 on various Resnet models. Indeed, the smaller ones show no benefit at all using MPT, while the larger ones can shave off up to a third of the training time. Of course, I wish that number were closer to half, but at least it’s something. Here’s the link:

Edit: I realize the article is more about the differences between the 1080ti and 2080ti, but for my purposes the 16/32 differences are much more interesting. Noteworthy is that even the 1080ti shows about a 20% improvement on FP16, despite the 1/64 crippling.

Thanks for the mention @crayoneater
I had intended to keep it both as a comparison as well as a highlight of MPT.

I had noticed a consistent 1.8x batch_size for all of the ResNets; however, the speedups only showed up in the later/"deeper" ones.

If you have any questions that I might be able to answer, I’ll try my best :slight_smile:

@init_27, look at this:

I wonder why this happens. The 1080ti should be heavily crippled (1/32) when it comes to FP16. Still, it achieves a speedup of around 20%. It's far from the 2080ti's 35%, but still substantial.

Furthermore, what about memory occupation on 1080ti running in mixed precision?

Thanks!

GTX cards do have FP16 support; however, they aren't optimized for FP16. Source: a Kaggle Noobs Slack discussion.

The memory consumption and GPU utilisation were absolutely similar.

Please let me know if you have any other questions.
Thanks!

Thanks, Sanyam.

As far as I understand, fp16 performance on Pascal cards is 1/32 of their fp32 performance (at least, that is what Pascal's specs declare). As soon as they hit the fp16 part of mixed precision training, their performance should drop dramatically, and so should their overall throughput relative to pure fp32.

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

And:

https://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/5

Moreover, the 1080ti's memory consumption should not be the same in fp16 and fp32, because regardless of performance, everything occupies less memory (or maybe you meant that memory consumption was absolutely similar between the two cards? But then how come the 1080ti didn't manage to run r152?)

Thanks in advance!
A.

Thanks for sharing.

I’m surprised too now :open_mouth:

It was quite a rudimentary experiment, but I increased the batch_size until the memory was full to the brim; even increasing the batch_size by 4 would cause an OOM. The same was the case for the 1080Ti, and no increase in bs was possible.
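A rough sketch of that kind of probing, assuming a hypothetical `make_learner(bs)` helper that rebuilds the DataBunch and Learner for a given batch size:

```python
import gc
import torch

def max_batch_size(make_learner, start_bs=16, step=4):
    """Grow bs until a CUDA OOM occurs and return the last size that trained."""
    bs, best = start_bs, None
    while True:
        learn = None
        try:
            learn = make_learner(bs)
            learn.fit(1)                  # one epoch is enough to reach peak memory
            best = bs
            bs += step
        except RuntimeError as e:         # PyTorch reports a CUDA OOM as a RuntimeError
            if 'out of memory' not in str(e):
                raise
            return best
        finally:
            learn = None
            gc.collect()
            torch.cuda.empty_cache()      # release cached GPU memory before the next attempt
```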

The 152 test wasn’t done because my friend couldn’t run it due to time constraints.

However, I'm interested in running more tests given what you've shared. If there are any suggestions/ideas you want run, I'll try checking them as well.

Regards,
Sanyam.

It would be really nice to compare against the Titan cards. Nvidia is claiming that the RTX 2080 Ti is crippled to 0.5x performance in mixed precision training; apparently it is not true.

I really would like to see a mixed precision training comparison between the RTX cards; actually, for the price of one Titan you get two 2080 Tis.

Hi Thomas, the RTX 20xx half-speed crippling in MPT is only on the 32-bit accumulate step. The actual FP16 part runs at full speed. Also, the RTX Titan does not have this crippling, only the 20xx cards.
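For context on what the software side of mixed precision training does (separate from that hardware detail), here is a minimal sketch of one training step with fp32 master weights and loss scaling. It is only an illustration of the general scheme that fastai's to_fp16() automates, not fastai's actual implementation; `master_params` is assumed to hold fp32 copies of the model parameters and `optimizer` is assumed to be built over them.

```python
import torch

def mixed_precision_step(model_fp16, master_params, optimizer, xb, yb, loss_func, loss_scale=512.0):
    """One sketched MPT step: fp16 forward/backward, fp32 weight update."""
    # forward and backward run in fp16 (the part the FP16 / tensor-core units handle)
    out = model_fp16(xb.half())
    loss = loss_func(out.float(), yb) * loss_scale   # scale up so small gradients don't underflow in fp16
    model_fp16.zero_grad()
    loss.backward()

    # copy the fp16 gradients onto the fp32 master weights and undo the scaling
    for p16, p32 in zip(model_fp16.parameters(), master_params):
        if p16.grad is not None:
            p32.grad = p16.grad.detach().float() / loss_scale

    optimizer.step()                                  # the optimizer only ever sees fp32 weights

    # copy the updated fp32 master weights back into the fp16 model for the next step
    for p16, p32 in zip(model_fp16.parameters(), master_params):
        p16.data.copy_(p32.data)
```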

Yes, I know. I have an RTX 2080Ti, and it is fast! But I have not tested it against the RTX Titan.

Hi,

I am new to the fast.ai library. I ran some experiments with .to_fp16() on a resnet50. I have a GeForce RTX 2080. I am only seeing an improvement of 6.5% in training time. I was hoping for something more substantial. I am wondering if I did something wrong.

My code looks like this:

```python
from fastai.vision import *
from fastai.metrics import error_rate
from fastai.callbacks import *
from fastai.utils.collect_env import *
from fastai.utils.mem import *
import time

show_install(0)

tfms = get_transforms(do_flip = False)
# `path` points to my image dataset folder (defined elsewhere, not shown here)
data = ImageDataBunch.from_folder(path, valid_pct = 0.20, ds_tfms = tfms, size = (176, 320), bs = 64)
data.normalize()

learn = cnn_learner(data, models.resnet50, metrics = error_rate).to_fp16()

t0 = time.perf_counter()

n = 5
learn.fit_one_cycle(n, max_lr = 0.001)
learn.unfreeze()
learn.fit_one_cycle(n, max_lr = slice(1e-6, 1e-4))

t1 = time.perf_counter()
print('Done in {:.2f} seconds.'.format(t1 - t0))
```
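As a sanity check that mixed precision is actually active (the same check done earlier in this thread with p.type()), you can print a parameter dtype:

```python
# should print torch.float16 when .to_fp16() is in effect, torch.float32 otherwise
print(next(learn.model.parameters()).dtype)
```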

My fp32 output is:

```text
=== Software === 
python        : 3.7.2
fastai        : 1.0.50.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7949MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.701275 0.268657 0.082833 00:50
1 0.338086 0.236618 0.069628 00:49
2 0.244279 0.185638 0.057023 00:50
3 0.193370 0.175866 0.055822 00:50
4 0.165301 0.179734 0.055222 00:50
Total time: 04:10
epoch train_loss valid_loss error_rate time
0 0.164791 0.169567 0.051621 00:55
1 0.145533 0.162993 0.051621 00:54
2 0.140403 0.162715 0.052221 00:55
3 0.117760 0.156770 0.048019 00:54
4 0.108653 0.154306 0.045618 00:55
Total time: 04:35
Done in 526.51 seconds.
```

My fp16 output is:

```text
Fri Apr 5 03:06:01 UTC 2019

=== Software === 
python        : 3.7.2
fastai        : 1.0.50.post1
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7949MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.704658 0.298031 0.089436 00:49
1 0.344443 0.238623 0.072629 00:48
2 0.242520 0.206433 0.063625 00:48
3 0.193340 0.179305 0.055222 00:48
4 0.163160 0.184396 0.054022 00:48
Total time: 04:04
epoch train_loss valid_loss error_rate time
0 0.156445 0.173964 0.051020 00:49
1 0.151170 0.173405 0.052221 00:49
2 0.132254 0.168956 0.049820 00:49
3 0.110813 0.164140 0.048019 00:49
4 0.111009 0.167894 0.049220 00:49
Total time: 04:07
Done in 492.46 seconds.
```

Interesting. Note that the speedup is rather marginal (under 10%), while convergence lags a bit (0.154 vs 0.167 valid loss).

I'm still not fully convinced about fp16. I don't know whether the culprit is an immature Nvidia stack (driver/CUDA/cuDNN) or the libraries (PyTorch or fastai). Or both.

By the way, I am using nvidia-smi to track the memory usage. With .to_fp16, it’s 31% lower than with fp32.

```text
nvidia-smi --query-gpu=timestamp,utilization.memory,memory.free,memory.used --format=csv -l 5

fp32: 7905 (peak) - 589 (baseline) = 7316 MB
fp16: 5618 (peak) - 600 (baseline) = 5018 MB
```
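An alternative to polling nvidia-smi is to ask PyTorch for its own peak usage. A small sketch (it only counts memory managed by PyTorch's caching allocator, so the numbers come out somewhat lower than nvidia-smi's):

```python
import torch

# run training first (e.g. learn.fit_one_cycle(...)), then read PyTorch's own peak
peak_mb = torch.cuda.max_memory_allocated() / 2**20
print('peak GPU memory allocated: {:.0f} MiB'.format(peak_mb))

# newer PyTorch versions also have torch.cuda.reset_max_memory_allocated(),
# which resets the counter and makes back-to-back comparisons in one session easier
```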

One additional data point. I increased the number of pixels per image by 4X by doubling the height and width. I also reduced the batch size by 4X, from 64 to 16, to avoid running out of GPU memory.

With the larger images, fp16 is 31% faster.

FP32:

```text
=== Software === 
python        : 3.7.3
fastai        : 1.0.51
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7918MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.419564 0.327143 0.104442 02:07
1 0.288966 0.208411 0.063625 02:07
2 0.163462 0.169172 0.049220 02:08
3 0.154311 0.142385 0.041417 02:09
4 0.142003 0.134394 0.042017 02:09
Total time: 10:43
epoch train_loss valid_loss error_rate time
0 0.103146 0.137136 0.041417 02:54
1 0.108219 0.140815 0.040216 02:53
2 0.099841 0.127497 0.034814 02:54
3 0.071965 0.130131 0.037215 02:54
4 0.074660 0.126940 0.035414 02:54
Total time: 14:31
Done in 1515.05 seconds.
```

FP16:

```text
=== Software === 
python        : 3.7.3
fastai        : 1.0.51
fastprogress  : 0.1.20
torch         : 1.0.1.post2
nvidia driver : 418.56
torch cuda    : 10.0.130 / is available
torch cudnn   : 7402 / is enabled

=== Hardware === 
nvidia gpus   : 1
torch devices : 1
  - gpu0      : 7918MB | GeForce RTX 2080

=== Environment === 
platform      : Linux-4.15.0-47-generic-x86_64-with-debian-stretch-sid
distro        : #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
conda env     : Unknown
python        : /opt/conda/bin/python
sys.path      : /data
/opt/conda/lib/python37.zip
/opt/conda/lib/python3.7
/opt/conda/lib/python3.7/lib-dynload
/opt/conda/lib/python3.7/site-packages
/opt/conda/lib/python3.7/site-packages/IPython/extensions

Please make sure to include opening/closing ``` when you paste into forums/github to make the reports appear formatted as code sections.

Optional package(s) to enhance the diagnostics can be installed with:
pip install distro
Once installed, re-run this utility to get the additional information
epoch train_loss valid_loss error_rate time
0 0.422790 0.324509 0.080432 01:33
1 0.265328 0.188858 0.056423 01:32
2 0.179923 0.173154 0.049820 01:32
3 0.139162 0.139924 0.040816 01:32
4 0.139520 0.135217 0.038415 01:32
Total time: 07:42
epoch train_loss valid_loss error_rate time
0 0.124400 0.127679 0.040816 01:56
1 0.116763 0.128562 0.040816 01:56
2 0.105893 0.126331 0.043217 01:56
3 0.060227 0.123574 0.039616 01:56
4 0.048516 0.122354 0.039616 01:56
Total time: 09:42
Done in 1044.84 seconds.
```

But this is modest compared with the 10x advertised by Nvidia, or the 3x reported by Jeremy.

Does anyone know if the TITAN Xp GPU supports to_fp16() optimizations? Thanks!

No, the Titan Xp does not have tensor cores.
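If you want to check a given card from code: tensor cores require compute capability 7.0 or higher (Volta/Turing), which PyTorch can report. A quick sketch:

```python
import torch

# tensor cores (needed for a real fp16 speedup) require compute capability >= 7.0
major, minor = torch.cuda.get_device_capability(0)
print('compute capability: {}.{}'.format(major, minor))
print('tensor cores available' if major >= 7 else 'no tensor cores (e.g. Pascal cards like the Titan Xp)')
```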

Thanks! Looking to get another GPU in the near future. Will try it out then.