Mixed precision training

(Charl P. Botha) #21

I’m doing mixed precision training daily on my RTX 2070.

I just ran into some overflow issues using fastai’s fp16 support, but I’ve hacked the code a bit to use NVIDIA’s Apex amp for its dynamic loss scaling, and now this se-resnext50 is training without issues (so far) all in the 8GB of VRAM on the 2070.

fp16 / TensorCore support in CUDA10 and PyTorch means I can train much larger networks on this card than would be otherwise possible. In my more informal testing, there’s also a speed boost going from fp32 to fp16.
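For context on why fp16 both halves memory use and runs into the overflow issues mentioned above: float16 stores 2 bytes per element but its largest finite value is only 65504, so any loss or gradient beyond that range becomes inf. A tiny NumPy sketch (illustrative only, not the training code from this thread):

```python
import numpy as np

# float16 has a much smaller representable range than float32:
# the largest finite float16 value is 65504.
fp16_max = np.finfo(np.float16).max
assert fp16_max == 65504.0

# Values beyond that range overflow to infinity when cast down,
# which is exactly what (dynamic) loss scaling has to guard against.
big = np.float32(1e5)
assert np.isinf(np.float16(big))

# Half precision also halves memory per element: 2 bytes vs 4.
a32 = np.zeros(1024, dtype=np.float32)
a16 = a32.astype(np.float16)
assert a16.nbytes == a32.nbytes // 2
```

This is why a model that overflows in plain fp16 can still fit roughly twice the batch size of its fp32 counterpart once loss scaling keeps the gradients in range.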

2 Likes

(Thomas) #22

Are these new RTX cards fast?

0 Likes

(Shiv Gowda) #23

For one of my workloads, the RTX 2070 is around 5-10% slower than a 1080 Ti and around 50% slower than a V100. In terms of price/performance, I think it is the best deep learning GPU on the market right now.

Thanks @sgugger for the fp16 enhancement, I am able to train much larger models now (previously I could not train on 512x512 images because I could only fit a batch size of 8; now I can train that dataset with a batch size of 56). There was a batch norm bug (most likely in PyTorch) which prevented me from using it last week, but the bug seems to have been fixed in the new PyTorch 1.0 release, and I can clearly see the value of this feature.

2 Likes

(Mohamed Hassan Elabasiri) #24

I am using fast.ai and PyTorch 1.0 and it’s working fine on my GTX 980 Ti.

0 Likes

(Andrea de Luca) #25

@cpbotha, would you share your apex integration with pytorch/fastai?

Apart from that, I’m trying to get mixed precision training working on a Tesla V100 (CUDA 10, driver 415, fastai 1.0.42). No success: my losses are always NaN, from the very first epoch.

Thanks.

1 Like

(Charl P. Botha) #26

Your request motivated me to quickly write up the procedure (it was on my todo list; I just hadn’t gotten around to it). Read it here: https://vxlabs.com/2019/02/04/improving-fastais-mixed-precision-support-with-nvidias-automatic-mixed-precision/ and let me know if you find any issues. (Note the epsilon fix!)

7 Likes

(Andrea de Luca) #27

Thanks Charl! :slight_smile: But which nvcc version did you use? I get the following error while trying to compile Apex:

/usr/local/cuda/include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
  ^~~~~
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

My nvcc is release 9.0, V9.0.176
Thanks!

0 Likes

(Charl P. Botha) #28

I did this with the CUDA 10 builds of PyTorch.

0 Likes

(Andrea de Luca) #29

Me too. I’ve opened an issue on Apex’s GitHub!

0 Likes

(Charl P. Botha) #30

In your comment, you mention nvcc release 9.0?

On my side, I see this:

$ /usr/local/cuda-10.0/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
1 Like

(Michael) #31

I’m seeing really bad performance drops with fastai’s to_fp16 on the lesson 3 planet notebook. I am using an RTX 2070 with 8GB of VRAM and CUDA 10. With the same code except for the to_fp16 call, I see a halved F-score on that ResNet-50 model. I’ve also tried vanilla PyTorch with NVIDIA’s Apex to train a ResNet-34 in fp16; that gave me validation accuracies within 0.1% of fp32, so mixed precision should work fine on my setup. I’m wondering what fastai is doing wrong here.

0 Likes

(Andrea de Luca) #32

Were you able to compile Apex with the CUDA extensions? If yes, which versions of gcc and nvcc do you have? Thanks.

It could be the absence of loss scaling. Not sure, though.

0 Likes

(Michael) #33

I’ve done some more testing with loss scales of 128, 1024, the default 512, and dynamic loss scaling, without success. I opened an issue here. I wonder if fastai offers some way of printing out gradients during training for debugging purposes.

1 Like

#34

Like I said on the issue, I didn’t manage to reproduce it. Note that you shouldn’t pass any loss_scale; use dynamic loss scaling instead, as it works better.

1 Like

(Andrea de Luca) #35

So, dynamic loss scaling is actually implemented. Thanks.

For me, FP16 works rather well (vanilla fastai), but convergence is a bit delayed w.r.t. FP32, or w.r.t. a fastai env in which Apex is also installed. Tested on a Tesla V100 and a 1080 Ti.

0 Likes

#36

Yes, it’s actually the default in the callback, but not in the to_fp16 function, I just realized. Just fixed it in master, so it’s now the default everywhere.

Note that you will see a few iterations with no training because of the way dynamic loss scaling works: it starts with a really high scale that is halved as long as your gradients overflow.
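The halving behaviour described above can be sketched in a few lines of plain Python. This is a hypothetical `DynamicLossScaler` class for illustration, not fastai’s or Apex’s actual implementation; the initial scale and growth interval are made-up values:

```python
import math

class DynamicLossScaler:
    """Hypothetical sketch of dynamic loss scaling: start with a very
    large scale, halve it whenever the scaled gradients overflow (and
    skip that step), and double it again after a run of stable steps."""

    def __init__(self, init_scale=2.0 ** 24, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._stable_steps = 0

    def update(self, grads):
        """Return True if this step's gradients are usable."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            self.scale /= 2          # back off and skip this step
            self._stable_steps = 0
            return False
        self._stable_steps += 1
        if self._stable_steps >= self.growth_interval:
            self.scale *= 2          # probe a larger scale again
            self._stable_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
# Two overflowing steps halve the scale twice (1024 -> 512 -> 256)...
assert scaler.update([float("inf")]) is False
assert scaler.update([float("nan")]) is False
assert scaler.scale == 256.0
# ...then three clean steps in a row double it back to 512.
for _ in range(3):
    scaler.update([0.5, -1.0])
assert scaler.scale == 512.0
```

The skipped overflowing steps at the start are exactly the “few iterations with no training” described above.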

2 Likes

(Andrea de Luca) #37

Perfect, now it should work without any delay. I was experiencing that delay since I always used to_fp16.

Thanks!

0 Likes

(Michael) #38

Oh, by the way: for Apex, I’m using gcc 8.2.1 and CUDA 10. They work perfectly.

1 Like

(Michael) #39

So, the bug was in lr_find(), and the dev build from the master branch should no longer have this issue. Here’s the issue that describes the bug: https://github.com/fastai/fastai/issues/1903

Thanks @sgugger for fixing it.

1 Like

#40

I have tested fp16 using:

learn = language_model_learner(data_lm, TransformerXL).to_fp16()
learn = language_model_learner(data_lm, TransformerXL).to_fp16(dynamic=False)
learn = language_model_learner(data_lm, TransformerXL)

with 1,000 training rows and 100 validation rows:

fp16, dynamic=True: time = 04:23
fp16, dynamic=False: time = 04:20
no fp16: time = 00:51

Why is it slower in fp16 mode?
Thanks!

0 Likes