Mixed precision training

(Charl P. Botha) #21

I’m doing mixed precision training daily on my RTX 2070.

I just ran into some overflow issues using fastai’s fp16 support, but I’ve hacked the code a bit to use NVIDIA’s Apex amp for its dynamic loss scaling, and now this se-resnext50 is training without issues (so far) all in the 8GB of VRAM on the 2070.

fp16 / TensorCore support in CUDA10 and PyTorch means I can train much larger networks on this card than would be otherwise possible. In my more informal testing, there’s also a speed boost going from fp32 to fp16.
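For context on why fp16 both halves memory use and runs into the overflow issues mentioned above: float16 stores 2 bytes per element but its largest finite value is only 65504, so any loss or gradient beyond that range becomes inf. A tiny NumPy sketch (illustrative only, not the training code from this thread):

```python
import numpy as np

# float16 has a much smaller representable range than float32:
# the largest finite float16 value is 65504.
fp16_max = np.finfo(np.float16).max
assert fp16_max == 65504.0

# Values beyond that range overflow to infinity when cast down,
# which is exactly what (dynamic) loss scaling has to guard against.
big = np.float32(1e5)
assert np.isinf(np.float16(big))

# Half precision also halves memory per element: 2 bytes vs 4.
a32 = np.zeros(1024, dtype=np.float32)
a16 = a32.astype(np.float16)
assert a16.nbytes == a32.nbytes // 2
```

This is why a model that overflows in plain fp16 can still fit roughly twice the batch size of its fp32 counterpart once loss scaling keeps the gradients in range.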

2 Likes

(Thomas) #22

Are these new RTX cards fast?

0 Likes

(Shiv Gowda) #23

For one of my workloads, the RTX 2070 is around 5-10% slower than a 1080 Ti and around 50% slower than a V100. In terms of price/performance, I think it is the best deep learning GPU on the market right now.

Thanks @sgugger for the fp16 enhancement, I am able to train much larger models now (previously I could not train on 512x512 images because I could only fit a batch size of 8; now I can train that dataset with a batch size of 56). There was a batch norm bug (most likely in PyTorch) which prevented me from using it last week, but the bug seems to have been fixed in the new PyTorch 1.0 release, and I can clearly see the value of this feature.

2 Likes

(Mohamed Hassan Elabasiri) #24

I am using fast.ai and PyTorch 1.0 and it’s working fine on my GTX 980 Ti.

0 Likes

(Andrea de Luca) #25

@cpbotha, would you share your apex integration with pytorch/fastai?

Apart from that, I’m trying to get mixed precision training working on a Tesla V100 (CUDA 10, driver 415, fastai 1.0.42). No success: my losses are always NaN, from the very first epoch.

Thanks.

1 Like

(Charl P. Botha) #26

Your request motivated me to quickly write up the procedure (it was on my todo list; I just hadn’t gotten around to it). Read it here: https://vxlabs.com/2019/02/04/improving-fastais-mixed-precision-support-with-nvidias-automatic-mixed-precision/ and let me know if you find any issues. (Note the epsilon fix!)

7 Likes

(Andrea de Luca) #27

Thanks Charl! :slight_smile: But which nvcc version did you use? I get the following error while trying to compile Apex:

/usr/local/cuda/include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
  ^~~~~
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

My nvcc is release 9.0, V9.0.176
Thanks!

0 Likes

(Charl P. Botha) #28

I did this with the CUDA 10 builds of PyTorch.

0 Likes

(Andrea de Luca) #29

Me too. I’ve opened an issue on Apex’s GitHub!

0 Likes

(Charl P. Botha) #30

In your comment, you mention nvcc release 9.0?

On my side, I see this:

$ /usr/local/cuda-10.0/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
1 Like

(Michael) #31

I’m seeing really bad performance drops with fastai’s to_fp16 on the lesson 3 planet notebook. I am using an RTX 2070 with 8GB of VRAM and CUDA 10. With the same code except for the to_fp16 call, I see a halved F-score on that ResNet-50 model. I’ve also tried vanilla PyTorch with NVIDIA’s Apex to train a ResNet-34 in fp16; that gave me validation accuracies within 0.1% of fp32, so mixed precision should work fine on my setup. I’m wondering what fastai is doing wrong here.

0 Likes

(Andrea de Luca) #32

Were you able to compile Apex with the CUDA extensions? If yes, which versions of gcc and nvcc do you have? Thanks.

It could be the absence of loss scaling. Not sure, though.

0 Likes

(Michael) #33

I’ve done some more testing with loss scales of 128, 1024, the default 512, and dynamic loss scaling, without success. I opened an issue here. I wonder if fastai offers some way of printing out gradients during training for debugging purposes.

1 Like

#34

Like I said on the issue, I didn’t manage to reproduce it. Note that you shouldn’t pass any loss_scale; use dynamic loss scaling instead, as it works better.

1 Like

(Andrea de Luca) #35

So, dynamic loss scaling is actually implemented. Thanks.

For me, FP16 works rather well (vanilla fastai), but convergence is a bit delayed w.r.t. FP32, or w.r.t. a fastai env in which Apex is also installed. Tested on a Tesla V100 and a 1080 Ti.

0 Likes

#36

Yes, it’s actually the default in the callback, but not in the to_fp16 function, I just realized. Just fixed it in master, so it’s now the default everywhere.

Note that you will see a few iterations with no training because of the way dynamic loss scaling works: it starts with a really high scale that is halved as long as your gradients overflow.
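The halving behaviour described above can be sketched in a few lines of plain Python. This is a hypothetical `DynamicLossScaler` class for illustration, not fastai’s or Apex’s actual implementation; the initial scale and growth interval are made-up values:

```python
import math

class DynamicLossScaler:
    """Hypothetical sketch of dynamic loss scaling: start with a very
    large scale, halve it whenever the scaled gradients overflow (and
    skip that step), and double it again after a run of stable steps."""

    def __init__(self, init_scale=2.0 ** 24, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._stable_steps = 0

    def update(self, grads):
        """Return True if this step's gradients are usable."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            self.scale /= 2          # back off and skip this step
            self._stable_steps = 0
            return False
        self._stable_steps += 1
        if self._stable_steps >= self.growth_interval:
            self.scale *= 2          # probe a larger scale again
            self._stable_steps = 0
        return True

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
# Two overflowing steps halve the scale twice (1024 -> 512 -> 256)...
assert scaler.update([float("inf")]) is False
assert scaler.update([float("nan")]) is False
assert scaler.scale == 256.0
# ...then three clean steps in a row double it back to 512.
for _ in range(3):
    scaler.update([0.5, -1.0])
assert scaler.scale == 512.0
```

The skipped overflowing steps at the start are exactly the “few iterations with no training” described above.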

2 Likes

(Andrea de Luca) #37

Perfect, now it should work without any delay. I was experiencing that delay since I always used to_fp16.

Thanks!

0 Likes

(Michael) #38

Oh, by the way: for Apex, I’m using gcc 8.2.1 and CUDA 10. They work perfectly.

1 Like

(Michael) #39

So, the bug was in lr_find(), and the dev build from the master branch should no longer have this issue. Here’s the issue that describes the bug: https://github.com/fastai/fastai/issues/1903

Thanks @sgugger for fixing it.

1 Like

#40

I have tested fp16 using:

learn = language_model_learner(data_lm, TransformerXL).to_fp16()
learn = language_model_learner(data_lm, TransformerXL).to_fp16(dynamic=False)
learn = language_model_learner(data_lm, TransformerXL)

with 1,000 training rows and 100 validation rows:

fp16, dynamic=True: time = 04:23
fp16, dynamic=False: time = 04:20
no fp16: time = 00:51

Why is it slower in fp16 mode?
Thanks!

0 Likes