Mixed precision training

(Charl P. Botha) #21

I’m doing mixed precision training daily on my RTX 2070.

I just ran into some overflow issues using fastai’s fp16 support, but I’ve hacked the code a bit to use NVIDIA’s Apex amp for its dynamic loss scaling, and now this se-resnext50 is training without issues (so far) all in the 8GB of VRAM on the 2070.
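
For readers unfamiliar with the term, dynamic loss scaling can be sketched in a few lines of plain Python. This is an illustration of the general idea only, not Apex's actual code: the class name, parameter names, and default values below are all assumptions made for the example. The loss (and so the gradients) is multiplied by a scale factor; when an overflow (inf/NaN) shows up in the gradients the step is skipped and the scale is reduced, and after a run of clean steps the scale is increased again.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (hypothetical names, not Apex's implementation)."""

    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0,
                 growth_interval=2000):
        self.scale = init_scale
        self.backoff = backoff                  # multiplier applied on overflow
        self.growth = growth                    # multiplier applied after clean steps
        self.growth_interval = growth_interval  # clean steps needed before growing
        self.good_steps = 0

    def unscale(self, scaled_grads):
        """Divide the scaled gradients by the current scale."""
        return [g / self.scale for g in scaled_grads]

    def update(self, scaled_grads):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow detected: shrink the scale and skip this step.
            self.scale *= self.backoff
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth
            self.good_steps = 0
        return True


# Tiny demo: one overflowing step followed by two clean ones.
scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
history = []
for grads in ([float("inf"), 1.0], [0.5, 1.0], [0.25, 2.0]):
    applied = scaler.update(grads)
    history.append((applied, scaler.scale))
```

In Apex this bookkeeping is hidden behind `amp.initialize` and the `amp.scale_loss` context manager, so user code does not normally implement it by hand.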

fp16 / TensorCore support in CUDA10 and PyTorch means I can train much larger networks on this card than would otherwise be possible. In my informal testing, there’s also a speed boost going from fp32 to fp16.

(Thomas) #22

Are these new RTX cards fast?

(Shiv Gowda) #23

For one of my workloads, the RTX 2070 is around 5-10% slower than a 1080 Ti and around 50% slower than a V100. In terms of value for money, I think it is the best deep learning GPU on the market right now.

Thanks @sgugger for the fp16 enhancement, I am able to train much larger models now (previously I could not train on 512x512 images because I could only fit a batch size of 8; now I can train on that dataset with a batch size of 56). There was a batch norm bug (most likely in PyTorch) which prevented me from using it last week, but in the new PyTorch 1.0 release the bug seems to have been fixed and I can clearly see the value of this feature.

(Mohamed Hassan Elabasiri) #24

I am using fast.ai and PyTorch 1.0 and it’s working fine on my GTX 980 Ti.

(Andrea de Luca) #25

@cpbotha, would you share your apex integration with pytorch/fastai?

Apart from that, I’m trying to get mixed precision training working on a Tesla V100 (CUDA 10, drivers 415, fastai 1.0.42). No success: my losses have been NaN from the very first epoch.

(Charl P. Botha) #26

Your request motivated me to quickly write up the procedure (it was on my todo list, just hadn’t gotten around to it). Read here https://vxlabs.com/2019/02/04/improving-fastais-mixed-precision-support-with-nvidias-automatic-mixed-precision/ and let me know if you find any issues. (note the epsilon fix!)
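
As general background on why fp16 runs can end up with NaN losses, and why a small epsilon can matter, here is a minimal standard-library sketch of the fp16 value range. It is not a diagnosis of the V100 problem above, and the thread does not spell out what the epsilon fix is; the link to a value such as 1e-8 is an assumption made here for illustration. The helper `to_fp16` is a hypothetical name that round-trips a Python float through IEEE 754 half precision using the `struct` module's `e` format.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack finite values that are too large for fp16;
        # return a signed infinity instead, as fp16 arithmetic would.
        return math.copysign(math.inf, x)


largest = to_fp16(65504.0)        # largest finite fp16 value, kept exactly
overflowed = to_fp16(70000.0)     # too large for fp16: becomes inf
tiny = to_fp16(1e-8)              # too small for fp16: flushed to 0.0
rescued = to_fp16(1e-8 * 1024.0)  # the same value scaled up survives as non-zero
```

The last two lines show the motivation for loss scaling: a value that vanishes in fp16 can survive if it is multiplied by a scale factor first and divided back out in higher precision later.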

(Andrea de Luca) #27

Thanks Charl! :slight_smile: But which nvcc version did you use? I get the following error while trying to compile apex:

/usr/local/cuda/include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1

My nvcc is release 9.0, V9.0.176

(Charl P. Botha) #28

I did this with the CUDA10 builds of pytorch.

(Andrea de Luca) #29

Me too. I did open an issue on apex’s GitHub!

(Charl P. Botha) #30

In your comment, you mention nvcc release 9.0?

On my side, I see this:

$ /usr/local/cuda-10.0/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
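
A quick way to spot this kind of mismatch is to pull the release number out of the `nvcc --version` text and compare it with the CUDA version PyTorch was built against. The sketch below is only an illustration: `parse_nvcc_release` and `same_cuda_release` are hypothetical helper names, and the sample text is the output quoted in this post. In a real check, the second argument would typically come from `torch.version.cuda`.

```python
import re

SAMPLE_NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
"""


def parse_nvcc_release(nvcc_output):
    """Return (major, minor) from `nvcc --version` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def same_cuda_release(nvcc_output, torch_cuda):
    """Compare the nvcc release with a version string such as "10.0"."""
    release = parse_nvcc_release(nvcc_output)
    if release is None or not torch_cuda:
        return False
    # Only major.minor is compared; any further components are ignored.
    parts = torch_cuda.split(".")
    return release == (int(parts[0]), int(parts[1]))


release = parse_nvcc_release(SAMPLE_NVCC_OUTPUT)
matches_cuda10 = same_cuda_release(SAMPLE_NVCC_OUTPUT, "10.0")
matches_cuda9 = same_cuda_release(SAMPLE_NVCC_OUTPUT, "9.0")
```

With an nvcc reporting release 9.0 and a CUDA10 build of PyTorch, the comparison would come out false, which is the situation the question above is probing.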