I’m doing mixed precision training daily on my RTX 2070.
I just ran into some overflow issues using fastai’s fp16 support, but I’ve hacked the code a bit to use NVIDIA’s Apex amp for its dynamic loss scaling, and now this se-resnext50 is training without issues (so far) all in the 8GB of VRAM on the 2070.
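In case it helps anyone, the Apex side of that hack looks roughly like the sketch below. This is only a minimal illustration of wiring amp's dynamic loss scaling into a plain PyTorch-style loop, not my actual fastai integration; `data_loader` is a placeholder, and the `amp.initialize` interface may differ a bit between Apex versions:

```python
import torch
import torch.nn as nn
from torchvision import models
from apex import amp  # NVIDIA Apex, built from https://github.com/NVIDIA/apex

model = models.resnet34(num_classes=10).cuda()            # stand-in for the se-resnext50
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# O1 patches common ops to run in fp16; loss_scale="dynamic" turns on dynamic loss scaling.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")

for inputs, targets in data_loader:                        # data_loader: your usual DataLoader
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    # amp scales the loss so small fp16 gradients don't underflow, and it
    # lowers the scale / skips the update automatically when gradients overflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
```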
fp16 / TensorCore support in CUDA 10 and PyTorch means I can train much larger networks on this card than would otherwise be possible. In my informal testing, there’s also a speed boost going from fp32 to fp16.
For one of my workloads the RTX 2070 is around 5-10% slower than a 1080 Ti and around 50% slower than a V100. In terms of price-to-performance, I think it is the best deep learning GPU on the market right now.
Thanks @sgugger for the fp16 enhancement, I am able to train much larger models now (previously I could not train on 512x512 images because I could only fit a batch size of 8; now I can train that dataset with a batch size of 56). There was a batch norm bug (most likely in PyTorch) which prevented me from using it last week, but the bug seems to have been fixed in the new PyTorch 1.0 release and I can clearly see the value of this feature.
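For reference, enabling it is basically a one-line change in fastai v1. A minimal sketch, not the exact code I’m running (the path is a placeholder, and depending on your 1.0.x version the learner constructor is create_cnn or cnn_learner):

```python
from fastai.vision import *

path = Path('data/my_512px_dataset')   # placeholder: folder with train/valid subfolders
data = ImageDataBunch.from_folder(path, ds_tfms=get_transforms(), size=512, bs=56)
data = data.normalize(imagenet_stats)

# .to_fp16() is the whole change: the mixed precision callback keeps fp32 master
# weights while activations and gradients are computed in half precision.
learn = create_cnn(data, models.resnet50, metrics=accuracy).to_fp16()
learn.fit_one_cycle(5)
```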
@cpbotha, would you share your apex integration with pytorch/fastai?
Apart from that, I’m trying to get mixed precision training working on a Tesla V100 (CUDA 10, driver 415, fastai 1.0.42). No success: my losses have been NaN since the very first epoch.
Thanks Charl! But which nvcc version did you use? I get the following error while trying to compile apex:
/usr/local/cuda/include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
#error -- unsupported GNU version! gcc versions later than 6 are not supported!
^~~~~
error: command '/usr/local/cuda/bin/nvcc' failed with exit status 1
I’m seeing really bad performance drops with fastai’s to_fp16 on their lesson 3 planet notebook. I am using an RTX 2070 with 8GB of VRAM and CUDA 10. I am seeing a halved F-score with the same code except for the to_fp16 bit on that resnet-50 model. I’ve tried using vanilla PyTorch with NVIDIA’s Apex to train a resnet-34 in fp16, and it gave me validation accuracies within 0.1%, which means mixed precision should work fine on my setup. I’m wondering what fastai has done wrong here.
I’ve done some more testing with loss scales of 128, 1024, and the default 512, as well as dynamic loss scaling, without success. I opened an issue here. I wonder if fastai offers some way of printing out gradients during training for debugging purposes.
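I don’t know of a built-in flag for that, but a small callback can print gradient statistics after every backward pass. A minimal sketch using fastai v1’s callback hooks (GradStats and the norm printing are mine, not something fastai ships):

```python
import torch
from fastai.basic_train import LearnerCallback

class GradStats(LearnerCallback):
    "Print min/max gradient norms after each backward pass to spot over/underflow."
    def on_backward_end(self, **kwargs):
        norms = [p.grad.float().norm().item()
                 for p in self.learn.model.parameters() if p.grad is not None]
        if norms:
            print(f"grad norm min {min(norms):.3e}  max {max(norms):.3e}")

# usage (very verbose, debugging only):
# learn.fit_one_cycle(1, callbacks=[GradStats(learn)])
```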
So, dynamic loss scaling is actually implemented. Thanks.
For me, FP16 works rather well (vanilla fastai), but convergence is a bit delayed compared to FP32, or compared to a fastai environment in which Apex is also installed. Tested on a Tesla V100 and a 1080 Ti.
Yes, it’s actually the default in the Callback but not in the to_fp16 function, I just realized. Just fixed it in master so it’s the default everywhere now.
Note that you will see a few iterations with no training because of the way dynamic loss scaling works: it starts with a really high scale that is divided by 2 as long as your gradients overflow.
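For anyone curious what that means mechanically, here is a rough sketch of the idea in plain PyTorch (not fastai’s or Apex’s actual implementation; real versions also grow the scale back up after a long run of overflow-free steps):

```python
import torch

def step_with_dynamic_scaling(model, loss, optimizer, state):
    "One step: scale the loss, then halve the scale and skip the update on overflow."
    optimizer.zero_grad()
    (loss * state["scale"]).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    if any(not torch.isfinite(g).all() for g in grads):
        state["scale"] /= 2                # scale too high: halve it, skip this update
        return False
    for g in grads:
        g.div_(state["scale"])             # unscale gradients before the real step
    optimizer.step()
    return True

# Start with a very high scale; the first few batches just shrink it until
# gradients stop overflowing, which is why you see "no training" at first.
state = {"scale": 2.0 ** 24}
```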
So, the bug is in lr_find(), and the dev build from the master branch should now have no issues anymore. Here’s the issue that refers to the bug: https://github.com/fastai/fastai/issues/1903