Known issues & nuances with multi-GPU & fp16 training?

Been experimenting with fp16 and multi-GPU training, and I’ve run into a number of issues. I’m using the latest commit from the repo directly, which should have more fixes in it than “pip install fastai==2.0.0”.

First off, my understanding is that fp16 is supposed to roughly halve the memory requirements of a model, and if you’re running on newer GPUs with tensor cores, you should get a 2-3X speedup.

EDIT: It’s more complicated; see my follow-up post below.

Thus far on single-GPU jobs:

  • learn.to_fp16() does halve the memory requirements (I can run roughly a 2X batch size), but there is no speedup at all.

  • learn.to_native_fp16() does not halve the memory requirement and there is no speedup. In practice, it seems to do nothing compared to normal fp32 training (see the sketch after this list for how I’m enabling each mode and measuring).
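
Here’s a minimal sketch of the harness I’m using to compare the three modes, in case anyone wants to reproduce. PETS, resnet34, the batch size and the regex are just placeholders for whatever data/architecture you’re testing; the only fastai-specific bits are to_fp16() and to_native_fp16(), and peak memory is read back via torch.cuda.max_memory_allocated():

    import time
    import torch
    from fastai.vision.all import *

    # Placeholder data/arch -- swap in whatever you're actually benchmarking
    path = untar_data(URLs.PETS)
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path/"images"),
        pat=r'^(.*)_\d+\.jpg$', item_tfms=Resize(224), bs=64)

    def run(mode):
        learn = cnn_learner(dls, resnet34, metrics=accuracy)
        if mode == "fp16":     learn = learn.to_fp16()         # fastai's own loss-scaling callback
        elif mode == "native": learn = learn.to_native_fp16()  # torch.cuda.amp under the hood
        torch.cuda.reset_peak_memory_stats()
        t0 = time.time()
        learn.fit_one_cycle(1)
        print(mode, f"{time.time()-t0:.1f}s",
              f"{torch.cuda.max_memory_allocated()/1e9:.2f} GB peak")

    for mode in ("fp32", "fp16", "native"):
        run(mode)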

For multi-GPU jobs using DataParallel (not DistributedDataParallel):

  • learn.to_fp16() reduces the memory requirements somewhat. On a twin-GPU setup, GPU #1 uses more memory than GPU #2, so it’s more like a 1.5-2.0X increase in batch size. As before, there is no speedup at all.

  • learn.to_fp16() is very much YMMV. I had the same code, same container, same drivers crash on a different machine… and I do not know why. The error message suggests the data may not be getting sent to the right device:

    RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

  • learn.to_native_fp16() does not halve the memory requirement and there is no speedup. It does not crash in parallel mode, but again, I’m not sure it’s working properly (a multi-GPU sketch follows this list).
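
For the parallel runs, here’s roughly what I’m doing; it’s a sketch, not my exact code. Wrapping learn.model in nn.DataParallel is the lowest-level way to reproduce it (fastai also ships a parallel-training callback in fastai.distributed, which should amount to the same thing), and again PETS/resnet34 are placeholders:

    import torch
    from torch import nn
    from fastai.vision.all import *

    # Placeholder data/arch, same as the single-GPU runs
    path = untar_data(URLs.PETS)
    dls = ImageDataLoaders.from_name_re(
        path, get_image_files(path/"images"),
        pat=r'^(.*)_\d+\.jpg$', item_tfms=Resize(224), bs=128)

    learn = cnn_learner(dls, resnet34, metrics=accuracy).to_fp16()

    # Plain nn.DataParallel wrap; device 0 gathers the outputs, which is
    # presumably why the first GPU ends up using more memory than the second
    if torch.cuda.device_count() > 1:
        learn.model = nn.DataParallel(learn.model, device_ids=[0, 1])

    learn.fit_one_cycle(1)

    for i in range(torch.cuda.device_count()):
        print(i, f"{torch.cuda.max_memory_allocated(i)/1e9:.2f} GB peak")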

In short, learn.to_fp16() seems to be the best option, but there is still no speedup from tensor cores (why?), and it sometimes causes issues with multi-GPU training.

I’ve seen some folks here mention APEX, and since it’s supported by NVIDIA it looks like a better option, but it seems there is no APEX fp16 support for fastai v2?
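
For reference, this is roughly what the APEX amp API looks like in bare PyTorch (the model/optimizer below are just toy stand-ins); hooking it up to a Learner would presumably need a custom callback, which is the part that doesn’t seem to exist for fastai v2:

    # Bare-PyTorch sketch of NVIDIA APEX amp -- not wired into fastai
    import torch
    from torch import nn, optim
    from apex import amp  # requires building/installing NVIDIA apex separately

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
    optimizer = optim.SGD(model.parameters(), lr=1e-2)

    # "O1" = mixed precision with automatic casting, APEX's recommended default
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(64, 512).cuda()
    y = torch.randint(0, 10, (64,)).cuda()
    loss = nn.functional.cross_entropy(model(x), y)

    # APEX scales the loss so fp16 gradients don't underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()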

Is your experience consistent with mine? Or have you found a better way to do multi-GPU fp16 training?

OK, apparently fp16 speedup and memory reduction are specific to the network architecture!

Results from my ongoing single-GPU runs:

  • On one architecture, learn.to_fp16() gave a memory reduction but no speedup, while learn.to_native_fp16() gave no speedup and no memory reduction.
  • On another architecture, both learn.to_fp16() and learn.to_native_fp16() gave a memory reduction (learn.to_fp16() slightly more than learn.to_native_fp16()), and both saw a 1.6X speedup against the fp32 baseline.

Multi-GPU is even worse: learn.to_native_fp16() offers almost no memory reduction and no speed improvement. learn.to_fp16() gives both a memory reduction and a speed improvement for some architectures, but not all.

I dug further and found that learn.to_native_fp16() uses the PyTorch-native AMP code (torch.cuda.amp, the successor to APEX amp) that shipped in PyTorch 1.6, so it should be as good as it gets, since it’s supported by both PyTorch and NVIDIA? But… fastai’s to_fp16() mode is still doing better in every scenario (where it runs), although its multi-GPU compatibility is going to be YMMV. Also, while the native fp16 mode works with multi-GPU, it makes no difference in terms of speed or memory reduction, so it might as well be broken.
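
To double-check that, here’s the bare torch.cuda.amp loop that PyTorch 1.6 provides (autocast + GradScaler), which is what to_native_fp16() should be wrapping; the toy model is just a stand-in. If this loop shows a speedup on your architecture but the fastai wrapper doesn’t, the callback integration would be the thing to look at:

    # Bare torch.cuda.amp training loop (PyTorch 1.6+), toy model as a stand-in
    import torch
    from torch import nn, optim
    from torch.cuda.amp import autocast, GradScaler

    model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
    optimizer = optim.SGD(model.parameters(), lr=1e-2)
    scaler = GradScaler()

    for _ in range(100):
        x = torch.randn(64, 512).cuda()
        y = torch.randint(0, 10, (64,)).cuda()
        optimizer.zero_grad()
        with autocast():                      # ops run in fp16 where it's safe
            loss = nn.functional.cross_entropy(model(x), y)
        scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
        scaler.step(optimizer)                # unscales grads, skips the step on inf/nan
        scaler.update()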