Model Parallelism and VRAM pooling

Note that Model Parallel doesn’t offer a speed-up per se, but rather a way to train a model that is too big to fit on a single GPU. If the model can fit on a single GPU, splitting it among several GPUs on a single host will actually slow down training. See https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html#single-machine-model-parallel-best-practices.
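
To make the trade-off concrete, here is a minimal sketch of the single-machine model-parallel pattern from the linked tutorial; the layer sizes, device IDs, and class name are placeholders, not code from the tutorial itself:

```python
import torch
import torch.nn as nn

# Minimal single-machine model parallelism: each half of the model lives on
# a different GPU, and activations are moved between devices in forward().
# Layer sizes and device IDs are placeholders.
class TwoGpuModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(1024, 4096).to("cuda:0")  # first half on GPU 0
        self.part2 = nn.Linear(4096, 10).to("cuda:1")    # second half on GPU 1

    def forward(self, x):
        x = torch.relu(self.part1(x.to("cuda:0")))
        # Copying the intermediate activations to the second GPU serializes
        # the two halves -- this is why naive model parallelism is slower
        # than a single GPU whenever the model fits on one device.
        return self.part2(x.to("cuda:1"))

model = TwoGpuModel()
out = model(torch.randn(8, 1024))  # output lives on cuda:1
```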

But for Data Parallel, this post suggests NVLink may help.
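
For contrast, a minimal data-parallel sketch (single-process `nn.DataParallel`; sizes are placeholders). Here the model is replicated on every GPU and each replica processes a slice of the batch, and the gradient/parameter synchronization between replicas is the inter-GPU traffic that NVLink can speed up:

```python
import torch
import torch.nn as nn

# Minimal data parallelism: the same model is replicated on all visible GPUs
# and the input batch is split across them. Layer sizes are placeholders.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = nn.DataParallel(model).to("cuda")   # replicates across all visible GPUs

out = model(torch.randn(32, 1024))          # batch is split across the GPUs
```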
