Hi, I’m curious about the usage of these new callbacks as well as their interaction with PeakMemMetric.
I first tried just calling learn.to_parallel() on an ml.p3.8xlarge EC2 instance (4 V100s). I assumed I should multiply my batch size by the number of GPUs, but that resulted in a CUDA OOM crash (the original bs worked fine on 1 GPU). I then tested with and without to_parallel at the same batch size and got identical epoch times and PeakMemMetric results. I’m running this through SageMaker, so it’s a bit awkward to query the GPUs with nvidia-smi, but it looks like only one GPU may actually be active. Anybody have thoughts or suggestions?
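For reference, here’s a stripped-down sketch of what I’m running (the dataset, architecture, and batch size are stand-ins for my actual SageMaker setup):

```python
from fastai.vision import *
from fastai.distributed import *                 # patches Learner with to_parallel()
from fastai.callbacks.mem import PeakMemMetric

path = untar_data(URLs.MNIST_SAMPLE)             # stand-in dataset for illustration
data = ImageDataBunch.from_folder(path, bs=64)   # original bs; bs=64*4 is what OOM'd
learn = cnn_learner(data, models.resnet34, metrics=accuracy,
                    callback_fns=PeakMemMetric)  # per-epoch CPU/GPU memory tracing
learn.to_parallel()                              # expecting this to spread the work over all 4 V100s
learn.fit_one_cycle(1)
```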
Also, what’s the expected behavior of PeakMemMetric when using parallel or distributed training?
I also came across an issue with to_distributed and unet_learner: given arbitrary input sizes for the images, the computation takes much longer than when a fixed image size is provided. Is this a common behaviour for the module? A sketch of the two setups I compared is below.
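Roughly what the fixed-size variant looks like (paths, codes, the mask-lookup function, the size, and the batch size are all placeholders for my real data; launched with `python -m torch.distributed.launch --nproc_per_node=4 train_seg.py`, which passes `--local_rank` to the script):

```python
import argparse
import torch
from fastai.vision import *
from fastai.distributed import *                 # patches Learner with to_distributed()

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

path_img = Path('data/images')                   # placeholder: image folder
path_lbl = Path('data/labels')                   # placeholder: mask folder
codes    = ['background', 'object']              # placeholder class codes
get_y_fn = lambda x: path_lbl/f'{x.stem}_mask{x.suffix}'   # placeholder mask lookup

src = (SegmentationItemList.from_folder(path_img)
       .split_by_rand_pct(0.2)
       .label_from_func(get_y_fn, classes=codes))

# Fixed size: every batch has the same shape. The "arbitrary size" run simply
# omitted the size argument and kept the original image dimensions.
data = (src.transform(get_transforms(), size=256, tfm_y=True)
        .databunch(bs=8)
        .normalize(imagenet_stats))

learn = unet_learner(data, models.resnet34)
learn.to_distributed(args.local_rank)
learn.fit_one_cycle(1)
```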