[Lesson 3 - Camvid]: DataParallel problem

For context, I run dual 1080Ti cards. Up to this point I have been able to use both cards by adding learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1]) and adjusting the batch size to maximize VRAM usage. This works great when the learn object comes from create_cnn. However, in this notebook with the latest fastai (1.0.21), I get a RuntimeError when running fit_one_cycle (full traceback below).
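
To show the pattern concretely, here is roughly what that setup looks like. This is only a sketch on a toy dataset (the MNIST sample, not the Camvid notebook itself), and bs=256 is just a placeholder for whatever fills the combined VRAM:

```python
# Minimal sketch of the dual-GPU setup described above, using a toy dataset
# rather than Camvid; bs is just whatever fills the VRAM on both cards.
import torch
from fastai.vision import *

path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path, bs=256)

learn = create_cnn(data, models.resnet34, metrics=accuracy)
learn.model = torch.nn.DataParallel(learn.model, device_ids=[0, 1])  # spread each batch over both GPUs
learn.fit_one_cycle(1)
```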

~/anaconda3/envs/course1018/lib/python3.6/site-packages/fastai/vision/models/unet.py in forward(self, up_in)
     35             up_out = F.interpolate(up_in, s.shape[-2:], mode='bilinear')
     36         up_out = self.upconv(up_out)
---> 37         cat_x = self.bn1(F.relu(torch.cat([up_out, s], dim=1)))
     38         x = self.bn2(F.relu(self.conv1(cat_x)))
     39         x = F.relu(self.conv2(x))

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 4 and 8 in dimension 0 at /opt/conda/conda-bld/pytorch-nightly_1540121100527/work/aten/src/THC/generic/THCTensorMath.cu:83

The notebook runs fine if the line enabling DataParallel is commented out. It appears to be an incompatibility between unet.py and data_parallel.py — the 4 vs 8 mismatch in the error looks consistent with DataParallel splitting a batch of 8 across the two GPUs while the unet's hooked skip connection still holds a tensor with a different batch size.

Any thoughts?


When I try it, it works for me all the way up to either validation or the last batch. I suspect something goes wrong when the batch size changes relative to what the hooks have stored, either at the final (smaller) batch or when validation starts. If there were some way to zero out the hooks, we could write a callback that resets them at the beginning of every epoch, if that is indeed the problem — see the sketch below.
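
If stale stored activations turn out to be the issue, a callback along these lines might do it. This is only a sketch under assumptions: it assumes you can collect the unet's Hook objects yourself (how exactly depends on the model internals), and it relies on fastai Hooks keeping their captured activation in a .stored attribute.

```python
# Hypothetical sketch: clear stored hook activations at the start of each epoch.
# `hooks` is assumed to be an iterable of fastai Hook objects gathered from the
# unet; how to collect them is left open here.
from fastai.callback import Callback

class ResetHooksCallback(Callback):
    def __init__(self, hooks):
        self.hooks = hooks

    def on_epoch_begin(self, **kwargs):
        for h in self.hooks:
            h.stored = None  # drop whatever was captured previously

# usage (hypothetical): learn.fit_one_cycle(10, callbacks=[ResetHooksCallback(hooks)])
```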