Lesson 3 CAMVID Half-precision issue (.to_fp16())

gbecon · November 9, 2018, 1:58pm

Hello,

When I run the big-image version of the lesson3-camvid.ipynb notebook in half precision to avoid memory problems (I am using a 1080 Ti):

learn = Learner.create_unet(data, models.resnet34, metrics=metrics).to_fp16()

everything trains fine, and I get a pretty good 0.93 accuracy. But when I call learn.show_results(), I get the following error:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

Any suggestions on how to fix this?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-46-c3b657dcc9ae> in <module>()
----> 1 learn.show_results()

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/fastai/vision/learner.py in show_results(self, ds_type, rows, figsize)
     47     def show_results(self, ds_type=DatasetType.Valid, rows:int=3, figsize:Tuple[int,int]=None):
     48         dl = self.dl(ds_type)
---> 49         preds = self.pred_batch()
     50         figsize = ifnone(figsize, (8,3*rows))
     51         _,axs = plt.subplots(rows, 2, figsize=figsize)

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/fastai/basic_train.py in pred_batch(self, ds_type, pbar)
    216         nw = dl.num_workers
    217         dl.num_workers = 0
--> 218         preds,_ = self.get_preds(ds_type, with_loss=False, n_batch=1, pbar=pbar)
    219         dl.num_workers = nw
    220         return preds

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/fastai/basic_train.py in get_preds(self, ds_type, with_loss, n_batch, pbar)
    209         lf = self.loss_func if with_loss else None
    210         return get_preds(self.model, self.dl(ds_type), cb_handler=CallbackHandler(self.callbacks),
--> 211                          activ=_loss_func2activ(self.loss_func), loss_func=lf, n_batch=n_batch, pbar=pbar)
    212 
    213     def pred_batch(self, ds_type:DatasetType=DatasetType.Valid, pbar:Optional[PBar]=None) -> List[Tensor]:

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/fastai/basic_train.py in get_preds(model, dl, pbar, cb_handler, activ, loss_func, n_batch)
     36     "Tuple of predictions and targets, and optional losses (if `loss_func`) using `dl`, max batches `n_batch`."
     37     res = [torch.cat(o).cpu() for o in
---> 38            zip(*validate(model, dl, cb_handler=cb_handler, pbar=pbar, average=False, n_batch=n_batch))]
     39     if loss_func is not None: res.append(calc_loss(res[0], res[1], loss_func))
     40     if activ is not None: res[0] = activ(res[0])

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/fastai/basic_train.py in validate(model, dl, loss_func, cb_handler, pbar, average, n_batch)
     49         for xb,yb in progress_bar(dl, parent=pbar, leave=(pbar is not None)):
     50             if cb_handler: xb, yb = cb_handler.on_batch_begin(xb, yb, train=False)
---> 51             val_losses.append(loss_batch(model, xb, yb, loss_func, cb_handler=cb_handler))
     52             if not is_listy(yb): yb = [yb]
     53             nums.append(yb[0].shape[0])

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     16     if not is_listy(xb): xb = [xb]
     17     if not is_listy(yb): yb = [yb]
---> 18     out = model(*xb)
     19     out = cb_handler.on_loss_begin(out)
     20 

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    477             result = self._slow_forward(*input, **kwargs)
    478         else:
--> 479             result = self.forward(*input, **kwargs)
    480         for hook in self._forward_hooks.values():
    481             hook_result = hook(self, input, result)

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    477             result = self._slow_forward(*input, **kwargs)
    478         else:
--> 479             result = self.forward(*input, **kwargs)
    480         for hook in self._forward_hooks.values():
    481             hook_result = hook(self, input, result)

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/torch/nn/modules/container.py in forward(self, input)
     90     def forward(self, input):
     91         for module in self._modules.values():
---> 92             input = module(input)
     93         return input
     94 

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    477             result = self._slow_forward(*input, **kwargs)
    478         else:
--> 479             result = self.forward(*input, **kwargs)
    480         for hook in self._forward_hooks.values():
    481             hook_result = hook(self, input, result)

~/anaconda3/envs/fastaiv1/lib/python3.7/site-packages/torch/nn/modules/conv.py in forward(self, input)
    311     def forward(self, input):
    312         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 313                         self.padding, self.dilation, self.groups)
    314 
    315 

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

sgugger · November 9, 2018, 3:45pm

Check that the dataloader you use (the validation one, I’m guessing) is still producing FP16 inputs; it looks like it’s giving the model full-precision inputs.
You can add the transform that converts the tensors to half precision with:

learn.data.valid_dl.add_tfm(to_half)

Another workaround is to put your model back in full precision with learn.model.float().
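
The two workarounds above can be sketched in plain PyTorch, outside fastai. This is only an illustration under stated assumptions: the one-layer model is a stand-in for the U-Net, `cast_to_model_dtype` is a hypothetical helper (not a fastai function), and calling `.half()` on the model only mimics the state that half-precision training leaves the weights in. It runs on CPU, so no forward pass is done in float16.

```python
import torch
from torch import nn


def cast_to_model_dtype(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: cast a batch to the dtype of the model's first parameter."""
    param_dtype = next(model.parameters()).dtype
    return x.to(param_dtype)


# Tiny stand-in for the U-Net, with its weights converted to float16.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1)).half()

# A float32 batch, like the one an untransformed validation dataloader yields.
batch = torch.randn(2, 3, 16, 16)

# Workaround 1: convert the inputs so they match the half-precision weights.
half_batch = cast_to_model_dtype(model, batch)

# Workaround 2: put the weights back in full precision and feed float32 inputs.
model = model.float()
with torch.no_grad():
    out = model(batch)
```

Either way, the point is the same: the input dtype and the weight dtype have to agree before the first convolution runs.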

jeremy · November 9, 2018, 4:51pm

We should fix things at our end so that this “just works”.

gbecon · November 9, 2018, 6:24pm

I can confirm that running either

learn.model.float()
learn.show_results()

or

learn.data.valid_dl.add_tfm(to_half)
learn.show_results()

works fine.

It would be interesting to hear your opinion on which approach would be better (faster or more accurate) when serving model predictions in a “production” environment.

jeremy · November 9, 2018, 7:30pm

On CPU, you’d want to use the first approach I think.
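
On the accuracy side of the question, it may help to see how coarse float16 is. The sketch below uses only the Python standard library (the `struct` format code `e` is IEEE 754 half precision); `round_to_half` is a hypothetical helper name. It says nothing about the accuracy of any particular model, only about how individual numbers are rounded.

```python
import struct


def round_to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Half precision has a 10-bit mantissa, so the gap between neighbouring
# representable values grows quickly with magnitude.
for value in (0.1, 1.001, 4099.0):
    as_half = round_to_half(value)
    print(f"{value!r} -> {as_half!r} (abs error {abs(value - as_half):.2e})")
```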

Mauro · January 3, 2019, 10:46pm

I recently tried both of those approaches, but I still get the error:

Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

Is there anything else you tried in addition?

Edit: I got it working. I restarted the kernel and ran only learn.model.float(). If you try the other approach first, you’ll keep getting errors until you restart the kernel.

KarlH · January 27, 2019, 7:30am

I’m trying to train in half precision. The issue I’m running into is that the y values in the dataloader are not converted to half precision when I add the transform via learn.data.train_dl.add_tfm(to_half). Is there a way to make the transform apply to the y values as well?
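
One possible shape for such a transform is sketched below in plain PyTorch. The assumptions are that the transform receives the whole batch as an (x, y) pair, which may depend on the fastai version, and that integer targets (for example segmentation class indices) should keep their dtype. `half_floats_in_batch` is a hypothetical name, not a fastai function.

```python
import torch


def half_floats_in_batch(batch):
    """Hypothetical batch transform: cast floating-point x and y to float16."""

    def cast(t: torch.Tensor) -> torch.Tensor:
        # Integer targets (e.g. class indices) must keep their dtype.
        return t.half() if torch.is_floating_point(t) else t

    x, y = batch
    return cast(x), cast(y)


x = torch.randn(4, 3, 8, 8)              # float32 inputs
y_regression = torch.randn(4, 1)         # float32 targets: should be converted
y_classes = torch.randint(0, 10, (4,))   # int64 targets: should be left alone

xh, yh = half_floats_in_batch((x, y_regression))
_, yc = half_floats_in_batch((x, y_classes))
```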

pierreguillou · February 2, 2019, 2:32pm

Hello. When running learn.show_results(), there is an issue with the denormalize() function: it uses float32 tensors (mean and std) even though the learner was created with .to_fp16().

Details about the issue.

I had a CUDA out-of-memory issue when running learn.show_results() with the model in float32.

learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd)
learn.load('stage-1')
learn.show_results(rows=1, figsize=(8,9))

The error message was:

RuntimeError: CUDA out of memory. Tried to allocate 522.13 MiB (GPU 0; 8.00 GiB total capacity; 6.18 GiB already allocated; 58.66 MiB free; 37.75 MiB cached)

Then, in order to use float16, I ran the following code:

learn = unet_learner(data, models.resnet34, metrics=metrics, wd=wd).to_fp16()
learn.load('stage-1')
learn.data.valid_dl.add_tfm(to_half)
learn.show_results(rows=1, figsize=(8,9))

But this time, I got the following error:

The problem comes from the ImageNet mean/std tensors that are still in float32.
How can I solve this issue? Thanks.
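
One way to sidestep a float16/float32 clash of this kind is to upcast the image batch to float32 before denormalizing, since the result is only needed for display. The sketch below is plain PyTorch and is not fastai’s own denormalize(); `denormalize_for_display` is a hypothetical helper, and the mean/std values are the usual ImageNet statistics.

```python
import torch


def denormalize_for_display(x, mean, std):
    """Undo per-channel normalization, working in float32 on the CPU."""
    # Upcast the (possibly float16, possibly CUDA) batch so that it matches
    # the float32 statistics.
    x = x.detach().float().cpu()
    mean = torch.as_tensor(mean, dtype=torch.float32)
    std = torch.as_tensor(std, dtype=torch.float32)
    # Broadcast the per-channel statistics over (batch, channel, height, width).
    return x * std[None, :, None, None] + mean[None, :, None, None]


imagenet_mean = [0.485, 0.456, 0.406]
imagenet_std = [0.229, 0.224, 0.225]

# A half-precision batch of normalized "images" (all zeros for simplicity).
half_batch = torch.zeros(2, 3, 4, 4).half()
images = denormalize_for_display(half_batch, imagenet_mean, imagenet_std)
```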

jls · February 14, 2019, 7:50am

Thanks! It works!

champs.jaideep · July 21, 2019, 6:46am

My issue is the reverse: the input is a HalfTensor, but the weights are still in full precision (FloatTensor).
This happens while running learn.fit.
Is this a bug?

champs.jaideep · July 21, 2019, 8:37am

Using learn.model.half() brought the input and weight types back in sync.

surabhi · September 12, 2019, 1:08am

I was getting the same error. Just using “learn.model.float()” fixed the problem

franva · January 15, 2020, 12:54pm

thanks @gbecon

learn.model.float()

is my lifesaver!