learn.TTA(is_test=True) not supporting half precision models?

Hi,
I ran into the following error when trying to do learn.TTA(is_test=True).
‘RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same’
It looks like the input from learn.data.test_dl is not converted to half precision.

I tried passing a CallbackHandler([mp_cb]) as cb_handler to the validate() call under get_preds() in tta.py, where mp_cb is the MixedPrecision callback from learn.callbacks, hoping the callback handler would convert the test image data to fp16, but the error is still there.

Can someone please confirm whether this observation is correct, and suggest how TTA on the test dataset can be done with fp16? Thanks

1 Like

This is how I do it so far (rough sketch below)… Maybe there is a better way…

  1. Train your Model on FP16
  2. Save Weights
  3. Create new Learner (not FP16, this time)
  4. Load saved weights
  5. Make predictions
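
A rough sketch of those steps (fastai v1; data, model, the metrics and the weight name 'stage-1-fp16' are placeholders for your own objects):

# 1-2: train in mixed precision and save the weights
learn16 = Learner(data, model, metrics=[accuracy]).to_fp16()
learn16.fit_one_cycle(5)
learn16.save('stage-1-fp16')

# 3-5: build a fresh full-precision learner, load the weights, predict
learn32 = Learner(data, model, metrics=[accuracy])
learn32.load('stage-1-fp16')
preds, targets = learn32.get_preds(ds_type=DatasetType.Test)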
3 Likes

There is a better way :wink:
Just type data.test_dl.add_tfm(to_half) to have your test dataloader convert the tensors to half precision. I’ll add this to the MixedPrecision callback so that the bug is fixed.
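
For example, assuming a learner already put in mixed precision with to_fp16() and a databunch that has a test set attached (a sketch of the usage, not an exact call from the library):

learn.data.test_dl.add_tfm(to_half)   # test batches now arrive as fp16
preds, _ = learn.TTA(is_test=True)    # inputs now match the fp16 weights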

4 Likes

awesome thanks!

Running learn.validate(data.valid_dl.add_tfm(to_half)) works but when I run learn.TTA(data.valid_dl.add_tfm(to_half)) I get the following error:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/functional.py in softmax(input, dim, _stacklevel)
    982     if dim is None:
    983         dim = _get_softmax_dim('softmax', input.dim(), _stacklevel)
--> 984     return input.softmax(dim)
    985 
    986 

RuntimeError: softmax is not implemented for type torch.HalfTensor

Creating a new learner without FP16 also creates an error:

~/anaconda3/envs/fastai/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    311     def forward(self, input):
    312         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 313                         self.padding, self.dilation, self.groups)
    314 
    315 

RuntimeError: Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

Am I missing something?
Did something change in the library?
Is there also a trick for ClassificationInterpretation.from_learner(learn)?

Kind regards
Michael

To use TTA you have to average the probabilities over different augmented inputs, so you need to transform the last activations into probabilities; that’s why there is a softmax there. You should compute it in full precision anyway, to avoid any numerical instability.

My advice would be to load your model in a clean learner in full precision for this.
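
As a minimal illustration of that last point (xb here is a placeholder for a single half-precision batch), cast the logits to float before taking the softmax:

logits = learn.model(xb)                   # fp16 activations from the model
probs = logits.float().softmax(dim=1)      # softmax computed in full precision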

1 Like

How do you do this?
Just creating a model and loading your weights (which are half tensors) does not work.
Do you have a trick like learn.data.valid_dl.add_tfm(to_half) but to transform everything back to float32?

A clean learner that just loads the half weights still produces half-tensor outputs:

learn32 = Learner(data, arch , metrics=[accuracy_thresh, f1])
learn32.loss_func=FocalLoss()
learn32.load('dk_se_64')
p_v, t_v = learn32.get_preds()
p_v.dtype
>>torch.float16

I think it’s as simple as learn.model.float().
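
Continuing the snippet from the post above (same placeholder names), a sketch would be:

learn32 = Learner(data, arch, metrics=[accuracy_thresh, f1])
learn32.loss_func = FocalLoss()
learn32.load('dk_se_64')
learn32.model.float()            # cast the loaded half weights back to fp32
p_v, t_v = learn32.get_preds()   # predictions now come out as float32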

2 Likes

It worked without the learn.model.float() but I got this strange error at the beginning:

Input type (torch.cuda.HalfTensor) and weight type (torch.cuda.FloatTensor) should be the same

After checking the models modules weight types with

learn.model[0][0], learn.model[0][0].weight.type()
Out: (Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False),
 'torch.cuda.FloatTensor')

and finding out that they were already of type torch.cuda.FloatTensor, I just recreated the learner with a freshly created databunch and it worked!

With this setup I could run learn.get_preds(), learn.validate(), learn.TTA(), and ClassificationInterpretation.from_learner(learn) without problems.

Therefore, it seems like the databunch gets transformed to FP16 when the FP16 learner is created with it, and this causes problems later on when an FP32 learner is created with the old databunch?

When using FP16, all your dataloaders are converted to half precision by adding a transform to them:

self.learn.data.train_dl.add_tfm(to_half)

You can remove it with

self.learn.data.train_dl.remove_tfm(to_half)
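
Putting it together, a sketch for switching an existing databunch back to full precision before evaluating with an FP32 learner (assuming the usual train/valid and an optional test dataloader):

learn.data.train_dl.remove_tfm(to_half)
learn.data.valid_dl.remove_tfm(to_half)
if learn.data.test_dl is not None:
    learn.data.test_dl.remove_tfm(to_half)
learn.model.float()   # weights back to torch.float32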
3 Likes

Thank you for the explanation, with that everything makes sense. :smiley:

Side question:
Is the FocalLoss() loss function you used a custom loss function or is it an inbuilt FocalLoss() function in fastai / pytorch?

1 Like

Custom. You can use any PyTorch loss function with fastai.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    "Focal loss on top of binary cross-entropy with logits (multi-label style targets)."
    def __init__(self, gamma=2):
        super().__init__()
        self.gamma = gamma

    def forward(self, input, target):
        if not (target.size() == input.size()):
            raise ValueError("Target size ({}) must be the same as input size ({})"
                             .format(target.size(), input.size()))

        # Numerically stable binary cross-entropy with logits
        max_val = (-input).clamp(min=0)
        loss = input - input * target + max_val + \
            ((-max_val).exp() + (-input - max_val).exp()).log()

        # Focal weighting: (1 - p_t)^gamma, computed in log-space via logsigmoid
        invprobs = F.logsigmoid(-input * (target * 2.0 - 1.0))
        loss = (invprobs * self.gamma).exp() * loss

        return loss.sum(dim=1).mean()
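
You would then plug it in like in the earlier post, applying it to the raw model outputs with targets of the same shape (one-hot / multi-label style):

learn.loss_func = FocalLoss(gamma=2)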
6 Likes

I used the above function but I got an error:

LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-23-399ce5aa3598> in <module>()
----> 1 learn.lr_find()
      2 learn.recorder.plot(suggestion=True)

5 frames

<ipython-input-22-fca0f4f684a5> in forward(self, input, target)
     11         if not (target.size() == input.size()):
     12             raise ValueError("Target size ({}) must be the same as input size ({})"
---> 13                              .format(target.size(), input.size()))
     14 
     15         max_val = (-input).clamp(min=0)

ValueError: Target size (torch.Size([16])) must be the same as input size (torch.Size([16, 100]))

What are your sizes?
To check the model output and target sizes use:

x,t = next(iter(data.train_dl))
learn.model(x).shape, t.shape

I ran it and got:

(torch.Size([16, 100]), torch.Size([16]))
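
That mismatch suggests this FocalLoss expects multi-label / one-hot targets with the same shape as the model output, while your databunch returns integer class indices. A rough sketch of how the shapes would line up, assuming 100 classes as in the sizes above:

t_onehot = torch.zeros(t.size(0), 100, device=t.device).scatter_(1, t.view(-1, 1), 1.)
t_onehot.shape   # torch.Size([16, 100]), now matching the model output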

Thanks, that works for me. That’s super helpful!

Is it the case that after learn.to_fp16(), learn.save() or learn.export() actually saves in fp32 format? And does load_learner then load in full fp32?

No, you have to do learn.to_fp32() to cast it back to full precision.
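
A minimal sketch of that flow (assuming the standard fastai v1 export path and default file names):

learn.to_fp32()                  # cast the learner back to full precision
learn.export()                   # saves export.pkl with fp32 weights
learn_inf = load_learner(path)   # loads a full-precision learner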