Mixed precision training error with InceptionV3 model

Hi, I am training a modified InceptionV3 model on an AWS P3 instance with fastai v1.0.51. Training was fine with regular precision, but I got an error when I tried mixed precision.

Here is the code:

learn = Learner(data, model, metrics=error_rate, callback_fns=InceptionV3Trainer)
learn.split(lambda m: (m.features_after_6e, m.logits))
learn.freeze()    # freeze() leaves only the last layer group (logits) trainable, so it also freezes AuxLogits
requires_grad(model.AuxLogits, True)  # unfreeze AuxLogits
apply_init(model.logits, nn.init.kaiming_normal_)   
apply_init(model.AuxLogits, nn.init.kaiming_normal_)

learn = learn.to_fp16()

learn.lr_find()

I got the following error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-10-d81c6bd29d71> in <module>()
----> 1 learn.lr_find()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, wd)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     epochs = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=None, max_noskip:int=1000, dynamic:bool=True, clip:float=None,

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    194         callbacks = [cb(self) for cb in self.callback_fns] + listify(callbacks)
    195         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 196         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    197 
    198     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     98             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
     99                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 100                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    101                 if cb_handler.on_batch_end(loss): break
    102 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     31     if opt is not None:
     32         loss,skip_bwd = cb_handler.on_backward_begin(loss)
---> 33         if not skip_bwd:                     loss.backward()
     34         if not cb_handler.on_backward_end(): opt.step()
     35         if not cb_handler.on_step_end():     opt.zero_grad()

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    100                 products. Defaults to ``False``.
    101         """
--> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    103 
    104     def register_hook(self, hook):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88     Variable._execution_engine.run_backward(
     89         tensors, grad_tensors, retain_graph, create_graph,
---> 90         allow_unreachable=True)  # allow_unreachable flag
     91 
     92 

RuntimeError: Function AddBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.HalfTensor but got torch.cuda.FloatTensor

Any suggestions?

Thanks

I stepped through the code with pdb and narrowed the cause down a bit.

In my model, I use a callback to handle the two outputs from InceptionV3:

class InceptionV3Trainer(LearnerCallback):
    def on_loss_begin(self, last_output, **kwargs):
        "this will be called for both training and inference, so need to handle both"
        if self.learn.model.training:
            "Save aux outputs for later and only returns the true output."
            out, self.aux_out = last_output
        else:   # inference, single out
            out = last_output
        return {'last_output': out}

    def on_backward_begin(self, last_loss, last_target, **kwargs):
        "get weighted sum of loss, this is only called when training"
        aux_loss = self.learn.loss_func(self.aux_out, last_target)
        last_loss += 0.4 * aux_loss
        return {'last_loss': last_loss}

I noticed that before entering on_loss_begin all tensors are FP16, but afterwards they become FP32. I think fastai does not take these custom callbacks into consideration when handling mixed precision.

Any idea how I can make this work?

Thanks!

The loss is computed in FP32 to avoid overflow there. This is done in the MixedPrecisionCallback by setting the output to float precision in on_loss_begin. That callback is the last to run, so you don’t see it inside yours. I think you should apply .float() to your stored aux_out so that aux_loss is also a float in FP32.
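For concreteness, a minimal sketch of that change inside the InceptionV3Trainer callback above (same attribute names as before, only the .float() cast added):

def on_loss_begin(self, last_output, **kwargs):
    "Called for both training and inference, so handle both cases."
    if self.learn.model.training:
        out, aux_out = last_output
        self.aux_out = aux_out.float()   # store the aux output in FP32 so aux_loss matches the FP32 main loss
    else:   # inference: single output
        out = last_output
    return {'last_output': out}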

Converting aux_out to float solved the problem. Thank you very much for your help!


Hi,

I ran into another issue with mixed precision.

When I call plot_top_losses with heatmap on, I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-29a3bc94c305> in <module>()
----> 1 plot_top_losses(interp, 9)   # call the modified function in utils

~/marco/utils.py in plot_top_losses(interp, k, largest, figsize, heatmap, heatmap_thresh, return_fig)
     69                 with hook_output(m.features_after_6e, grad= True) as hook_g:
     70                     preds = m(xb)
---> 71                     preds[0,cl].backward()
     72             acts = hook_a.stored[0].cpu()
     73             if (acts.shape[-1]*acts.shape[-2]) >= heatmap_thresh:

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    100                 products. Defaults to ``False``.
    101         """
--> 102         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    103 
    104     def register_hook(self, hook):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88     Variable._execution_engine.run_backward(
     89         tensors, grad_tensors, retain_graph, create_graph,
---> 90         allow_unreachable=True)  # allow_unreachable flag
     91 
     92 

RuntimeError: expected scalar type Half but found Float

I suspect I need to do a conversion somewhere for my extra model output. Here is my callback code:

class InceptionV3Trainer(LearnerCallback):
    def on_loss_begin(self, last_output, **kwargs):
        "this will be called for both training and inference, so need to handle both"
        if self.learn.model.training:
            "Save aux outputs for later and only returns the true output."
            out, aux_out = last_output
            "Convert half precision output to FP32 to avoid reduction overflow, fastai MixedPrecision callback does not handle extra outputs"
            self.aux_out = aux_out.float()
        else:   # inference, single out
            out = last_output
        return {'last_output': out}

    def on_backward_begin(self, last_loss, last_target, **kwargs):
        "get weighted sum of loss, this is only called when training"
        aux_loss = self.learn.loss_func(self.aux_out, last_target)
        last_loss += 0.4 * aux_loss
        return {'last_loss': last_loss}

Notice that in on_loss_begin I already converted aux_out to FP32, so I don't know why plot_top_losses gives this error.

Also, I use a modified version of plot_top_losses, because my model is not an nn.Sequential and the original function gave an error. Here is my version:

def plot_top_losses(interp, k, largest=True, figsize=(12, 12), heatmap=True, heatmap_thresh=16):
    "Show images in `top_losses` along with their prediction, actual, loss, and probability of actual class."
    tl_val, tl_idx = interp.top_losses(k, largest)
    classes = interp.data.classes
    cols = math.ceil(math.sqrt(k))
    rows = math.ceil(k/cols)
    fig, axes = plt.subplots(rows, cols, figsize=figsize)
    fig.suptitle('prediction/actual/loss/probability', weight='bold', size=14)
    for i, idx in enumerate(tl_idx):
        im, cl = interp.data.dl(interp.ds_type).dataset[idx]    # image and category
        cl = int(cl)
        im.show(
            ax=axes.flat[i], title=f'{classes[interp.pred_class[idx]]}/{classes[cl]} / {interp.losses[idx]:.2f} / {interp.probs[idx][cl]:.2f}')
        if heatmap:
            xb, _ = interp.data.one_item(im, detach=False, denorm=False)    # Get item into a batch
            m = interp.learn.model.eval()
            with hook_output(m.features_after_6e) as hook_a:
                with hook_output(m.features_after_6e, grad=True) as hook_g:
                    preds = m(xb)
                    preds[0, cl].backward()
            acts = hook_a.stored[0].cpu()
            if (acts.shape[-1]*acts.shape[-2]) >= heatmap_thresh:
                grad = hook_g.stored[0][0].cpu()
                grad_chan = grad.mean(1).mean(1)
                mult = F.relu(((acts*grad_chan[..., None, None])).sum(0))
                sz = list(im.shape[-2:])
                axes.flat[i].imshow(mult, alpha=0.6, extent=(
                    0, *sz[::-1], 0), interpolation='bilinear', cmap='magma')

It is exactly the same as the original, except that instead of m[0] I hook directly into the last layer of the model body.

Thanks!


Did you solve this problem? I am running into the same thing.

Convert the learner to FP32 before using interp.
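For example, a minimal sketch (assuming fastai v1's Learner.to_fp32 and ClassificationInterpretation.from_learner, plus the modified plot_top_losses above):

learn = learn.to_fp32()                                     # put the model weights back in FP32
interp = ClassificationInterpretation.from_learner(learn)   # recompute predictions in FP32
plot_top_losses(interp, 9, heatmap=True)                    # the heatmap backward pass now runs entirely in FP32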

I did not find a solution; I was hoping somebody from the fastai team could suggest some ideas.

I don't think converting the learner back to FP32 is the right solution, since I don't have to do that to use interp with plot_top_losses as long as I don't turn heatmap on.