Can I use top_k_accuracy as a loss function?

Whenever I try to change the loss function to top_k_accuracy, I tend to get errors. Am I using it correctly?

learn = cnn_learner(data,
                    arch,
                    metrics=[accuracy, top_k_accuracy],
                    loss_func=top_k_accuracy,
                    callback_fns=[ShowGraph]
                   ).to_fp16().mixup()

The metric calculates correctly until I try to set it as the loss function.

The error keeps changing on me. The most common one is

got an unexpected keyword argument 'reduction'

which led me to update my packages, but with no success. I have also seen errors about tensors not being correctly defined or arguments not being passed properly.

Hi James,

As far as I know, top_k_accuracy is a metric, not a loss function. A loss function compares how well the predictions of the net match the real values during the training step and produces a score that is back-propagated through the network, punishing false negatives/positives and so on depending on how the loss function was designed, whereas a metric only performs much simpler aggregation operations. Indeed, you can see the definition of top_k_accuracy in the documentation:

def top_k_accuracy(input:Tensor, targs:Tensor, k:int=5)->Rank0Tensor:
    "Computes the Top-k accuracy (target is in the top k predictions)."
    input = input.topk(k=k, dim=-1)[1]
    targs = targs.unsqueeze(dim=-1).expand_as(input)
    return (input == targs).max(dim=-1)[0].float().mean()

In this line

input.topk(k=k, dim=-1)[1]

top_k_accuracy is calling the topk method from PyTorch. This method returns two tensors: first the top-k values (scores) and second the indices of the predicted classes. As you can see, top_k_accuracy only keeps the predicted classes, so any hope of using this method as a valid loss function is gone: you can't use class indices to back-propagate any useful information to train your network; for that, your loss function needs to output either scores or probabilities.

But maybe you can adapt that function to do what you want. An easy first try would be:

input.topk(k=k, dim=-1)[0]
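
To see the difference concretely, here is a tiny standalone example of what torch.topk returns (made-up numbers, just to illustrate):

import torch

# torch.topk returns (values, indices): the values keep the gradient,
# the indices do not.
logits = torch.tensor([[0.1, 2.0, 0.5, 3.0]], requires_grad=True)
values, indices = logits.topk(k=2, dim=-1)
print(values)   # tensor([[3., 2.]], grad_fn=<TopkBackward...>) -> differentiable scores
print(indices)  # tensor([[3, 1]])                              -> just class ids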

Good luck.

Thanks for the great response! I tried the new function and you are right, it will need some more adaptation.

I will likely move on to some other areas and come back to this later.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-16-399ce5aa3598> in <module>
----> 1 learn.lr_find()
      2 learn.recorder.plot(suggestion=True)

~/anaconda3/lib/python3.7/site-packages/fastai/train.py in lr_find(learn, start_lr, end_lr, num_it, stop_div, wd)
     30     cb = LRFinder(learn, start_lr, end_lr, num_it, stop_div)
     31     epochs = int(np.ceil(num_it/len(learn.data.train_dl)))
---> 32     learn.fit(epochs, start_lr, callbacks=[cb], wd=wd)
     33 
     34 def to_fp16(learn:Learner, loss_scale:float=None, max_noskip:int=1000, dynamic:bool=True, clip:float=None,

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(self, epochs, lr, wd, callbacks)
    197         callbacks = [cb(self) for cb in self.callback_fns + listify(defaults.extra_callback_fns)] + listify(callbacks)
    198         if defaults.extra_callbacks is not None: callbacks += defaults.extra_callbacks
--> 199         fit(epochs, self, metrics=self.metrics, callbacks=self.callbacks+callbacks)
    200 
    201     def create_opt(self, lr:Floats, wd:Floats=0.)->None:

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in fit(epochs, learn, callbacks, metrics)
     99             for xb,yb in progress_bar(learn.data.train_dl, parent=pbar):
    100                 xb, yb = cb_handler.on_batch_begin(xb, yb)
--> 101                 loss = loss_batch(learn.model, xb, yb, learn.loss_func, learn.opt, cb_handler)
    102                 if cb_handler.on_batch_end(loss): break
    103 

~/anaconda3/lib/python3.7/site-packages/fastai/basic_train.py in loss_batch(model, xb, yb, loss_func, opt, cb_handler)
     32     if opt is not None:
     33         loss,skip_bwd = cb_handler.on_backward_begin(loss)
---> 34         if not skip_bwd:                     loss.backward()
     35         if not cb_handler.on_backward_end(): opt.step()
     36         if not cb_handler.on_step_end():     opt.zero_grad()

~/anaconda3/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    105                 products. Defaults to ``False``.
    106         """
--> 107         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    108 
    109     def register_hook(self, hook):

~/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     91     Variable._execution_engine.run_backward(
     92         tensors, grad_tensors, retain_graph, create_graph,
---> 93         allow_unreachable=True)  # allow_unreachable flag
     94 
     95 

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

Two points here:

First of all, the optimizer tries to minimize the loss function, so if you use accuracy as a loss function you will push the net towards the worst possible result (you would at least need to negate it).

Secondly, I am not sure PyTorch allows you to get the gradient of the accuracy (I don’t think you can get the derivative of input == targs).
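
You can check this quickly: the comparison produces a boolean tensor that is detached from the graph, which is exactly the "does not require grad" error above.

import torch

x = torch.randn(3, requires_grad=True)
t = torch.tensor([1., 0., 0.])
eq = (x == t)                   # BoolTensor, no grad_fn
print(eq.requires_grad)         # False
# eq.float().mean().backward()  # -> RuntimeError: element 0 of tensors
#                               #    does not require grad ...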

On the other hand, if you really, really want to, you could modify the cross-entropy loss to take into account the sum of the top-k probabilities rather than only the target's probability, but I suspect you would not get much improvement.
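
Something along these lines, for example (just a rough sketch of that idea, not a fastai or PyTorch built-in, and untested for actual gains):

import torch
import torch.nn.functional as F

def top_k_cross_entropy(logits, targets, k=5):
    # Rough sketch: if the target is already inside the top k, penalise
    # -log(sum of the top-k probabilities); otherwise fall back to the
    # usual cross entropy. Everything here stays differentiable.
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k=k, dim=-1)
    in_topk = (topk_idx == targets.unsqueeze(-1)).any(dim=-1)
    topk_loss = -topk_probs.sum(dim=-1).clamp_min(1e-8).log()
    ce_loss = F.cross_entropy(logits, targets, reduction='none')
    return torch.where(in_topk, topk_loss, ce_loss).mean()

You would probably still have to adapt it to the way fastai wraps and calls its loss functions (e.g. the reduction keyword mentioned earlier in the thread).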

You could, though, try modifying the temperature of the softmax. That might be an easy change that could get you better top-k accuracy at the expense of a worse loss (because the net will make less confident predictions).

Thanks, this helps. I made the mistake of thinking accuracy was the loss function being used (it’s actually FlattenedLoss of CrossEntropyLoss()) rather than just a metric.

I fear my model has too many classes, and I wanted to make sure progress was being made as long as the predictions were at least close. A “that’s wrong, but you are moving in the right direction”.

I was wondering if slowly reducing the top k as the model got better would help. A kind of: if you call a Boston Terrier an American Terrier at the beginning of training, that’s OK; later on, however, you will need to know the difference.

Maybe that isn’t necessary with CrossEntropyLoss.

Two possibilities I can think of that might help:

1 - Start training with a high temperature in the softmax (I’m not sure if there’s a built-in way of doing it, you’ll have to check; see the sketch after this list) and put it back to 1 at later stages of training. As far as I understand, although the loss will be higher in the first part of the training, you will force the net to learn more transferable features, and after that you would be fine-tuning for accuracy. I am taking this idea from this paper.

2 - Start training with less granular classes (in your example above you would train the network with one big “Terrier” class). After that, build a new model with all the classes and load the pretrained weights. This has helped me in the past in cases where I had a lot of classes arranged in some kind of hierarchy. It did not make the model achieve better overall performance, but it made it converge way faster.

You can probably also try combining the two strategies and see what happens.
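
For the temperature idea, a minimal sketch of what I mean (my own, not a fastai built-in; you would likely still need to wrap it the way fastai wraps its losses):

import torch.nn as nn
import torch.nn.functional as F

class TemperatureCrossEntropy(nn.Module):
    # Sketch of cross entropy with a softmax temperature T: T > 1 softens the
    # output distribution (less confident predictions), T = 1 is the usual loss.
    def __init__(self, T=1.0):
        super().__init__()
        self.T = T

    def forward(self, logits, targets):
        return F.cross_entropy(logits / self.T, targets)

# e.g. start with learn.loss_func = TemperatureCrossEntropy(T=4.0)
# and set learn.loss_func.T = 1.0 for the final epochs.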

Hi @edxz7,
I have a multi-label scenario and I wanted to make use of the top_k_accuracy metric with k=3, so I have written the partial function below:
top_k_accuracy_3 = partial(top_k_accuracy, k=3)

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=1, metrics=[top_k_accuracy_3])

Now, when I start fine-tuning my domain-specific language classifier
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=1, metrics=[top_k_accuracy_3])
with the call below
learn.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))

I get the following error:

RuntimeError: expand(torch.cuda.FloatTensor{[16, 55, 1]}, size=[16, 3]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)

Can someone tell me what is wrong with my approach?

You need a smooth, differentiable function to optimize for. This paper might be of interest:

Hi Sai,

I think your problem is related to an incompatibility between the shapes of the tensors targs and input defined in the top_k_accuracy method (see my post above).

As you can see in the error trace, the expand_as method is triggering the error. If you check the docstring for this method you will see:

expand_as(other) -> Tensor

Expand this tensor to the same size as :attr:`other`.

So this method is used to make the size of this match the size of other, and it only works when this does not have more dimensions than other. In your case it is trying to expand a this tensor with three dimensions ([16, 55, 1]) to match an other with only two ([16, 3]).
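
A tiny standalone illustration of the rule (shapes taken from your error message):

import torch

targs_like = torch.zeros(16, 55, 1)   # "this": 3 dimensions
input_like = torch.zeros(16, 3)       # "other": 2 dimensions
# targs_like.expand_as(input_like)    # -> RuntimeError: the number of sizes
#                                     #    provided (2) must be greater or equal
#                                     #    to the number of dimensions (3)
print(torch.zeros(16, 1).expand_as(input_like).shape)  # works: torch.Size([16, 3])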

To solve the problem, you need to squeeze one of the two tensors before this calculation takes place so that their sizes become compatible. Because I don’t have the whole error trace or the details of your data I can’t tell you much more, but I think you need to squeeze the targs (this) tensor, so you will need to trace back your inputs to identify which of them plays the role of targs. I’m pretty sure that role is played by your data_clas targets, so an easy thing to try is:

data_clas = data_clas.squeeze()

But I really don’t know :smile:.
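
If your targets really are one-hot encoded labels of shape [batch_size, n_classes] (which a multi-label setup suggests), another option is to write a small multi-label variant of the metric yourself. This is just a sketch under that assumption, not the fastai function:

import torch

def top_k_accuracy_multilabel(preds, targs, k=3):
    # Sketch: targs is a [bs, n_classes] 0/1 tensor; count a sample as correct
    # when at least one of its true labels appears in the top-k predictions.
    topk_idx = preds.topk(k=k, dim=-1)[1]            # [bs, k] predicted classes
    hits = targs.gather(dim=-1, index=topk_idx) > 0  # [bs, k] True where a true label was hit
    return hits.any(dim=-1).float().mean()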

Good luck.
