"Be Careful What You Backpropagate" paper

This is an interesting paper from the past couple of days:

The gist is that maybe we shouldn’t be using a softmax activation in the last layer of the network, since it might not produce the best gradients for training the rest of the model, and accuracy is usually what we care about most.

I haven’t tried to replicate the results in the paper, and they maybe don’t test their methods on an exhaustive set of problems (or with particularly realistic network architectures/optimizers). But this does seem like a pretty interesting idea: if we care about accuracy, is softmax really the best activation? And is cross-entropy (CE) really the best loss?
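To make the kind of swap concrete, here’s a minimal Keras sketch: the usual softmax + categorical cross-entropy head next to a linear output trained with squared error on one-hot targets. This is just one illustrative alternative pairing, not necessarily the scheme the paper actually proposes, and the architecture/optimizer choices here are arbitrary:

```python
from tensorflow.keras import layers, models

def make_model(output_activation, loss):
    # Tiny fully connected net for MNIST-shaped inputs; sizes are arbitrary.
    model = models.Sequential([
        layers.Flatten(input_shape=(28, 28)),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation=output_activation),
    ])
    # categorical_accuracy just compares argmaxes, so it works for any head/loss pairing.
    model.compile(optimizer="adam", loss=loss, metrics=["categorical_accuracy"])
    return model

baseline = make_model("softmax", "categorical_crossentropy")  # the conventional setup
variant = make_model("linear", "mean_squared_error")          # one possible alternative
```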

It might be interesting to do some kind of search over the space of activation/loss functions to see whether any combination beats softmax/CE on accuracy. It would also be interesting to see whether other activations/losses produce better representations than softmax/CE. (Strong evidence that they do: https://arxiv.org/pdf/1704.08063.pdf)
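A brute-force version of that search could be as simple as looping over candidate (activation, loss) pairs and comparing test accuracy. Rough sketch below, reusing the make_model helper from the previous snippet; the candidate list, epoch count, etc. are placeholders rather than anything principled:

```python
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

# Candidate output-activation / loss pairings to compare on accuracy.
candidates = [
    ("softmax", "categorical_crossentropy"),
    ("linear", "mean_squared_error"),
    ("softmax", "mean_squared_error"),
    ("linear", "categorical_hinge"),
]

for act, loss in candidates:
    model = make_model(act, loss)
    model.fit(x_train, y_train, epochs=3, batch_size=128, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"{act:>8} + {loss:<24} test accuracy: {acc:.4f}")
```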

Would be very interested to hear about any work that’s been done in this direction, in either the academic or Kaggle communities (I somehow suspect the latter may have thought about this more, since they really care about accuracy).

EDIT: FWIW, the results in Table 1 of the paper are a little weird – if you run the Keras MNIST CNN example, you get to < 1% error in < 10 epochs. So their results might be real but not realistic, as I said above. Either way, I still think it’s an interesting line of thought.
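For reference, the baseline I have in mind is roughly the stock mnist_cnn.py example from the Keras repo. Something along these lines (layer sizes recalled from memory, optimizer swapped for Adam, so treat it as an approximation) should get under 1% test error within about 10 epochs:

```python
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
cnn.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Assuming the MNIST arrays from the earlier snippet (add a channel axis for the conv layers):
# cnn.fit(x_train[..., None], y_train, epochs=10, validation_data=(x_test[..., None], y_test))
```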

Thanks for sharing the paper.

You may like the thread here. It discusses a toy dataset that’s easily fitted using a tanh activation, as opposed to the conventional choice of a ReLU.
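Roughly the flavour of thing it shows (with a made-up sin(x) toy regression standing in for the thread’s actual dataset, so just an illustration): a smooth tanh hidden layer may find the target easier to fit than a same-sized ReLU one.

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy 1-D regression target; purely illustrative, not the linked thread's dataset.
x = np.linspace(-np.pi, np.pi, 512).reshape(-1, 1)
y = np.sin(x)

for act in ["tanh", "relu"]:
    net = models.Sequential([
        layers.Dense(8, activation=act, input_shape=(1,)),
        layers.Dense(1),
    ])
    net.compile(optimizer="adam", loss="mse")
    net.fit(x, y, epochs=500, verbose=0)
    print(act, "final MSE:", net.evaluate(x, y, verbose=0))
```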

I tried to reproduce the results for MNIST - http://anishshah.github.io/ml/2017/07/17/Gradient-Boosting.html

Cool. Would be very interesting to see your experiments repeated with a more realistic network architecture/problem. Correct me if I’m wrong, but I feel like we shouldn’t really care much about the method’s performance on this small FCN – I don’t think there’s any a priori reason to think these methods would be of any use in a deeper/bigger/more modern network.