I’m currently playing about with a couple of binary classification problems (the output is either disease or no disease) and am finding that one-hot encoding the outputs - e.g. disease [0 1] and no disease [1 0] - gives better results than a single 0/1 label, i.e. no disease (0) vs disease (1).
Do other people find this?
I remember when I first started playing about with NNs (too long ago to count now …) I used to use MATLAB, and that is the approach their examples have always taken.
However, most of the binary classification examples I’ve followed using Keras and PyTorch take the 0/1 approach with a single dense sigmoid neuron. Is there a reason for this that I’ve missed?
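For what it’s worth, the two output heads are mathematically linked: a two-way softmax over logits (z0, z1) produces the same “disease” probability as a sigmoid applied to the logit difference z1 − z0. A minimal dependency-free sketch (the logit values are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z0, z1):
    # Two-class softmax, shifted by the max logit for numerical stability.
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    s = e0 + e1
    return e0 / s, e1 / s

# The softmax probability of class 1 equals sigmoid(z1 - z0).
z0, z1 = 0.3, 1.7
p_no_disease, p_disease = softmax2(z0, z1)
assert abs(p_disease - sigmoid(z1 - z0)) < 1e-12
```

So in exact arithmetic the two parameterisations can represent the same set of probabilities; any practical difference comes from how the extra parameters interact with the optimiser, not from extra expressive power.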
I’ve had similar experiences. Lessons and textbooks say the two are equivalent, but in practice I’ve had many cases where one-hot encoding with a softmax gets better results than a single sigmoid.
My guess is that we are introducing more variables into the system: the one-hot head is like an additional two-neuron layer, whereas with a sigmoid we go straight from the model output to a probability.
Thanks! I’m glad it’s not just me
I’m one-hot encoding the output classes and using categorical_crossentropy as the loss function, and I’m getting both faster convergence and higher accuracy! (at least on the problems I’m currently looking at …)
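In case it helps anyone trying the same setup: the encoding step is just turning each 0/1 label into a two-element row, which is what keras.utils.to_categorical does. A minimal pure-Python sketch of the same transformation:

```python
def one_hot(labels, num_classes=2):
    """Convert integer class labels to one-hot rows,
    mimicking keras.utils.to_categorical."""
    return [[1.0 if i == c else 0.0 for i in range(num_classes)]
            for c in labels]

# no disease (0) -> [1, 0]; disease (1) -> [0, 1]
print(one_hot([0, 1, 1]))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```

With labels in this form, the model’s output layer needs two units with a softmax activation to match.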
I’ve checked over the predictions I’m making on test data and they seem sensible too! I was just curious whether there was a non-obvious (at least to me) reason not to do it.
You could also experiment with LogSoftmax - it sometimes behaves better, as they say, because it has nicer numerical properties. And instead of cross-entropy you could try NLLLoss; in PyTorch, NLLLoss applied to LogSoftmax outputs is equivalent to CrossEntropyLoss applied to the raw logits.
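That equivalence is easy to verify by hand. A dependency-free sketch of the three pieces (the logit values are arbitrary), using the log-sum-exp trick that makes LogSoftmax numerically well behaved:

```python
import math

def log_softmax(logits):
    # log softmax via the log-sum-exp trick: subtract the max logit
    # before exponentiating to avoid overflow.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def nll_loss(log_probs, target):
    # Negative log-likelihood of the true class, as PyTorch's NLLLoss does.
    return -log_probs[target]

def cross_entropy(logits, target):
    # Cross-entropy computed directly from raw logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

logits, target = [0.2, 1.5], 1
assert abs(nll_loss(log_softmax(logits), target)
           - cross_entropy(logits, target)) < 1e-12
```

So switching from CrossEntropyLoss to LogSoftmax + NLLLoss shouldn’t change the loss values themselves; the benefit is mostly about numerical stability and having the log-probabilities available explicitly.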