Hi, I'm using fastai v1 for text classification. I have 2 classes in a roughly balanced dataset (40% vs 60%). At the end of training I have a "satisfying" accuracy of 83% and a validation loss of 0.5676. This leads me to think that my model is working, in the sense that it has learned at least some structure of the dataset.
However, I noticed this odd behaviour while taking a closer look at the outputs of the network: the highest score for the second class is around 0.6:

```python
learn.predict()[0][:, 1].max()  # outputs 0.59
```
In my experience with similar text datasets and fastai v1, there were always examples where the network was very confident and some where it was less so.
The network correctly classifies 83% of the dataset, so entries that get a score between 0.5 and 0.6 in the second column are very likely to actually be in the 2nd class. Given that, it seems strange to me that the network does not assign higher scores to the entries it classifies in the 2nd class.
I tried changing the probabilities by hand: I replaced every probability higher than 0.5 with 0.7, and every probability lower than 0.5 with 0.3 (to simulate the network being more confident). When re-evaluating the validation loss, I got 0.5113. This simple change alone made my model better.
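To make the experiment concrete, here is a minimal sketch of that probability-pushing trick. The `p2` and `targets` values below are made-up toy numbers (not my actual predictions), chosen so that no probability exceeds 0.6, like my network's outputs:

```python
import numpy as np

def nll(p2, targets):
    """Binary cross-entropy, given P(class 2) and 0/1 targets."""
    p2 = np.clip(p2, 1e-7, 1 - 1e-7)
    return -np.mean(targets * np.log(p2) + (1 - targets) * np.log(1 - p2))

# hypothetical predictions that never exceed 0.6
p2 = np.array([0.55, 0.58, 0.45, 0.40, 0.59, 0.52])
targets = np.array([1, 1, 0, 0, 1, 0])

baseline = nll(p2, targets)
# push every prediction to a fixed, more confident value,
# without changing the predicted class
pushed = np.where(p2 > 0.5, 0.7, 0.3)
confident = nll(pushed, targets)
```

As long as accuracy is high enough, `confident` comes out lower than `baseline`: the gain from correct predictions moving toward the target outweighs the penalty on the few mistakes.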
I didn't try to change the model's weights, but I'm pretty confident that if I doubled all the weights (and bias) feeding the final softmax layer, I would also get a lower validation loss: it would double the gap between the two logits, which stretches the sigmoid and makes the network more confident.
This seems like really strange behaviour to me. If the network trained well (which it seems it did, since it reached a satisfying accuracy), I shouldn't be able to use "tricks" like these to get a significant improvement in the loss. Do you have any idea why the network isn't able to be more confident (which would improve its loss)?
Edit: I noticed that in order to simulate doubling the weights (and bias) of the last layer, I can pass all probabilities through:

f(y) = y^2 / (1 + 2(y^2 - y))

Indeed, if x is a real number and sigma is the sigmoid, then sigma(2x) = f(sigma(x)). Doing that yields a loss of 0.5146, which is lower than the initial loss of 0.5676. This confirms my intuition that making the network more confident would improve the loss, and deepens my confusion about why the network can't do it on its own.
Edit 2: More generally, multiplying the weights of the last layer by a is equivalent to applying the following map to all probabilities:

f_a(y) = 1 / (1 + ((1 - y) / y)^a)

Once again, sigma(a*x) = f_a(sigma(x)). Doing that with a = 5, I get an even lower loss of 0.4863. Still confused.
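For completeness, here is a standalone sketch of f_a with a numerical check of the general identity (this map is just logit scaling expressed in probability space; f(y) above is the special case a = 2):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def f_a(y, a):
    # probability-space equivalent of scaling the logit by a
    return 1 / (1 + ((1 - y) / y) ** a)

# sigma(a*x) == f_a(sigma(x)) for any a and x
for a in [0.5, 2.0, 5.0]:
    for x in [-2.0, -0.3, 0.1, 1.5]:
        assert abs(sigmoid(a * x) - f_a(sigmoid(x), a)) < 1e-9

# larger a pushes probabilities above 0.5 further toward 1
assert f_a(0.6, 5) > f_a(0.6, 2) > 0.6
```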