High accuracy but low confidence on Text Classification

Hi, I’m using fastai v1 to do text classification. I have two classes in a roughly balanced dataset (40% vs 60%). At the end of training I have a “satisfying” accuracy of 83% and a val_loss of 0.567601. This leads me to think the model is working, in the sense that it understands at least some part of the dataset.

However, I noticed a weird behaviour when taking a closer look at the outputs of the network: the highest score it ever assigns to the second class is only about 0.59.

In my experience with similar text datasets in fastai v1, there were always examples where the network was very confident and others where it was less so.

The network correctly classifies 83% of the dataset, which means that entries scoring between 0.5 and 0.6 in the second column are very likely to belong to the second class. Given that, it seems weird that the network is unable to assign higher scores to the entries it classifies in the second class.

I tried changing the probabilities by hand: I replaced every probability higher than 0.5 with 0.7 and every probability lower than 0.5 with 0.3 (to simulate the network being more confident). Evaluating val_loss, I got 0.5113. This simple change made my model better.
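To illustrate, here is a minimal sketch of that experiment with synthetic numbers (the probabilities, sample size, and the 83% accuracy are stand-ins, not the actual validation set):

```python
import numpy as np

def binary_log_loss(y_true, p):
    # mean negative log-likelihood of the true class
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# synthetic stand-in for the validation predictions: ~83% of examples
# are classified correctly, but every score sits just above/below 0.5
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
correct = rng.random(1000) < 0.83
margin = rng.uniform(0.01, 0.09, size=1000)
sign = np.where(correct, 1, -1) * np.where(y_true == 1, 1, -1)
p = 0.5 + sign * margin

loss_before = binary_log_loss(y_true, p)

# the "trick": push every probability to a fixed, more confident value
p_conf = np.where(p > 0.5, 0.7, 0.3)
loss_after = binary_log_loss(y_true, p_conf)
# at ~83% accuracy, -0.83*log(0.7) - 0.17*log(0.3) ≈ 0.50, well below
# the near-log(2) loss of the barely-confident predictions
```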

I didn’t try changing the weights of the model, but I’m fairly confident that if I doubled all the weights connecting the last layer to the softmax layer, I would also get a lower val_loss: it would double the gap between the two outputs, which dilates the sigmoid and makes the network more confident.

This seems like really weird behaviour to me. If the network trained well (which it seems it did, given the satisfying accuracy), I shouldn’t be able to use “tricks” like these to significantly improve the score. Do you have any idea why the network isn’t able to be more confident (which would improve its loss)?

Edit: I noticed that in order to simulate doubling the weights (and bias) of the last layer, I can pass every probability through:

f : y -> y^2/(1+2(y^2-y)) 

Indeed, if x is a real number and sigma is the sigmoid, then sigma(2x) = f(sigma(x)). Doing this yields a loss of 0.5146, lower than the initial 0.5676, which confirms my intuition that making the network more confident would improve the loss, and deepens my confusion about why the network can’t do this itself.

Edit 2: More generally, multiplying the weights of the last layer by a is equivalent to applying the following to every probability:

f_a : y -> 1/(1+((1-y)/y)^a)

Once again, sigma(ax) = f_a(sigma(x)). Doing this with a = 5, I get an even lower loss of 0.4863. Still confused :thinking:
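A quick numerical check of both identities (a self-contained sketch; sigmoid and f_a written out explicitly):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_a(y, a):
    # sharpen a probability y as if the pre-sigmoid score were scaled by a
    return 1.0 / (1.0 + ((1.0 - y) / y) ** a)

x = np.linspace(-3.0, 3.0, 13)
y = sigmoid(x)

# general identity: sigma(a*x) == f_a(sigma(x))
assert np.allclose(sigmoid(5.0 * x), f_a(y, 5.0))

# a = 2 recovers the formula from the first edit: y^2 / (1 + 2(y^2 - y))
assert np.allclose(f_a(y, 2.0), y**2 / (1.0 + 2.0 * (y**2 - y)))
```

The identity works because (1 - y) / y = e^(-x) when y = sigmoid(x), so raising that ratio to the power a is exactly the same as scaling x by a before the sigmoid.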


Day 5: Still no idea why my network is not confident. Here is some more info:

  • I used byte tokenization (I have only 256 tokens) and increased the max_length and bptt attributes until I got satisfying results.
  • I also trained a regular word-tokenized model on my dataset. At the moment this other model performs better (around 90% accuracy). I tried combining the two models by averaging their outputs. With my initial byte model (the unconfident one), combining is useless: it lowers my accuracy. However, if I artificially increase its confidence as described in my previous post with a = 5, the combination performs better than either model alone. This is understandable: if the byte model is not confident, it cannot go against the will of the word model (if the word model gives a low score, the byte model can’t push the average above 0.3 = 0.6 / 2). The fact that making the byte model more confident lets the combination work further reinforces my intuition that the byte model actually performs well on this classification task.
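That capping effect is easy to see with made-up numbers (the probabilities below are illustrative, not from the actual models):

```python
import numpy as np

def f_a(y, a):
    # confidence-sharpening transform from the earlier posts
    return 1.0 / (1.0 + ((1.0 - y) / y) ** a)

# hypothetical class-1 probabilities for three validation examples
p_word = np.array([0.10, 0.95, 0.20])  # confident word-level model
p_byte = np.array([0.58, 0.55, 0.59])  # unconfident byte-level model

# plain averaging: since the byte model never exceeds ~0.6, a low word
# score caps the ensemble near 0.3 = 0.6 / 2
p_avg = (p_word + p_byte) / 2

# sharpening the byte model first (a = 5) lets it actually disagree
p_avg_sharp = (p_word + f_a(p_byte, 5)) / 2
```

On the first example, the plain average stays at 0.34 no matter how sure the byte model’s 0.58 is, while the sharpened version can pull the ensemble close to the decision boundary.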

Any insight would be appreciated :slight_smile:


I’ve done some thinking, and I might have an idea. In my byte model I still want to learn “long” dependencies; across my experiments, bptt=420 gave me the best results. Compared to my word model (bptt=70), I noticed I had to lower the learning rate to get the best training. This leads me to the following idea: since the RNN is unrolled 420 times, it can learn even with a low learning rate, but the data only goes through the head of the classifier once, at the end of the read. In that regard, it seems plausible that the learning rate is too small for the head of the classifier, which is therefore unable to update its weights enough to become more confident. The obvious solution would be to use different learning rates for the head of the classifier and for the rest of the network; I might try that in the next few days.
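The mechanism can be sketched with a toy update step (everything here is a placeholder: made-up weights, gradients, and learning rates, just to show per-group learning rates):

```python
# one gradient-descent step with a different learning rate per group:
# a small lr for the recurrent encoder, a larger one for the head
params = {"encoder": 1.0, "head": 1.0}
grads = {"encoder": 0.5, "head": 0.5}
lrs = {"encoder": 1e-4, "head": 1e-2}

for name in params:
    params[name] -= lrs[name] * grads[name]

# with the same gradient, the head moves 100x further per step,
# so it can sharpen its outputs without destabilising the encoder
```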

I don’t know if my interpretation makes sense; I don’t have enough experience with RNNs to be certain. What do you think?


After some testing, I believe my interpretation was correct. If I let the classifier train for more epochs, the accuracy and loss vary only a little, but the confidence of the network goes up over time. After training for three times as many epochs, my highest score is now 0.75 instead of 0.6. To keep training fast, the solution would be to increase the learning rate for the head of the classifier. After reading some of the code, I believe this is possible with slices. I’m not familiar with them yet; if anyone has a working example of using slices to increase the lr of the head and could share it, I would be very grateful.


Old thread but reviving for future learners.

I had the same problem noted here: my model produced high accuracy (~90%) but very low confidence in its predictions (a max of ~20%). Note that in my case I’m doing multi-category classification, so it’s perhaps a bit of a different beast.

The comment @StatisticDean made about using different learning rates on different layers reminded me of something Jeremy did in the 2020 course. Unfreezing my whole model and applying discriminative learning rates let me keep the same high accuracy (~90%) while fixing the low confidence in predictions (a new max of ~85%).

Previous training method (low confidence)

learn.fit_one_cycle(8, 3e-1)

New training method (yields high confidence)

learn.fit_one_cycle(2, slice(1e-2/(2.6**4),1e-2))
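For what it’s worth, the 2.6**4 in that slice follows the ULMFiT heuristic of shrinking the learning rate by a factor of ~2.6 per layer group; fastai spreads slice(lo, hi) geometrically across the groups. A rough sketch of that spread (assuming five layer groups, which may not match every model):

```python
# geometric spread of slice(lo, hi) across layer groups (sketch)
lo, hi = 1e-2 / (2.6 ** 4), 1e-2
n_groups = 5
lrs = [lo * (hi / lo) ** (i / (n_groups - 1)) for i in range(n_groups)]
# consecutive groups differ by a factor of 2.6, and the head (the last
# group) gets the full 1e-2
```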

For more info, see the Basic NLP example in the fastai docs.
