Text classification - High Accuracy, low recall and low precision

I am using fastai to create a text classifier that labels texts as either 0 or 1.

My data (number of 1’s and 0’s) for training is balanced, and I got an accuracy of 85%.

To test, I used a new, unseen corpus of data that is not balanced - to try and mimic a real-world scenario - and produced a confusion matrix.

According to my confusion matrix, my precision and recall are really low, at around 20% and 14% respectively.

What are possible reasons for that? What can I do to improve these metrics?

Did you use your combined corpus (train and test) when you fine-tuned your language model?

It looks like you have a validation set and a test set. On the validation set, your accuracy is 85% - what are the precision and recall there? On the test set, precision and recall are 20%/14% - what is the accuracy there? Are the test and validation sets similar (same distribution, type of data, etc.)?
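For reference, a minimal sketch of how you could compute all three metrics with the same code path on both sets (assuming a fastai text classifier named `learn`; the test-set part depends on how you load your test data, so it's only indicated in comments):

```python
from sklearn.metrics import classification_report

# Validation-set metrics: get_preds() defaults to the validation set
preds, targs = learn.get_preds()
pred_labels = preds.argmax(dim=1)
print(classification_report(targs, pred_labels, digits=3))  # precision, recall, F1, accuracy

# Test-set metrics (hypothetical - adapt to however you built your test data):
# test_dl = learn.dls.test_dl(test_df['text'])
# test_preds, _ = learn.get_preds(dl=test_dl)
# print(classification_report(test_df['label'], test_preds.argmax(dim=1), digits=3))
```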

I suspect you overfit on the training set, so you should gather more training data or add more regularization.

Oh, when I trained the language model on my data, I fine-tuned it from the model pretrained on English Wikipedia. The test set was not included in the training; only the training set was used.

I did that to see how well the model would do on unseen data (my approach to mimicking a deployment scenario). Is that a wrong approach on my end?

I don’t know; it really depends on how close your data is to what the language model was trained on. If your data has vocabulary that the model was not trained on, it will have a lot of xxunk’s in it, which could explain why your results aren’t what you are expecting. I would check out the new fast.ai NLP class. You might want to try using SentencePiece for the encoding; it might help mitigate that problem, if that is even the problem.
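In case it helps, here is a rough sketch using the sentencepiece library directly (file names and the vocab size are placeholders, not from this thread) showing how subword pieces avoid the xxunk problem:

```python
import sentencepiece as spm

# Train a subword model on the raw training texts (one document per line)
spm.SentencePieceTrainer.Train(
    '--input=train_texts.txt --model_prefix=spm --vocab_size=8000')

sp = spm.SentencePieceProcessor()
sp.Load('spm.model')

# Unseen words are split into known subword pieces instead of mapping to xxunk
print(sp.EncodeAsPieces('an out-of-vocabulary word'))
```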

On the validation set, yes, the accuracy is 85%, and the precision and recall are 81% and 87% respectively.

And on the test set the accuracy is 16%.

Regarding the test and validation sets, they do not have the same distribution of 1’s and 0’s. The validation set is pretty balanced in terms of the numbers of 1’s and 0’s, but the lengths of the texts differ.
However, the test set is a random set I created. I did that because I thought that’s how I could mimic a real deployment situation. Is that wrong to do?

I apologise if this sounds silly, I am still learning as I go… but how do I check if I overfit the training set? Also, can you please explain what you mean by regularization?

As you suggested, I’ll re-examine the test data and look at how closely it matches the training data.

I’ve also checked out the NLP course; if I understand correctly, SentencePiece encoding is for agglutinative languages, correct?

Thank you for responding to me, I really appreciate it!

Regularization is covered in Lesson 6 of Practical Deep Learning for Coders. Generally, in order for your model to work in your real deployment situation, your training data needs to be representative of the test data, so that the model learns how to handle it. You might want to collect some data that you expect to see during deployment and use it to fine-tune your model.
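To make that concrete, here is a hedged sketch of where the regularization knobs live in fastai (fastai v2-style imports; the name `dls` and the exact values are illustrative, not from this thread):

```python
from fastai.text.all import *

learn = text_classifier_learner(
    dls, AWD_LSTM,
    drop_mult=1.0,                 # scales all of AWD-LSTM's dropout rates up
    metrics=[accuracy, Precision(), Recall()])

# Weight decay is another form of regularization. While training, compare the
# train and valid losses: train loss falling while valid loss rises is the
# classic sign of overfitting.
learn.fit_one_cycle(4, 2e-3, wd=0.1)
```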

I understand. I’ll look at Lesson 6. Just one last question: should I have used an equal number of 1’s and 0’s when I trained? Or should I have made the data imbalanced?

I really appreciate all your help, thank you

Ditto to what @darek.kleczek said! We are all learning, and I am far from an expert. My thought around using SentencePiece is that it gives your model a better chance to handle an unknown word by using parts of the word. That being said, you might want to take a look at some of the interpretation methods to try to understand why your model isn’t performing as you expect it to. One of my takeaways from the NLP course was that if you overtrain a language model, it won’t generalize as well.
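For the interpretation part, something like this is a quick way to start (assuming your trained classifier is called `learn` and fastai has already been imported):

```python
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()   # where do the 0's and 1's get confused?
interp.most_confused()           # (actual, predicted, count) tuples, worst first
```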

In regards to the class imbalance question, I would say that depends on what is important to you.


I’m not sure if class imbalance is the problem in your case - I’d first focus on finding out whether the actual text you train with is similar to the text you classify during deployment. Can you share a sample of the texts you’re trying to classify, from both your train and test sets?

Using this forum, I would like to ask one question. I am using Keras for a text classification problem.
I have to classify a word or group of words to a code, so the data has two columns: one contains the text (a word or set of words) and the other contains the code for each text. I created a model with an embedding layer and pre-trained weights; the sample contains 75k such text-to-code pairs. With an 80% training / 20% validation split, I get training and validation accuracy of almost 100% (99%). Testing on a separate set of data also gives 99% accuracy through Keras’s model evaluation function. However, on individual description inputs I don’t get the right results - the confusion matrix shows about 26% accuracy.

I would like to know what could be the causes of this problem?

Is it possible to get 99% training and validation accuracy, and 99% testing accuracy on separate data, but get different results on individual inputs?
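Not knowing your exact setup, here is a minimal Keras sketch of what you describe (an embedding layer initialized with pre-trained weights feeding a classifier over codes); every name, size and the tokenizer are assumptions. The thing worth double-checking is that the exact same tokenizer/preprocessing fitted on the training data is reused for every individual prediction:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, embed_dim, max_len, n_codes = 20000, 300, 20, 500   # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))            # stand-in for pre-trained vectors

model = tf.keras.Sequential([
    layers.Embedding(vocab_size, embed_dim,
                     embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.GlobalAveragePooling1D(),
    layers.Dense(n_codes, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# At prediction time, reuse the *fitted* tokenizer and the same padding, e.g.:
# seq = tokenizer.texts_to_sequences(['some description'])
# x = tf.keras.preprocessing.sequence.pad_sequences(seq, maxlen=max_len)
# print(model.predict(x).argmax(axis=-1))
```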