How to handle OOV words

I would like to know how to handle out-of-vocabulary words in text classification.
For example, my model has a ‘thanks’ intent, so text like ‘thanks’, ‘thanks a lot’, etc. gets classified correctly, but something like ‘thankyou’ ends up in the wrong intent.
It seems to be an OOV word issue.
I would like to know how other people are handling such issues.

I would like to rephrase this question. If there are OOV words in my input text for classification, a UNK token is assigned. Is this UNK token then treated like any other in-vocabulary word, or does it get special treatment during classification? Judging by the heatmaps from the interpreter, it looks like the model is giving UNK attention, but this seems counter-productive. Does anyone have thoughts on where to look to better understand this issue and its impact on the output classification and confidence?

Is it possible to change the tokenizer and perhaps use wordpieces instead? :slight_smile:
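
For illustration, here is a toy sketch of the greedy longest-match idea behind wordpiece tokenization (the vocabulary here is made up, just to show the mechanism). BERT-style tokenizers do the same thing with a learned vocabulary of tens of thousands of pieces, so a word like ‘thankyou’ splits into known pieces instead of collapsing to UNK:

```python
# Toy greedy longest-match-first wordpiece tokenizer with a made-up vocabulary.
# Real implementations work the same way, just with a learned vocabulary.
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces are marked with ##
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched at all: the whole word becomes UNK
        pieces.append(cur)
        start = end
    return pieces

toy_vocab = {"thank", "thanks", "##you", "##s", "a", "lot"}
print(wordpiece_tokenize("thankyou", toy_vocab))  # ['thank', '##you']
print(wordpiece_tokenize("thanks", toy_vocab))    # ['thanks']
```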

Hi, thank you for this suggestion. And yes, I think that will lead to fewer UNK tokens. But my question is, given that we will always have some UNK tokens, how are they treated and how should they be treated in classification?

To the best of my knowledge, they are represented like any other token.
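
A minimal sketch of what that means in practice (the vocabulary and dimensions here are made up): every OOV word collapses to the single UNK index and is looked up in the embedding table exactly like an in-vocabulary token, so the classifier just sees one shared, learned UNK vector.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is the shared UNK slot.
vocab = {"<UNK>": 0, "thanks": 1, "a": 2, "lot": 3}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))  # learned jointly with the classifier in a real model

def encode(text):
    # Any word not in the vocabulary gets index 0, whatever it was.
    return [vocab.get(tok, vocab["<UNK>"]) for tok in text.lower().split()]

ids = encode("thankyou a lot")
print(ids)                    # [0, 2, 3] -- 'thankyou' collapsed to the UNK index
print(embeddings[ids].shape)  # (3, 8): UNK contributes a normal embedding row
```

So the model does attend to UNK like any other token; it just cannot distinguish one unknown word from another.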

One piece of advice for keeping down the number of UNKs is to train your own language model on your own corpus, instead of using the one trained on Wikipedia.
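
As a rough sketch of how that might look, assuming the HuggingFace tokenizers package and a domain corpus saved locally as corpus.txt (both are my assumptions, not anything specific to your setup): training the wordpiece vocabulary on your own text keeps domain words in-vocabulary, so far fewer tokens fall back to UNK than with a vocabulary learned from Wikipedia.

```python
from tokenizers import BertWordPieceTokenizer

# Train a wordpiece vocabulary on your own corpus instead of reusing one
# learned from Wikipedia; corpus.txt and the vocab size are placeholders.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["corpus.txt"], vocab_size=8000, min_frequency=2)

print(tokenizer.encode("thankyou, doc").tokens)  # domain words should now split into known pieces
tokenizer.save_model(".")  # writes vocab.txt so the NLU pipeline can reuse it
```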

Thanks. From searching around the forum, it seems that the general approach to UNKs is to reduce perplexity by improving the language model, which is an excellent approach. However, I think there is another issue here, and it relates to confirmation bias in classifiers. When we encounter words we don’t understand (which will happen), how should that affect the output of the model?
I think there are two issues here:

  1. As the models get better, there will be fewer UNKs in training, and so less impact of UNKs on the confidence of the classification. The overall model will have great perplexity and accuracy metrics, which will lead us to discount the impact of new UNKs.
  2. If all UNK tokens in the training set are treated similarly, we are lumping together many different concepts or words. This will tend to classify minorities similarly, even if there is no similarity other than “I don’t understand you”.
    I am worried this will lead to confirmation bias when there are people or perspectives that we haven’t trained for. For example, if we don’t understand a particular patient dialect, we might discount their most important symptom and misclassify them (and be completely unaware of the misclassification).
    This might be more of an application issue, but I wondered how people are thinking about this in their models. How should we be handling this to avoid feedback loops and bias? Should UNK be handled differently from other tokens? Are there mechanisms for us to highlight the “UNK”, so we give minorities a voice? Is this an active learning issue? I would love to hear what people are thinking.
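
To make that last point concrete, here is one rough sketch of the kind of mechanism I have in mind (nothing that exists in any pipeline I know of; the threshold and review queue are placeholders): track the share of UNK tokens per message and route high-UNK inputs to a human review / active learning queue instead of trusting the classifier’s confidence.

```python
UNK_ID = 0
REVIEW_THRESHOLD = 0.3  # placeholder: flag messages where >30% of tokens are unknown

def needs_review(token_ids, threshold=REVIEW_THRESHOLD):
    """Return True when too much of the message is made up of UNK tokens."""
    if not token_ids:
        return True
    unk_ratio = sum(1 for t in token_ids if t == UNK_ID) / len(token_ids)
    return unk_ratio > threshold

review_queue = []  # surfaced to a human annotator / active learning loop

def classify_or_flag(text, token_ids, classify):
    if needs_review(token_ids):
        review_queue.append(text)  # don't silently emit a (possibly biased) intent
        return None
    return classify(token_ids)
```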