I have a binary text classification task for English web articles, and the data is highly imbalanced. What should my approach be in this case?
Should I go with Transformers or with the ULMFiT approach, given that article length varies widely? Some articles have only 10-30 words while others run to 1000+ words.
What is the best approach to validating performance on such a task? Should my training dataset also be representative of the real-world class distribution?
E.g., I have 2000 articles of Class A and 100 articles of Class B
- how should I design the training dataset (what should the percentage representation of the two classes be)?
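To make the example above concrete, here is a minimal sketch of one common answer: split per class so the validation set preserves the real 2000:100 ratio, while the training set can be rebalanced separately (e.g. by oversampling Class B). The article texts, labels, and the 20% holdout fraction are illustrative assumptions, not part of the question.

```python
import random

random.seed(42)

# Illustrative data: Class A = label 0 (2000 articles), Class B = label 1 (100).
class_a = [("article text", 0) for _ in range(2000)]
class_b = [("article text", 1) for _ in range(100)]

def holdout_split(items, val_frac=0.2):
    """Hold out val_frac of a single class's items for validation."""
    items = items[:]
    random.shuffle(items)
    cut = int(len(items) * val_frac)
    return items[cut:], items[:cut]

# Splitting each class separately is a simple stratified split.
train_a, val_a = holdout_split(class_a)
train_b, val_b = holdout_split(class_b)

train = train_a + train_b  # 1600 A + 80 B: free to oversample/reweight B here
val = val_a + val_b        # 400 A + 20 B: keeps the real-world 20:1 ratio
```

The key point the sketch encodes: rebalancing (if any) is applied only to `train`, never to `val`, so validation metrics reflect the distribution the model will face in production.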
Are there any specific loss functions and metrics I should consider, since accuracy clearly won't work in this case? (I can go for precision and recall, but are there any specific measures used in NLP, particularly for such imbalanced cases?)
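As a hand-rolled sketch of why accuracy fails here and what precision/recall/F1 report instead: a majority-class baseline on the 2000:100 split scores about 95% accuracy yet never finds a single Class B article. The predictions below are made up for illustration; in practice these metrics come from a library such as scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A "classifier" that always predicts the majority class (Class A = 0).
y_true = [0] * 2000 + [1] * 100
y_pred = [0] * 2100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # ~0.95
p, r, f = precision_recall_f1(y_true, y_pred)  # all 0.0 for Class B
```

On the loss side, the commonly suggested counterparts are class-weighted cross-entropy (weighting the minority class by roughly the inverse class frequency) or focal loss; both are standard options rather than anything specific to this question.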