ULMFIT - Punjabi

(Gaurav) #1

Starting this thread to share the progress on the Punjabi LM and classification results @piotr.czapla @Moody


  • Download the Wikipedia articles dataset (44,000 articles), which I scraped, cleaned, and trained the model on, from here
  • Check out the BBC Punjabi News dataset, which I scraped, cleaned, and trained the model on, from here


Perplexity of the language model: ~13 (on a 20% validation set)
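For anyone reproducing this: perplexity is just the exponential of the mean per-token cross-entropy loss on the validation set. A minimal sketch (the loss value below is hypothetical, chosen so it maps to the ~13 reported above):

```python
import math

def perplexity(avg_nll: float) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)
    measured on held-out text."""
    return math.exp(avg_nll)

# Hypothetical validation loss; ~2.565 nats/token corresponds
# to a perplexity of ~13.
val_loss = 2.565
print(round(perplexity(val_loss), 1))  # → 13.0
```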

Kappa score of the classification model: ~49

Pretrained Language Model

Download the pretrained language model from here


Download the classifier from here


Unsupervised training using Google’s sentencepiece

Download the trained model and vocabulary from here

Language Model Zoo :gorilla:
(Piotr Czapla) #2

Hi Gaurav, good work! What accuracy did you get? I want to compare the results to the LASER results on MLDoc.

(Gaurav) #3

Accuracy would have been the wrong metric for the above dataset, as it is highly imbalanced, with

114 positive examples
670 negative examples

Hence, I calculated the Kappa score (~49) and didn’t calculate accuracy.
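The thread doesn’t show how the Kappa score was computed, but for a binary classifier it’s Cohen’s kappa from the confusion matrix: (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch; the confusion-matrix counts below are made up, only the 114/670 class totals come from the post:

```python
def cohens_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa from binary confusion-matrix counts
    (tp/fn: true positives/negatives misclassified, etc.)."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal label/prediction frequencies.
    p_chance = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical counts consistent with 114 positive / 670 negative examples.
kappa = cohens_kappa(tp=70, fp=60, fn=44, tn=610)
print(round(kappa, 2))  # → 0.5
```

Note why kappa is the right choice here: a degenerate classifier that predicts "negative" for everything scores ~85% accuracy on this split but exactly 0 kappa, since it agrees with the labels no better than chance.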

(Piotr Czapla) #4

I see. Although we use accuracy for our evaluations, maybe you can cut out a balanced test data set?
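Cutting out a balanced test set amounts to undersampling the majority class. A small sketch under that assumption (the function name and the example lists are hypothetical; only the 114/670 counts come from the thread):

```python
import random

def balanced_subset(pos, neg, seed=42):
    """Undersample the larger class so both classes have equal size,
    making plain accuracy a meaningful metric on the result."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    k = min(len(pos), len(neg))
    return rng.sample(pos, k), rng.sample(neg, k)

# Hypothetical example mirroring the 114/670 split above.
pos = [f"pos_{i}" for i in range(114)]
neg = [f"neg_{i}" for i in range(670)]
bal_pos, bal_neg = balanced_subset(pos, neg)
print(len(bal_pos), len(bal_neg))  # → 114 114
```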

Do you have any similar corpus with sentiment-bearing sentences? It does not have to have labels; tweets would be fine, or product reviews/comments.
If so, you could fine-tune the LM on that data set, and you should get much better results. It would be interesting to see how much you can improve.

(Gaurav) #5

Yes, sure. I’ll do this and report back.

Unfortunately, no. :frowning: But I’ll check again to see whether I can get/scrape a balanced/better dataset from somewhere!