I’m looking into building a custom multi-class text classifier. There are at most 300 classes and close to 8,000 samples. Each sample contains roughly 25-100 words and is written mostly in abbreviations; think nurses taking notes in shorthand. The classes are highly imbalanced.
Without mucking about with the data, I wanted to see how well a pretrained BERT model from HF would perform. It achieved 58% accuracy. Not bad, I suppose, but I suspect it got that score mainly by learning to classify the dominant classes well enough while ignoring the rare ones.
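To verify that suspicion, I'd compare accuracy against macro-averaged and per-class metrics, since macro F1 weights every class equally and drops sharply when rare classes are ignored. A minimal sketch; `y_true` and `y_pred` are placeholders for the gold and predicted class IDs on a held-out set:

```python
from collections import Counter

from sklearn.metrics import accuracy_score, classification_report, f1_score

def imbalance_check(y_true, y_pred):
    # Eyeball the skew: how much do the top classes dominate?
    print("most common classes:", Counter(y_true).most_common(10))
    print("accuracy:", accuracy_score(y_true, y_pred))
    # Macro F1 treats all ~300 classes equally; a large gap between
    # accuracy and macro F1 suggests the model only learned the head classes.
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))
    # Per-class precision/recall shows exactly which classes are missed.
    print(classification_report(y_true, y_pred, zero_division=0))
```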
Because my data is so different from the corpus the pretrained model was trained on, should I build a custom tokenizer for my specific dataset, pretrain a custom language model on top of it, and then fine-tune that for text classification?
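One cheap diagnostic before committing to that: check how badly the stock tokenizer fragments the abbreviations versus a WordPiece vocab trained on my own notes. A rough sketch, assuming the raw notes sit in a plain-text file `notes.txt` (hypothetical path, one note per line) and using a made-up sample string:

```python
from tokenizers import BertWordPieceTokenizer
from transformers import AutoTokenizer

sample = "pt c/o SOB, hx HTN, BP 150/90"  # made-up shorthand note

# How the general-domain vocabulary splits clinical shorthand:
# likely into many short, meaningless subword fragments.
stock = AutoTokenizer.from_pretrained("bert-base-uncased")
print(stock.tokenize(sample))

# Train a small WordPiece vocab on the domain corpus instead.
custom = BertWordPieceTokenizer(lowercase=True)
custom.train(
    files=["notes.txt"],   # placeholder: your raw notes file
    vocab_size=10_000,     # small corpus, so keep the vocab small
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
print(custom.encode(sample).tokens)
custom.save_model("domain_tokenizer")  # writes vocab.txt for reuse
```

If the custom vocab keeps the abbreviations intact where the stock one shreds them, that would at least support the "my data is too different" hunch before I sink time into pretraining.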