Building a Multiclass Text Classifier on Industry-Specific Data

I’m looking into building a custom multiclass text classifier. There are at most 300 classes and close to 8,000 samples. Each sample contains roughly 25-100 words and is written mostly in abbreviations; think of nurses taking notes in shorthand. The classes are highly imbalanced.

Without mucking about with the data, I wanted to see how well a pretrained BERT model from HF would perform. It achieved 58% accuracy. Not bad, I suppose, but I suspect it got that score by just learning to classify the dominant classes well enough.
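An easy way to test that suspicion is to compare overall accuracy against per-class metrics. A minimal sketch, assuming scikit-learn and that you already have integer label arrays for a held-out set (`y_true`, `y_pred` are placeholder names):

```python
from sklearn.metrics import classification_report, f1_score

def per_class_report(y_true, y_pred):
    """Print per-class precision/recall/F1 plus macro-F1.

    Macro-F1 weights every class equally, so a large gap between
    overall accuracy and macro-F1 is the usual signature of a model
    that only gets the dominant classes right.
    """
    print(classification_report(y_true, y_pred, zero_division=0))
    print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```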

Because my data is so different from the corpus the pretrained model was trained on, should I build a custom tokenizer for my specific dataset, pretrain a custom language model with it, and use that for text classification?
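Before committing to a custom vocab, it may be worth measuring how badly the stock tokenizer actually fragments the abbreviations. A rough diagnostic sketch, assuming the transformers library; the checkpoint name is just an example:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def fragmentation(texts):
    """Average number of subword pieces per whitespace-separated word.

    Close to 1.0 means the vocab covers the jargon reasonably well;
    2+ means most abbreviations are being split into meaningless pieces.
    """
    words = sum(len(t.split()) for t in texts)
    pieces = sum(len(tok.tokenize(t)) for t in texts)
    return pieces / max(words, 1)
```

If the ratio is high, `train_new_from_iterator` on a fast tokenizer can build a domain vocabulary, but a new vocab means the pretrained embeddings no longer line up with the token ids, which pushes you back toward pretraining from scratch, so check this number before deciding.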

One of Jeremy’s LM tricks was to do next-word pretraining on your own text before training the classifier; the BERT analogue is sketched below.

For the imbalance, you can weight your classes and/or oversample the under-represented ones to balance things out (second sketch below).

On the tokenizer question: if you tokenize and then detokenize, is your original text preserved? It should be, but it’s worth checking (last sketch). If your text still resembles normal language, it is highly likely that starting from a general pretrained model will significantly outperform training from scratch on a dataset this small.
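Jeremy’s ULMFiT recipe did next-word prediction with an LSTM; for a BERT checkpoint the closest analogue is continuing masked-LM pretraining on your own notes before fine-tuning the classifier head. A minimal sketch with the transformers `Trainer`, assuming the transformers and datasets libraries; the checkpoint name, output dir, and `train_texts` argument are placeholders:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def domain_pretrain(train_texts, base="bert-base-uncased", out_dir="mlm-notes"):
    """Continue masked-LM pretraining on raw in-domain text."""
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)
    ds = Dataset.from_dict({"text": train_texts}).map(
        lambda b: tok(b["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"])
    Trainer(
        model=model,
        args=TrainingArguments(output_dir=out_dir, num_train_epochs=3),
        train_dataset=ds,
        # 15% masking, the standard BERT setting
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    ).train()
    model.save_pretrained(out_dir)
    tok.save_pretrained(out_dir)
```

Afterwards you load `out_dir` with `AutoModelForSequenceClassification`; the encoder weights transfer and only the classification head starts fresh.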
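Here is a sketch of the two usual rebalancing levers in PyTorch; `labels` is a placeholder for your list of integer class ids:

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

def rebalance(labels, num_classes):
    counts = Counter(labels)
    freq = torch.tensor([counts.get(c, 0) for c in range(num_classes)],
                        dtype=torch.float)

    # (a) class-weighted loss: mistakes on rare classes cost more
    class_w = freq.sum() / (num_classes * freq.clamp(min=1))
    loss_fn = torch.nn.CrossEntropyLoss(weight=class_w)

    # (b) oversampling: rare-class rows are drawn more often each epoch
    sampler = WeightedRandomSampler(
        weights=[1.0 / counts[y] for y in labels],
        num_samples=len(labels), replacement=True)
    return loss_fn, sampler
```

Pick one or the other; stacking a weighted loss on top of an oversampler tends to over-correct toward the rare classes.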
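And a quick round-trip check. With `bert-base-uncased` the decode is only equal up to lowercasing, accent stripping, and whitespace cleanup, so treat mismatches as things to inspect; the real information loss is `[UNK]`, which erases the original characters entirely. Assumes transformers; `sample` is a placeholder:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

def round_trips(sample: str) -> bool:
    """Rough check: does detokenizing reproduce the (normalized) input?"""
    ids = tok(sample, add_special_tokens=False)["input_ids"]
    if tok.unk_token_id in ids:
        return False  # an [UNK] means the original text is unrecoverable
    # uncased model lowercases, so compare against the lowered input
    return tok.decode(ids) == sample.lower()
```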