Help with a single-label text classification task

Hi all,

I’ve been through the course lectures and thought I’d try some of the techniques on my own data. The problem is predicting one of ~2500 labels from a short snippet of medical text. I have a few million examples, but the dataset is heavily imbalanced. The texts range from 4 characters to roughly 10-20 words (500 characters max), and there is quite a bit of misspelling.

I’ve been using the ULMFiT architecture and have also tried the Transformer architecture with pre-trained weights. Both give okay results: ~50% accuracy and ~80% top-5 accuracy. Looking at the predictions, I don’t think the task should be too difficult to get a decent score on. I’ve been following what looks to be the standard fine-tuning procedure: fitting the language model, then gradually unfreezing layers and training them.
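
For anyone comparing numbers: by top-5 accuracy I mean the fraction of examples whose true label is among the five highest-scoring classes. Below is a minimal sketch of that metric in plain Python. `top_k_accuracy` is a hypothetical helper written for illustration (not a library function), and it assumes the model's outputs are available as one list of per-class scores per example.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    targets: Sequence[int],
    k: int = 5,
) -> float:
    """Fraction of examples whose true label is among the k highest scores.

    scores:  one row of per-class scores (logits or probabilities) per example.
    targets: the true class index for each example.
    """
    if len(scores) != len(targets):
        raise ValueError("scores and targets must have the same length")
    if len(targets) == 0:
        raise ValueError("need at least one example")

    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted from highest to lowest score, truncated to k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)


# Toy example with 3 classes (illustrative values only).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
toy_targets = [1, 1, 0]
top1 = top_k_accuracy(toy_scores, toy_targets, k=1)
top2 = top_k_accuracy(toy_scores, toy_targets, k=2)
```

With k=1 this reduces to ordinary accuracy, so the same function covers both figures quoted above.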

Any suggestions on how to push this to better accuracy?

Here are a few things I’ve been thinking about:

  • Starting with no pre-trained weights. It might overfit to the text a little, but my text is quite different from Wikipedia text, for example.
  • Using a completely different architecture, like a character-level transformer. My texts have a lot of misspellings and I’m unclear whether ULMFiT and the Transformer architecture handle these well. I know they have a character-level component for breaking down tokens… but I haven’t seen much written about this topic.
  • Training on a balanced version of the dataset by dropping many examples from the over-represented classes. I haven’t seen much written about this either, but it seems like it could encourage some differentiation between the classes.
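
To make the last idea concrete, here is a rough sketch of the kind of downsampling I have in mind. It caps each label at a fixed number of examples rather than forcing every class to the same size, which is an assumption on my part: with ~2500 labels, the rarest classes would otherwise dictate the size of the whole dataset. `downsample_per_class`, `max_per_class`, and the `(text, label)` tuple format are all hypothetical names chosen for illustration.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Tuple

Example = Tuple[str, Hashable]  # (text, label)


def downsample_per_class(
    examples: List[Example],
    max_per_class: int,
    seed: int = 0,
) -> List[Example]:
    """Keep at most max_per_class randomly chosen examples for each label."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the examples by label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    balanced = []
    for items in by_label.values():
        if len(items) > max_per_class:
            # Over-represented class: keep a random subset.
            items = rng.sample(items, max_per_class)
        balanced.extend(items)

    rng.shuffle(balanced)  # avoid leaving the data grouped by label
    return balanced


# Toy data: label 0 is heavily over-represented (illustrative values only).
toy = [("a", 0)] * 10 + [("b", 1)] * 3 + [("c", 2)]
subset = downsample_per_class(toy, max_per_class=3)
label_counts = Counter(label for _, label in subset)
```

Rare labels are left untouched here, so the result is less skewed but not perfectly balanced. The validation set would presumably need to keep the original distribution so the accuracy numbers stay comparable.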

Any recommendations or suggestions are welcome! Thanks in advance for the help.