Beginner 'recipe' for working through a text classifier

Hello everyone - I finished the part 1 course and some parts of the part 2 course as they relate to text data. The project I’m working on requires me to be able to classify medical documents. There are 3 classes, and each document has only one classification. I am using a fine-tuned language model as a backbone for the classifier head.

After initial processing of the data into CSVs, I’ve used fastai v1 language_model to fine-tune a language model from the pretrained WikiText-103 weights on a publicly available set of de-identified discharge summaries. I found a learning rate through lr_find, and trained it for a couple of epochs.

I then took some time to label a small proportion of the documents. I created an RNN classifier, also with fastai v1 (through classifier), and was able to get it working. Unfortunately, my accuracy hovered around 50%. I’m sure a large part of this is because I have labeled only a small portion of the dataset (80 examples for the ‘negative’ class, 20 examples each for the two ‘positive’ classes).
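As a rough sanity check on those label counts, here is a minimal sketch that computes two naive baselines. The class names are placeholders, and it assumes accuracy is measured on data with the same class mix as the labeled set. Always predicting the majority class would score about 67% under that assumption, so 50% is below the trivial baseline.

```python
from collections import Counter

# Label counts quoted in the post: 80 'negative', 20 each for two 'positive' classes.
# The class names are placeholders, not the real labels.
label_counts = Counter({"negative": 80, "positive_a": 20, "positive_b": 20})

total = sum(label_counts.values())
majority_class, majority_count = label_counts.most_common(1)[0]

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = majority_count / total

# Expected accuracy when guessing uniformly at random among the classes.
uniform_baseline = 1 / len(label_counts)

print(f"majority class: {majority_class}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
print(f"uniform-guess baseline accuracy: {uniform_baseline:.3f}")
```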

I spent some time reading through the ULMFiT paper and working through the docs, but I feel a little overwhelmed at the moment. I would love some pointers to get me headed in the right direction. My most pressing questions are the following:

  • Is there a systematic way that I should approach adjusting the hyperparameters? If so, are there any articles or papers that I could read that would help me?

  • For those more well-versed in NLP, do you have a basic algorithm or recipe for tweaking your classifier models?

  • As it pertains to ULMFiT in particular, how can I know if I’m overfitting the language model? I’m not sure if I can really compare train-valid loss in this case, because the real utility of the language model for my purposes is a downstream task.

I hope it doesn’t sound like I’m trying to have anyone solve my problems for me! I feel like there is a lot I can learn through this project and I just want to start to get a feel for a systematic approach to solving it.

Thanks!

When you are working on a real project, you should probably not start with something as complex as ULMFiT.
I have been on a similar journey, and below are some important things:

  1. Collect data and label as much of it as you can.
  2. Try some simple approaches first, like:
    i) bag of words + naive Bayes, random forests, or a simple multilayer perceptron classifier
    ii) tf-idf + naive Bayes, random forests, or a simple multilayer perceptron classifier
    iii) pretrained word embeddings (GloVe, word2vec) + naive Bayes, random forests, or a simple multilayer perceptron classifier
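
To make the first of these concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier using only the Python standard library. The `tokenize` helper, the `NaiveBayes` class, the toy sentences and the label names are all invented for illustration; in practice a library such as scikit-learn offers tested implementations of the same idea.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus sum of smoothed log likelihoods.
            score = math.log(doc_count / n_docs)
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data, invented for illustration only.
train_texts = [
    "patient discharged home in stable condition",
    "no acute findings patient stable",
    "chest pain with elevated troponin",
    "acute chest pain and shortness of breath",
    "fracture of the left femur after fall",
    "fall at home with hip fracture",
]
train_labels = ["negative", "negative", "cardiac", "cardiac", "fracture", "fracture"]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("patient reports chest pain")
print(prediction)
```

With real data, the same fit/predict split would be run on a held-out set so the result can be compared against the more complex models.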

Please spend a good amount of time on getting and labeling data. If you are having trouble, find similar documents on the web and use them as data.

Try to evaluate your mileage with the three approaches above. If you are unsatisfied, you can go ahead and try simple LSTM models or more advanced things like ULMFiT.

If we find similar documents on the web to use as data, as you suggested, how should we convert them to a CSV file for ULMFiT to read?
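
One possible approach, as a minimal sketch: collect the documents as (label, text) pairs and let Python's csv module handle the quoting, since clinical text tends to contain commas, quotes and newlines. The sample documents, the `write_labeled_csv` helper and the header names are hypothetical. The column order (label first, text second) is an assumption about what the fastai v1 CSV loaders expect by default, so check the docs for your installed version.

```python
import csv
import io

# Hypothetical labeled documents; real ones would be read from files or scraped text.
documents = [
    ("negative", "Patient discharged home in stable condition.\nFollow up in two weeks."),
    ("positive_a", 'Note mentions "chest pain", with elevated troponin.'),
]


def write_labeled_csv(rows, handle):
    """Write (label, text) rows as CSV; the csv module quotes commas, quotes and newlines."""
    writer = csv.writer(handle)
    writer.writerow(["label", "text"])  # header names are an assumption
    for label, text in rows:
        writer.writerow([label, text])


# Write to an in-memory buffer here; a real file works the same way.
buffer = io.StringIO()
write_labeled_csv(documents, buffer)
csv_text = buffer.getvalue()

# Round-trip check: reading the CSV back recovers the original text unchanged.
rows_back = list(csv.reader(io.StringIO(csv_text)))
print(rows_back[0])
print(len(rows_back) - 1, "documents written")
```

To write an actual file, open it with `open("train.csv", "w", newline="", encoding="utf-8")` (the filename is a placeholder) and pass the handle to `write_labeled_csv`.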