Small Data Classification with ULMFiT

I am using ULMFiT to do an empirical study with a very small amount of data. I have a total of 1,500 data points and about 500 labeled examples (350 train + 150 validation) for binary classification.
Here is what I am doing:

  1. I first fine-tuned the pre-trained language model on my data.
  2. Then created the classifier with
     learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5, metrics=[accuracy, error_rate])
     learn.load_encoder('fine_tuned_enc')
  3. Used learn.fit_one_cycle to train the model for the domain-specific task. I did this with learn.freeze_to(-2) and learn.freeze_to(-3), then unfroze everything and trained again.
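For anyone following along, here is a toy plain-Python sketch (not the fastai implementation) of what `freeze_to(n)` means in the steps above: the model is split into layer groups, and only the groups from index `n` onward are left trainable. The group names below are illustrative, not fastai's actual internals.

```python
def freeze_to(groups, n):
    """Mark groups[n:] trainable and groups[:n] frozen (negative n counts
    from the end, like Python slicing and fastai's learn.freeze_to)."""
    cut = n if n >= 0 else len(groups) + n
    return {g: i >= cut for i, g in enumerate(groups)}

# Illustrative layer groups: embeddings, LSTM stack, classification head.
groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "head"]

print(freeze_to(groups, -2))  # only lstm_3 and head train
print(freeze_to(groups, -3))  # lstm_2, lstm_3 and head train
print(freeze_to(groups, 0))   # everything trains (full unfreeze)
```

So `freeze_to(-2)`, `freeze_to(-3)`, then full unfreezing progressively widens the set of layers being updated.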

Here are my issues:

  1. I don’t quite understand why I am getting different results every time. I ran the experiment 5 times and got accuracies ranging from 62% to 68%. The absolute accuracy is not a problem for me given the data size, but the results vary on every run. I have seen it reach 72% at times.
  2. I also noticed that learn.freeze_to(-2) and learn.freeze_to(-3) don’t make much difference in this case. But I do sometimes see the accuracy drop while training at learn.freeze_to(-2) and learn.freeze_to(-3). Does gradual freezing/unfreezing also depend on the data size?
  3. I have to train for more epochs to get the best model, as shown in the Colab.

I am quite new to deep learning. I think it may be the data size, but I am not sure. The model still outperforms an SVM by almost 6%, yet I can’t figure out why I get different learning rates, results, and accuracy every time. I thought it was the learning rate, but even that changes between runs, and I have been choosing a rate quite a bit higher than the minimum of the loss curve. Any help would be appreciated.

Here’s my link to the Colab:

Hey Sean,

Thank you for the detailed description. I think you are doing everything right.

Regarding your questions:

  1. I know this behavior from my own work with small datasets. I think the jumps in accuracy are most likely related to your small dataset size. I don’t know exactly what the reason is, but I can imagine that since the model has less data to learn from, it has less chance to pick up the general trends in the data that generalize well to new examples. Also, remember that the weights in your classification head are initialized randomly. So given small data and random initial values, your model could end up learning slightly different patterns on each run.
  2. While the original ULMFiT paper showed that gradual unfreezing helps for transfer learning in NLP, more recent research shows (for Transformers) that full unfreezing generally gives the best results (see the T5 paper). I always try both gradual unfreezing (head only, -2, -3, then the full model) and “standard” unfreezing (head only, then the full model) and see what works better. I’m not sure if/how gradual unfreezing is related to dataset size.
  3. That’s OK. There is no universally best number of epochs. Just keep training until your validation accuracy starts getting worse.
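That stopping rule from point 3 can be sketched in plain Python (a generic illustration, not fastai's API; `val_accs` is a hypothetical list of per-epoch validation accuracies):

```python
def best_epoch(val_accs, patience=2):
    """Track the best validation accuracy and stop once it has failed
    to improve for `patience` consecutive epochs (early stopping)."""
    best_acc, best_ep, bad = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accs):
        if acc > best_acc:
            best_acc, best_ep, bad = acc, epoch, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_ep, best_acc

# Accuracy rises, peaks at epoch 3, then degrades: training stops shortly after.
print(best_epoch([0.60, 0.63, 0.66, 0.68, 0.67, 0.65, 0.64]))  # (3, 0.68)
```

fastai also ships callbacks for this kind of monitoring, so in practice you would rely on those rather than hand-rolling it.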

Since the learning rate finder is itself the result of training (with increasing learning rates), its output can change from run to run as well. Anyway, try picking a learning rate and re-running your model with that same rate to see how your results vary across runs.

Thanks @stefan-ai for taking the time to answer this; your answers make total sense and further confirm some of my hypotheses. I did some more research, and luckily yesterday I found out that if you set a seed value before starting the learning process, you get consistent results. I used the code from the reproducibility section in the fastai docs, and just like that my results became consistent. I am getting an accuracy of 68%, which beats Linear SVC, Multinomial Naive Bayes, and logistic regression (all on TF-IDF features) for the dataset I am using. I hope this helps someone 🙂


Right, that makes sense. If you set a random seed in PyTorch, the model weights should be initialized to the same values each time. So with that, do you get exactly the same result on each run?

Nice benchmarking of ULMFiT against more traditional NLP methods. Another model you could try is using Naive Bayes features as inputs to a logistic regression model. Before ULMFiT, Jeremy demonstrated this method in a Kaggle kernel.
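The core of that NB-features trick (as in NBSVM-style models) is computing a per-term log-count ratio between the two classes and scaling the document-term matrix by it before fitting a logistic regression. A minimal pure-Python sketch of the ratio computation, with my own variable names, assuming binary term vectors:

```python
from math import log

def nb_log_count_ratio(X, y, alpha=1.0):
    """X: list of binary term-presence vectors; y: 0/1 labels.
    Returns per-term log-count ratios (the Naive Bayes feature weights)."""
    n_terms = len(X[0])
    p = [alpha] * n_terms  # smoothed term counts in class 1
    q = [alpha] * n_terms  # smoothed term counts in class 0
    for row, label in zip(X, y):
        target = p if label == 1 else q
        for j, v in enumerate(row):
            target[j] += v
    ps, qs = sum(p), sum(q)
    return [log((pj / ps) / (qj / qs)) for pj, qj in zip(p, q)]

# Tiny example: term 0 appears only in class 1, term 1 only in class 0.
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]
r = nb_log_count_ratio(X, y)
print(r[0] > 0, r[1] < 0)  # True True: class-1 terms get positive weight
```

Each document's term vector is then multiplied elementwise by `r` before being fed to the logistic regression.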