BERT with ULMFiT in FastAi

Hello, I started with machine learning about a year ago and am writing my master's thesis on transfer learning, with the main focus on training a model with ULMFiT and a limited amount of data. My goal is to prove that ULMFiT with the fastai library can be applied easily and quickly by beginners like me, without sacrificing too much time on training and hyperparameter tuning, while still getting decent performance on different tasks.

The strange thing is that basic fine-tuning of the classifier mostly ends up with similar or even better results than the model trained with the ULMFiT method. One possible cause is that I rely heavily on the lr_find function, since I have to train multiple BERT models to evaluate all variations of my datasets (IMDB, SST-5 and a self-labeled CounterStrike news article set) without spending too much time on choosing the right learning rate. Each dataset is reduced to 100/75/50/25% of its size to see how much impact the data size has on performance.
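For anyone who wants to reproduce the 100/75/50/25% setup, here is a minimal sketch of one way to build the subsets. It assumes nested subsets (each smaller set is contained in the larger one) drawn from a single seeded shuffle; the post does not say whether the splits were nested or stratified, and `nested_subsets` and the toy data are hypothetical names for illustration only.

```python
import random


def nested_subsets(examples, fractions=(1.0, 0.75, 0.5, 0.25), seed=42):
    """Return {fraction: subset}; every subset is a prefix of one shuffled copy.

    Assumption: nested, non-stratified subsets. Swap in a stratified sampler
    if the class balance has to be preserved exactly.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so every run sees the same subsets
    return {f: shuffled[: int(len(shuffled) * f)] for f in fractions}


# Toy stand-in for (text, label) pairs from IMDB, SST-5, etc.
data = [(f"text {i}", i % 2) for i in range(1000)]
subsets = nested_subsets(data)
for frac, subset in subsets.items():
    print(frac, len(subset))
```

Keeping the subsets nested means a drop in accuracy at 25% is less likely to come from simply having drawn different examples.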

Here is the Colab notebook I wrote to evaluate the different transfer learning methods. Does anyone have a suggestion where the problem could be within my implementation of ULMFiT? My guess is that I should probably spend more time on hyperparameter tuning, but this would contradict the purpose of my thesis, which is that the approach should be usable quickly.

Sorry, I don’t fully understand: are you saying you’re comparing ULMFiT to BERT? BERT will outperform ULMFiT in many cases. From what I remember, ULMFiT can have a small advantage over BERT for data from very different domains where you have little data… but generally BERT will outperform it.

Oh, I am sorry if I did not explain it clearly enough. I am actually using the ULMFiT transfer learning methods (gradual unfreezing, discriminative learning rates and slanted triangular learning rates) to fine-tune a BERT model. So I am applying the concepts of ULMFiT to BERT and want to see whether that improves performance compared to a BERT model that is only fine-tuned by adding a classifier and training the whole model for 3-5 epochs.
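To make the three techniques concrete, here is a rough pure-Python sketch of the schedules as I understand them from the ULMFiT paper. It is not the code from the notebook: the function names are hypothetical, `cut_frac=0.1`, `ratio=32` and the 2.6 decay factor are the values suggested in the paper, and `lr_max=2e-5` is only an illustrative BERT-style learning rate.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=2e-5, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: short linear warm-up, long linear decay.

    Assumes total_steps * cut_frac >= 1, otherwise `cut` would be zero.
    """
    cut = math.floor(total_steps * cut_frac)  # step at which the LR peaks
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(lr_top, n_groups, decay=2.6):
    """One LR per layer group; each lower group gets the LR above it divided by `decay`."""
    return [lr_top / decay ** (n_groups - 1 - i) for i in range(n_groups)]


def trainable_groups(epoch, n_groups):
    """Gradual unfreezing: epoch 0 trains only the last group, then one more per epoch."""
    first_trainable = max(0, n_groups - 1 - epoch)
    return list(range(first_trainable, n_groups))


# Example: 4 layer groups (e.g. embeddings, lower encoder, upper encoder, classifier head)
print(discriminative_lrs(2e-5, 4))
print([trainable_groups(e, 4) for e in range(5)])
print([slanted_triangular_lr(t, 1000) for t in (0, 100, 1000)])
```

How the BERT layers are split into groups is a separate design choice that this sketch leaves open.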

Ah gotcha, interesting work!

From the little I have done and from what I’ve seen people here try, fine-tuning BERT with the nice fastai tricks hasn’t made a dramatic difference over a more naive fine-tuning approach. Having said that, I would love to see a full analysis to prove/disprove that idea…

One thing I haven’t explored but would love to see is whether using the Mish activation function and/or the Ranger optimizer in BERT would yield any improvements. (But to use Mish you are technically changing the architecture slightly, so maybe that’s out of scope for you.)
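For reference, Mish is defined as x * tanh(softplus(x)). Below is a tiny scalar sketch in plain Python just to show the shape of the function; in a real model you would swap a tensor version in for BERT's GELU activations, which is the architecture change mentioned above. The helper names are mine, not from any library.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(x, round(mish(x), 4))
```

Like GELU, Mish is smooth and lets small negative values through, so it is unclear up front how much difference the swap would make.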

Thanks! I will probably make a post with the evaluation of the results when I am finished with my thesis by the end of September.

From what I’ve seen so far you are probably right: it does not make a big difference. I only noticed that when you scale down the amount of data, basic fine-tuning starts to struggle and its performance decreases, whereas the ULMFiT approach keeps almost the same performance.

Thanks for the input, I did not know about Mish. I might look into this activation function when there is enough time, but like you said, changing the architecture is not really desired since I want to keep a very basic approach to prove the effectiveness of ULMFiT when applied to BERT.

That’s already a nice finding…

It sounds like you’re keeping focussed on ULMFiT approaches (which is good, I struggle with focus sometimes :stuck_out_tongue:), but if you have time, testing with Ranger could be a super nice addition. Lots of folks here use it as their default for computer vision now, and ULMFiT was inspired by techniques that worked in vision, so…

If @LessW2020 (Ranger creator) has time, he might be able to give you a few pointers…

(ok ok, ranger sales pitch over now I promise :wink: )

Thanks for the detailed information! It really helps me out a lot, since the only thing I had on my mind the past 2 months was ULMFiT :smiley:

I will definitely try it with Ranger and compare the results to the data I already have.

Feel free to ping me here on this thread if you need any pointers with Ranger too.

Combining Ranger with fit_flat_cos has tended to work better than fit_one_cycle for vision tasks, but my results have been mixed when using it with transformers. Worth experimenting tho.
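In case the difference to one-cycle isn't obvious: the idea behind a flat-then-cosine schedule is to hold the learning rate constant for most of training and only anneal at the end, with no warm-up. Here is a rough pure-Python sketch of that shape. It is not fastai's implementation; `flat_cos_lr` is a hypothetical helper, and `pct_start=0.75` and annealing to `lr_final=0.0` are assumptions I picked for illustration.

```python
import math


def flat_cos_lr(step, total_steps, lr=1e-3, pct_start=0.75, lr_final=0.0):
    """Constant LR for the first `pct_start` of training, then cosine anneal to `lr_final`."""
    flat_steps = int(total_steps * pct_start)
    if step < flat_steps:
        return lr
    # progress runs from 0 to 1 over the annealing phase
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return lr_final + (lr - lr_final) * (1 + math.cos(math.pi * progress)) / 2


schedule = [flat_cos_lr(s, 100) for s in range(101)]
print(schedule[0], schedule[74], schedule[88], schedule[100])
```

Since the slanted triangular schedule from ULMFiT spends most of training decaying, comparing it with this one on the smaller data subsets could be a quick extra experiment.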
