I quickly summarize the results. The goal was to compare the sample efficiency of ULMFiT to other methods. Howard and Ruder call the ULMFiT method “extremely” sample-efficient in their paper. I got different results for the 10kGNAD.
To evaluate the sample efficiency I trained ten models for each of nine subset sizes ranging from 1% to 100%. I report the average error rate for the fastText library, a Support Vector Machine (SVM), a TensorFlow NN, and the ULMFiT method using sub-word tokenization.
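The evaluation loop can be sketched as follows. Note this is a minimal illustration, not my actual scripts: the concrete subset fractions, the `sample_subset` helper, and the `train_and_eval` callback are assumptions I made for the example; only "nine subset sizes from 1% to 100%" and "ten models per size, averaged error rate" come from the setup above.

```python
import random

# Nine subset sizes (fractions of the training set) and ten runs per
# size, mirroring the setup described above. The concrete fractions
# between 1% and 100% are illustrative, not the ones I actually used.
SUBSET_SIZES = [0.01, 0.02, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 1.00]
RUNS_PER_SIZE = 10

def sample_subset(data, fraction, seed):
    """Draw a random training subset of the given fraction for one run."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)

def average_error_rates(data, train_and_eval):
    """Train RUNS_PER_SIZE models per subset size and average their error rates.

    `train_and_eval` is a placeholder callback: it trains one model on the
    given subset and returns its test-set error rate.
    """
    results = {}
    for fraction in SUBSET_SIZES:
        errors = [
            train_and_eval(sample_subset(data, fraction, seed))
            for seed in range(RUNS_PER_SIZE)
        ]
        results[fraction] = sum(errors) / len(errors)
    return results
```

Each of the four methods (fastText, SVM, TensorFlow NN, ULMFiT) would plug in as its own `train_and_eval` callback, producing one averaged error rate per subset size.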
On the smaller subsets the TensorFlow NN has the highest sample efficiency; on the larger subsets, starting from 10%, the SVM outperforms the other methods. The ULMFiT method has a higher sample efficiency only on the 5% subset. Based on these results, I can’t say that the ULMFiT method is “extremely” sample-efficient on the 10kGNAD.
Keep in mind that I was quite limited in terms of GPU power, so someone might be able to find better hyperparameters than I did. Additionally, experiments on a single dataset are hardly representative of the German language, let alone other languages.
I share my scripts here.