It was just 10c_fp16 from yesterday’s class, so IMAGENETTE_160 with only a couple of basic transformations. I did increase the batch size and the number of epochs. I’ll make sure libjpeg-turbo is in use and will try something heavier next.
@wgpubs The SentencePiece tokenizer does not add these special tokens, and neither does spaCy. The tokens are added during preprocessing by the fastai library; we added them in our recent experiments. If you add such tokens, you need to make sure SentencePiece doesn’t break them apart, so you need to list them as special tokens in the SentencePiece params.
The vocab size should be smaller than with regular tokenization; 25k works well, but models with 15k aren’t much worse. Test a few to see which performs best. You can do that at the LM level, as it is possible to compare the perplexity between 25k and 15k.
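One subtlety when comparing LMs with different subword vocabularies: the piece counts differ, so it’s safer to normalise the total loss per word rather than per piece. A minimal sketch of the formula (the loss numbers below are made up):

```python
import math

def word_level_perplexity(total_nll_nats, n_words):
    # exp(average negative log-likelihood per *word*); using words as
    # the unit keeps a 15k-piece and a 25k-piece model comparable,
    # since each segments the same text into a different number of pieces.
    return math.exp(total_nll_nats / n_words)

# hypothetical validation losses measured on the same corpus
ppl_25k = word_level_perplexity(230_000.0, 100_000)
ppl_15k = word_level_perplexity(235_000.0, 100_000)
```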
You can find an example implementation in n-waves/ulmfit-multilingual; the repo will be merged into fast.ai once I manage to put together a decent PR.
SentencePiece has BPE tokenization included; you can select it in the params. I like SentencePiece as the API is OK, though I haven’t tried BPE.
It’s just a playground that is intended to be merged into fast.ai. Currently it has a SentencePiece implementation and lets you easily run tests against IMDB and MLDoc. It is still incompatible with the most recent fast.ai, but I’m planning to fix that.
A few more thoughts. We tested models with SentencePiece on MLDoc (9 languages); the model with a 25k vocab had slightly better performance on German, and comparable (or slightly worse) performance on the other languages, including Russian, which was odd.
It is hard to judge the tokenizer, as the ULMFiT classifier’s performance differs from run to run, and to really test something you need to run the model end to end (~2 days).
The SentencePiece models are a bit faster to train as the vocab is smaller. However, the trick of swapping the vocabulary doesn’t work, so if your initial tokenization didn’t include all the characters (for example emojis) that you are going to use in your downstream task, you probably want to train a full model end to end on (Wikipedia + your corpus).
@mkardas may add more, as he was playing with ULMFiT for hate speech detection.
It’s not just that: the 2080 uses the Turing architecture, which has dedicated cores for fp16 processing. The Maxwell and Pascal architectures (GTX 1080 and prior) will do fp16 math, but no faster than their fp32 operations. So to get the speed improvement, you need a Turing or Volta (really high end) architecture.
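If you want to check this programmatically: NVIDIA assigns each GPU a compute capability (Pascal is 6.x, Volta 7.0, Turing 7.5), which PyTorch reports via `torch.cuda.get_device_capability()`. A pure-Python sketch of the check (the helper name is my own):

```python
def has_fast_fp16(capability):
    # capability is a (major, minor) tuple, e.g. obtained on a CUDA
    # machine via torch.cuda.get_device_capability(0).
    # Tensor Cores (fast fp16) arrived with Volta, compute capability 7.0;
    # Pascal (6.x) and Maxwell (5.x) run fp16 no faster than fp32.
    major, _minor = capability
    return major >= 7

has_fast_fp16((7, 5))  # Turing, e.g. RTX 2080 -> True
has_fast_fp16((6, 1))  # Pascal, e.g. GTX 1080 -> False
```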
Check ulmfit-multilingual; it has Sylvain’s tips incorporated, and it has plenty of tools to run transfer learning from different models without needing to specify all the params (they are saved to .json). For training, I had good results using label smoothing, so you may want to train with that as well. Another good tip is to find a small dataset on which pretraining is fast, to test different hyper-parameters.
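Label smoothing mixes a bit of the uniform distribution into the target, which keeps the classifier from becoming over-confident; fastai ships `LabelSmoothingCrossEntropy` for this. The plain-Python sketch below just shows the formula for a single example (`eps` and the helper name are my own):

```python
import math

def label_smoothed_nll(log_probs, target, eps=0.1):
    # (1 - eps) * NLL of the true class + eps * average NLL over all
    # classes; the second term penalises putting near-zero probability
    # on any class, which regularises the model.
    n = len(log_probs)
    nll_target = -log_probs[target]
    nll_uniform = -sum(log_probs) / n
    return (1 - eps) * nll_target + eps * nll_uniform

# With a uniform prediction over 4 classes the loss is log(4),
# regardless of how much smoothing is applied.
uniform = [math.log(0.25)] * 4
loss = label_smoothed_nll(uniform, target=0, eps=0.1)
```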