An update on the progress of work on ulmfit-multilingual
Getting everything in shape is taking us a bit longer than anticipated, but we are getting there. To test our work I wanted to make a set of scripts that let you train your own version of ULMFiT, from WikiText-103 pretraining through IMDB classification, and get the same accuracy as with Jeremy's scripts.
Even though we still have some issues, if you are eager and happy to do some testing and bug fixing, you can start experimenting now. The most recent version is in the refactoring branch.
The classification script is working fine for two tokenization methods: fastai ('f') and Moses + fastai preprocessing ('vf'). I'm testing pure Moses ('v') right now. There are still some issues with pretraining a language model, but I'm hoping it is just a matter of training time (I've trained for only 10 epochs, and I'm now testing with 20), and we have a remaining issue with the SentencePiece implementation (it needs a bit of testing and love).
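For anyone unsure what the Moses-style tokenization ('v'/'vf') refers to, here is a minimal sketch using the sacremoses package. This is just an illustration of Moses tokenization in general; the package choice and exact calls are my assumption, not necessarily how the scripts wire it up. For example:

# Minimal sketch of Moses-style tokenization, assuming the `sacremoses`
# Python port; the ulmfit scripts may use a different Moses binding.
from sacremoses import MosesTokenizer

mt = MosesTokenizer(lang="en")
# escape=False keeps punctuation as plain characters instead of XML entities
tokens = mt.tokenize("The movie wasn't great, but it wasn't terrible either.", escape=False)
print(tokens)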
The new API is almost done and it lets you run the experiments from the command line or from Jupyter notebooks. The experiment folder has two example notebooks.
To run training from the command line, run:
$ python -m ulmfit lm --dataset-path data/wiki/wikitext-103 --bidir=False --qrnn=False --tokenizer=f --name 'bs40' --bs=40 --cuda-id=0 - train 20 --drop-mult=0.9
This produces the model dir: data/wiki/wikitext-103/models/f60k/lstm_bs40.m
$ python -m ulmfit cls --dataset-path data/imdb --base-lm-path data/wiki/wikitext-103/models/f60k/lstm_bs40.m - train 20
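To make the two-step pipeline explicit: the first command pretrains the language model on WikiText-103 and writes it to the model dir above, and the second fine-tunes an IMDB classifier on top of that language model. Below is a rough sketch of the same recipe written against fastai v1's stock text API (language_model_learner / text_classifier_learner). It is only an illustration of the pipeline, not the API of the new ulmfit module; note also that it starts from fastai's published WikiText-103 weights rather than pretraining from scratch, and exact names may differ across fastai versions.

from fastai.text import *  # fastai v1

# Step 1: fine-tune a language model (roughly the `python -m ulmfit lm ... - train` step)
data_lm = TextLMDataBunch.from_folder('data/imdb')
lm_learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.9)
lm_learn.fit_one_cycle(10)
lm_learn.save_encoder('ft_enc')  # keep the encoder weights for the classifier

# Step 2: fine-tune a classifier on that encoder (roughly the `python -m ulmfit cls ... - train` step)
data_clas = TextClasDataBunch.from_folder('data/imdb', vocab=data_lm.train_ds.vocab)
cls_learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
cls_learn.load_encoder('ft_enc')
cls_learn.fit_one_cycle(10)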
As I said, this is still work in progress, hence it lives in the refactoring branch, but if you are happy to do some debugging and testing, feel free to start using it now.