This thread is to discuss changes and updates for training ULMFiT models on multiple languages. The idea is twofold:
- We want to make it easy for people to download and train models on Wikipedia (or potentially CommonCrawl in the future) in any language.
- We want to provide a model zoo consisting of pretrained language models in many languages that people can then simply fine-tune for their own applications.
The previous ULMFiT scripts consisted of multiple steps (and separate preprocessing), so we’d also like to streamline the process of training and fine-tuning a language model as much as possible.
Here are a few things I’d like to talk about.
- The LM pretraining script (see here): I think we should keep the script that pretrains the language model as simple as possible. As the Wikipedia datasets are already tokenized with Moses, we don’t need to tokenize in this script. It also seemed cleaner to build the vocabulary and convert the tokens to ids directly in the script (a small sketch of what I mean follows after this list).
- The classifier training script (see here): I’m not sure yet what the best way to preprocess here is. I’ve now added the option to do manual preprocessing as before, so that we use the same tokenizer. Ideally, I’d like to use the fast.ai methods for building the vocabulary and preprocessing, but there are some open issues (see below).
- Config file: I think it’d be a good idea to save the hyperparameter settings of a trained model to a config file, to avoid a parameter mismatch between the saved and loaded model (also sketched below).
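To make the first point concrete, here is a rough sketch of building the vocabulary and numericalizing a pre-tokenized Wikipedia file directly in the pretraining script. The file names and the `max_vocab` cutoff are placeholders, not what the script actually uses.

```python
# Rough sketch only: build a frequency-sorted vocabulary from a pre-tokenized
# (Moses) Wikipedia dump and convert the tokens to ids. Paths and max_vocab
# are placeholders, not the script's actual defaults.
from collections import Counter
import numpy as np

max_vocab = 60000
with open("wiki.train.tokens", encoding="utf-8") as f:
    tokens = f.read().split()

counts = Counter(tokens)
itos = ["<unk>"] + [tok for tok, _ in counts.most_common(max_vocab - 1)]
stoi = {tok: i for i, tok in enumerate(itos)}

# tokens outside the vocabulary map to <unk> (id 0)
ids = np.array([stoi.get(tok, 0) for tok in tokens], dtype=np.int64)
np.save("wiki.train.ids.npy", ids)
```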
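For the config file, even a plain JSON dump of the hyperparameters next to the weights would prevent mismatches; the keys and values below are only illustrative.

```python
# Illustrative only: persist the LM hyperparameters next to the saved weights
# so the loading side can rebuild exactly the same architecture.
import json
from pathlib import Path

config = {"emb_sz": 400, "n_hid": 1150, "n_layers": 3, "qrnn": False, "max_vocab": 60000}
model_dir = Path("models")
model_dir.mkdir(exist_ok=True)
(model_dir / "lm_config.json").write_text(json.dumps(config, indent=2))

# at load time, read the config back and build the model from it
config = json.loads((model_dir / "lm_config.json").read_text())
```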
The LM pretraining (see here): Using the `TextLMDataBunch.from_ids` method does not work, as `BaseTextDataset` has no `loss_func` attribute, which is required during training. As a workaround I created a placeholder class, `DataStump`. It would be nice if this could be handled directly in the method; the previous version’s `LanguageModelData` took care of this, but it no longer exists.
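This is roughly the kind of placeholder I mean; the actual `DataStump` in the script may look a bit different, the point is just that it carries the attributes the training loop expects, including `loss_func`:

```python
# Simplified sketch of the workaround, not the exact class from the script:
# a stand-in that exposes loss_func (missing from BaseTextDataset) alongside
# the pre-numericalized ids and the vocabulary.
import torch.nn.functional as F

class DataStump:
    "Placeholder dataset carrying whatever the training loop looks up."
    def __init__(self, ids, vocab):
        self.ids, self.vocab = ids, vocab
        self.loss_func = F.cross_entropy  # the attribute the Learner requires
```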
The classifier training script (see here): What I find unintuitive about the fast.ai methods is that when tokenization is used, the special field identifier is automatically prepended to the input (see here). People who are not familiar with fast.ai will probably miss this, which can lead to confusion. It would be better if all the preprocessing and special rules were specified in the tokenizer. The same applies to the vocabulary: the token for unknown words is `<unk>` by default in the WikiText datasets and in many other datasets, so it’d be nice to keep using it. The fast.ai `Vocab` currently hardcodes the UNK value (see here); a small sketch of what I have in mind instead follows after the next point.
`RNNLearner.classifier` should allow setting the dropout values rather than hard-coding them. For QRNNs, I’d have to fall back to the more verbose `RNNLearner` formulation, which means even more code just to define the classifier.
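Here is the kind of interface I have in mind for the vocabulary. This is a proposal sketch, not the current fastai `Vocab`; the method names are just chosen to mirror it:

```python
# Proposal sketch (not the current fastai Vocab): the unknown token is an
# argument instead of a hard-coded constant, so <unk>-based datasets keep working.
class Vocab:
    def __init__(self, itos, unk_token="<unk>"):
        if unk_token not in itos:
            itos = [unk_token] + itos
        self.itos, self.unk_token = itos, unk_token
        self.stoi = {tok: i for i, tok in enumerate(itos)}
        self.unk_id = self.stoi[unk_token]

    def numericalize(self, tokens):
        "Map tokens to ids, sending out-of-vocabulary tokens to the unknown id."
        return [self.stoi.get(tok, self.unk_id) for tok in tokens]

    def textify(self, ids):
        "Map ids back to a space-separated string of tokens."
        return " ".join(self.itos[i] for i in ids)
```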
- After pulling the latest changes, I get the following error, which seems related to drawing the progress bar: `File "/home/ubuntu/fastai/fastai/basic_train.py", line 216, in on_train_begin self.pbar.write(' '.join(self.names), table=True) TypeError: write() got an unexpected keyword argument 'table'`
Big remaining todos:
- QRNN training has been slow/broken since a recent update. Need to look into that.
- Add subword tokenization and support training with subword units (a rough SentencePiece sketch is included below).
- Add training and fine-tuning of bidirectional LM.
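For the subword todo, something along these lines could serve as a starting point, using SentencePiece. The file names and vocabulary size are placeholders, and nothing here is wired into the current scripts yet:

```python
# Sketch: train a SentencePiece model on the Wikipedia text and encode text
# into subword units. Paths, vocab size, and model type are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.Train(
    "--input=wiki.train.txt --model_prefix=wiki_sp "
    "--vocab_size=30000 --model_type=unigram")

sp = spm.SentencePieceProcessor()
sp.Load("wiki_sp.model")
print(sp.EncodeAsPieces("Pretraining a language model on Wikipedia"))
print(sp.EncodeAsIds("Pretraining a language model on Wikipedia"))
```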