Hi @jolackner,
As I want to train a French LM on GCP, I’m searching for the right configuration and in particular the training GPU time I will face.
I found in your link to Wikipedia articles count that in the last count (dec. 2018), there was 1.75 more articles in French (2.1 M) than in Vietnamese (1.2 M). However, it does not mean that the training of my French LM will be 1.75 bigger than the Vietnamese one.
In fact, your post gave me the idea to compare not the number of Wikipedia articles but my French databunch with the Vietnamese one created in the nn-vietnamese.ipynb notebook of Jeremy (note: the 2 databunches are created with nplutils.py from the course-nlp github).
Vietnamese databunch (bs = 128)
- number of text files in the docs folder = 70 928
- size of the docs folder = 668 Mo
- size of the vi_databunch file = 1.027 Go
French databunch (bs = 128)
- number of text files in the docs folder = 512 659 (7.2 more files)
- size of the docs folder = 3.9 Go (5.8 bigger)
- size of the fr_databunch file = 5.435 Go (5.3 bigger)
If we use only the databunch size as ratio and with all notebooks parameters identical and same GPU configuration as Jeremy, the 28mn30 by epoch for the training of the Vietnamese LM learner should be 28mn30 * 5.3 = 2h30mn by epoch to train the French LM learner.
I started with one NVIDIA Tesla T4 (batch size = 128) but the epoch training time (ETT) was about 6h.
Then, I’m testing one NVIDIA Tesla V100 with the same bs and my ETT decreased to 2h10mn (see screen shot).
Note: Jeremy said that he used a TITAN RTX from the SF university but this GPU does not exist on GCP.
Great? Yes in terms of ETT but I’m still facing hard time with GCP. From the third epoch, nan values began to be displayed (see screen shot). For info, I’m using learn.to_fp16()
and an initial Learning Rate (LR) of 1e-2 that was given by learn.lr_find()
(see screen shot) but in reality 1e-2 * (128/48) = 2.6e-2 as I followed the code of Jeremy.
learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, pretrained=False).to_fp16()
I guess that the nan values mean that my losses diverge (ie, LR too high)?
So my question to @rachel and @jeremy: why do we need to scale our LR? Should I keep my LR of 1e-2 ? Thanks.