A Code-First Introduction to Natural Language Processing 2019

For fp16() all matrixes have to have dimensions divisible by 8. So, we add tokens that are “fake” to get up to 6016 which is divisible by 8.


Thanks @bfarzin for you great answer.

I found the code of your explanation in the fastai libray from line 155 to 157 in the file transform.py:

[155] itos = itos[:max_vocab]
[156] if len(itos) < max_vocab: #Make sure vocab size is a multiple of 8 for fast mixed precision training
[157]       while len(itos)%8 !=0: itos.append('xxfake')

Hi, in the notebook 3-logreg-nb-imdb.ipynb from the video 4, Rachel introduces the coefficient b to get the predictions on the valid dataset in the naïve Bayes sentiment classifier (see screenshot): why?

@pierreguillou Talking completely off the top of my head (I’ve not seen the video):

b is log-likelihood ratio of positive and negative class populations.
And it appears to play the role of a bias term in a linear model.

So I think b might be a correction for bias due to class imbalance.

Hello, I’m wondering what platform jeremy used to train the notepad vietnamese-nn.ipynb with an epoch time inferior to 30mn?

Since training an LM from scratch (for languages other than English) requires lot of resources (because of the corpus size), it would be nice to explain what GPU configuration is needed in AWS, GCP… (ie, in fastai GPU tutorials: https://course.fast.ai/gpu_tutorial.html).

If the answer has already been given in the fastai forum, thanks to give the link to the post (cc @jeremy).

I would have trained it on our university computer, which has Titan RTX cards.


I was wondering the same (while trying to apply those great NLP notebooks to a bidirectional German LM). I found that Vietnamese has about 1/3 less articles than French or German Wikipedia - but in the end I still downsized my ambition to just 20% of those 2.3 Million German Wikipedia entries on a GCP P4… :slight_smile:

1 Like

Hello @jolackner. I think this is an important point for Rachel’s NLP course to really be open to everyone (I mean: to be able to run all notebooks from course-nlp github).

For my part, I spend a lot of time on the GCP platform to get a fast but inexpensive instance to train a Language Model from scratch (in fact, in French and Portuguese = large corpus). But every time I think I found the right configuration, a problem occurs during the training (last problem: SSH connection stopped by GCP).

From experimented fastai users, I would love to get a tutorial on the instance configuration needed to train a LM from scratch on GCP and AWS for example, with a corpus similar to the English one (it means: a huge corpus!).


I totally agree, @pierreguillou . Would be great to get expert info on which instance / memory type should be appropriate in order to successfully train a full Wikipedia LM - I ran into quite some memory errors on GCP, and yes, my preemptible instance was terminated a couple of times before I managed to finish training.
On a side note I think that the Language Model Zoo should be more populated. Let all those beautiful (and smaller) languages roar and be made amenable to NLP tasks thanks to fastai!

1 Like

Thanks Jeremy. I imagine that using an NVIDIA V100 on GCP would give a similar result but the problem is the stability of the SSH connection to the cloud instance (using a university network helps for that). Do you have any tips to help train a LM from scratch on GCP (with a Wikipedia corpus size similar to the English one like the French one)?

The stability of the ssh connection shouldn’t matter. Just make sure you’re always running in a tmux session, so you can always re-connect later.

1 Like

I believe that @piotr.czapla and friends are working on that at the moment! :slight_smile:


tmux is magic. I resisted using it for so long because I thought the setup was going to be annoying and the only thing I had to do was apt-get install tmux and it started working for me. No more Jupyter sessions being killed because I shut my laptop lid!


Thanks Jeremy. Thanks to your answer, I found in “Known issues” on the GPC website the following warning that confirms both my problem and your tmux solution:

**Intermittent disconnects** : At this time, we do not offer a specific SLA for connection lifetimes. Use terminal multiplexers like [tmux](https://tmux.github.io/) or [screen](http://www.gnu.org/software/screen/) if you plan to keep the terminal window open for an extended period of time.

I hope this will also help other fastai users to train DL models online on huge datasets like LMs from scratch.

[ EDIT ] For people without technical background : the idea is to launch your jupyter notebook from a session opened online (on GCP for example) through tmux and not from your ubuntu terminal in your computer. Thus, even if your ssh connection stops, the session used to launch your jupyter notebook is still running online :slight_smile: More information in this post.


Could someone clarify why vocab.stoi and vocab.itos have different lengths? I’ve watched rachel’s video a couple of times and i’m still unclear. I understand that vocab.itos has all the unique words but then doesn’t vocab.stoi also have all the unique words?

Hi @kausik. You already gave halh of the answer. The whole answer is in the fastai text.transform page.

text.tranform contains the functions that deal behind the scenes with the two main tasks when preparing texts for modelling: tokenization and numericalization.

Tokenization splits the raw texts into tokens (which can be words, or punctuation signs…). The most basic way to do this would be to separate according to spaces, but it’s possible to be more subtle; for instance, the contractions like “isn’t” or “don’t” should be split in [“is”,“n’t”] or [“do”,“n’t”]. By default fastai will use the powerful spacy tokenizer.

Numericalization is easier as it just consists in attributing a unique id to each token and mapping each of those tokens to their respective ids.

For example, these 2 tasks are done by the factory method from_folder() of the TextList class when you create the databunch of the corpus thanks to the Data Block Api.

data = (TextList.from_folder(dest)
            .split_by_rand_pct(0.1, seed=42)
            .databunch(bs=bs, num_workers=1))

If you look at the source code of the from_folder() method, you see for example the parameters max_vocab = 60000 (no more than 60 000 tokens in the vocabulary) and min_freq = 2 (a token is kept in the vocabulary if it appears at least twice in the corpus).

With these parameters (but there are others), we can understand that the vocab.itos which is the list of unique tokens is constrained (limited to 60 000 tokens with the highest frequency of occurrence, etc.) and then smaller in size than the vocab.stoi dictionary which contains all the tokens of the corpus.

Finally, the vocab.stoi dictionary has tokens as keys and their corresponding ids in vocab.itos as values. Thus, all tokens not belonging to the vocabulary will be mapped to the id of the special xxunk token (unknown).

Hope it helps.


Hi @jolackner,

As I want to train a French LM on GCP, I’m searching for the right configuration and in particular the training GPU time I will face.

I found in your link to Wikipedia articles count that in the last count (dec. 2018), there was 1.75 more articles in French (2.1 M) than in Vietnamese (1.2 M). However, it does not mean that the training of my French LM will be 1.75 bigger than the Vietnamese one.

In fact, your post gave me the idea to compare not the number of Wikipedia articles but my French databunch with the Vietnamese one created in the nn-vietnamese.ipynb notebook of Jeremy (note: the 2 databunches are created with nplutils.py from the course-nlp github).

Vietnamese databunch (bs = 128)

  • number of text files in the docs folder = 70 928
  • size of the docs folder = 668 Mo
  • size of the vi_databunch file = 1.027 Go

French databunch (bs = 128)

  • number of text files in the docs folder = 512 659 (7.2 more files)
  • size of the docs folder = 3.9 Go (5.8 bigger)
  • size of the fr_databunch file = 5.435 Go (5.3 bigger)

If we use only the databunch size as ratio and with all notebooks parameters identical and same GPU configuration as Jeremy, the 28mn30 by epoch for the training of the Vietnamese LM learner should be 28mn30 * 5.3 = 2h30mn by epoch to train the French LM learner.

I started with one NVIDIA Tesla T4 (batch size = 128) but the epoch training time (ETT) was about 6h.
Then, I’m testing one NVIDIA Tesla V100 with the same bs and my ETT decreased to 2h10mn (see screen shot).

Note: Jeremy said that he used a TITAN RTX from the SF university but this GPU does not exist on GCP.

Great? Yes in terms of ETT but I’m still facing hard time with GCP. From the third epoch, nan values began to be displayed (see screen shot). For info, I’m using learn.to_fp16() and an initial Learning Rate (LR) of 1e-2 that was given by learn.lr_find() (see screen shot) but in reality 1e-2 * (128/48) = 2.6e-2 as I followed the code of Jeremy.

learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, pretrained=False).to_fp16()

I guess that the nan values mean that my losses diverge (ie, LR too high)?
So my question to @rachel and @jeremy: why do we need to scale our LR? Should I keep my LR of 1e-2 ? Thanks.

Training can be a little more flaky with fp16. Try making your LR 10x lower and see how it goes.

Thank you @pierreguillou for pointing out that the databunch size is much more relevant for training duration!

Re: the learning rate - I trained my German Wikipedia language model with lr = 3e-3 (and additionally scaled by batch size bs/48) after seeing similar problems with exploding losses.
My lr finder curve looked similar to your curve, although I ran a Sentencepiece tokenization, as German is heavy on concatenated words (“Donaudampfschifffahrtsgesellschaftskapitän” is a favourite).

After 10 epochs (GCP, P4, fp16, 2hrs/epoch), I got to 43.7% accuracy with the fwd model…

… and 48% accuracy with the backwards model (I restarted training, printout from last 5 epochs):

I’d like to better understand what the higher backwards prediction accuracy says about the German language (Jeremy mentioned in the videos that Vietnamese seems easier to predict backwards than forwards) and what use could be made of that. For a downstream regression task it didn’t seem to make a big difference so far.

Hello @jolackner. I’m impressed by your “2 hours by epoch on a P4”!
Could you tell us more about your data and parameters ? (size of vocab.itos and vocab.stoi, size of your dataset used to create your databunch, batch size, drop_mult in your learner…). Thank you.

Note: I asked as well yesterday a question relative to the dataset size to be used for training in the Language Model Zoo thread.