A Code-First Introduction to Natural Language Processing 2019

Hello @pierreguillou. It’s really not impressive at all, I simply reduced the Wikipedia dataset successively until I could generate a databunch without the GCP session freezing up on me - Sentencepiece tokenization is more memory intensive than Spacy tokenization.

In the end my German fwd databunch had a size of 939 MB (same size bwd, obviously). Batch size was bs=128 and I don’t think I changed drop_mult from the notebook suggestions (but I can check once home, can’t ssh into GCP from here). I used the default sentencepiece vocab size of 30k.

EDIT: The 939 MB databunch contained 113 Million subword tokens. I followed the “Turkish ULMFit from scratch” notebook, not straying from the suggested drop_mult=0.1 & wd=0.1 .

Thanks @jolackner.

On my side, I reduced by 5 the size of my databunch of the French wikipedia dataset from 5.435 Go to 1.077 Go in order to have a corpus of about 100 millions of tokens as adviced by Jeremy.

Note: if I understand well that using a databunch of 1 Go instead of 5 Go will speed up the ETT (Epoch Traing Time) of my Language Model, I think that I’m loosing a lot of language knowledge. Do you have an opinion on that?

On this base, I’m training today my French LM with the learner parameters values of the nn-vietnamese.ipynb notebook:

# bs = 128
# len(vocab.itos) = 60 000
learn = language_model_learner(data, AWD_LSTM, drop_mult=0.5, wd=0.01, pretrained=False).to_fp16()

After that, I will train it from scratch with the learner parameters values of the nn-turkish.ipynb notebook like you did (ie, drop_mult=0.1 & wd=0.1):

learn = language_model_learner(data, AWD_LSTM, drop_mult=0.1, wd=0.1, pretrained=False).to_fp16()

I will published the results afterwards.

The default model size isn’t big enough to take advantage of it. I believe that @bfarzin has had some success with larger models and a larger corpus, however.

Hello @bfarzin. After Jeremy’s post, I looked in the forum for information about your larger LM model with a larger corpus. I found a post and a blog post from you about using QRNN instead of AWD-LSTM. This is the way to train a LM with a larger corpus? Thanks.

I’m away from my machine this week and can look more in depth next week. But iirc I used titan GTX which has 24 Gb memory. My experience was that SentencdPiece worked well and took far less memory. QRNN also helped on memory but I did not try with fp16 because that was not supported at he time (I think it’s supported now)

You do get better results when you have a bigger corpus in general because you want the tokenization to see as big a vocab as possible.

I also had luck with distributed learning across multiple AWS machines. But that can be expensive to run to reduce time per epoch.

Coming, what is your objective? Build an LM from scratch?

Hi @KevinB. Clearly, it is a good pratice to use tmux to deal with running a Jupyter Notebook on an online GPU. I found in the forum a short howto guide and a post about “Restoring Jupyter client’s GUI (scores and graphs…etc) after client disconnects for long time” (and a “tmux shortcuts & cheatsheet” page).

About GCP: as the notebook server is launched with the instance in GCP, I’m not sure if tmux is needed (a ssh disconnection will not stop the notebook server) but it is good practice. I do the following:

  1. SSH connection to my fastai instance:
    gcloud compute ssh --zone=MY-ZONE-INSTANCE jupyter@MY-FASTAI-INSTANCE

  2. Launch of my tmux session:
    tmux new -s fastai

  3. Creation of 2 vertical panes:
    ctrl b + %

  4. Launch of the notebook server on port 8081 (as the port 8080 is already used):
    jupyter notebook --NotebookApp.iopub_msg_rate_limit=1e7 --NotebookApp.iopub_data_rate_limit=1e10 --port 8081 --no-browser

  5. Detach my tmux session
    ctrl b + d

  6. Stop my ssh connection:

  7. SSH connection again to my fastai instance but now with association between the port 8081 of the jupyter notebook server on my GCP fastai instance and my local port 8080 on my computer
    gcloud compute ssh --zone=MY-ZONE-INSTANCE jupyter@MY-FASTAI-INSTANCE -- -L 8080:localhost:8081

That’s it. I can run my notebooks from my GCP fastai instance at http://localhost:8080/tree on a web browser. If my ssh connection stops, my Jupyter server is still running in the tmux session on GCP. I have only to redo the step 7 above and I get back my notebooks which were not stopped by the ssh disconnection.


Nice, that looks like great information! I don’t use GCP myself, but that information seems super useful for everybody that does.

One think you might want to look into is tmuxp which can do a lot of that pre-setup for you. You can read more about it here

So in this lesson https://github.com/fastai/course-nlp/blob/master/6-rnn-english-numbers.ipynb

I am bit confused about the target section

def loss4(input,target): return F.cross_entropy(input, target[:,-1])
def acc4 (input,target): return accuracy(input, target[:,-1])

Why is target given as target[:,-1]? I am bit confused about what’s happening under the hood here?

If you are looking for a labeled dataset in English US, English UK, French, German or Japanese to train and test your ULMFiT classifier, you can download the Amazon Customer Reviews Datasets.

If you need help, I published a download guide.

Thanks for your answer. In order to understand the NLP course, I run all the notebooks. This helps to understand the course and in addition, the problems encountered allow to understand the limits of the LM models (a LM model even through ULMFiT is not magic, it will not generate great texts, it takes time to be trained, IRL there are ssh disconnections when training a model for a long time, sometimes it is hard to launch your GPC instance because “GCP does not have enough resources available to fulfill the request”, you can change the LM architecture from AWD-LSTM to QRNN but for what benefits and how, the same for tokenizing from SpaCy to SentencePiece but for what benefits and how, etc.).

I did succeed to train a French LM from scratch with a 100 millions tokens dataset by using the nn-vietnamese.ipynb notebook (I will publish it) but now, I would like to get a better LM to test for example the generation of synthetic text. That’s why I’m trying to use all my Wikipedia dataset (almost 500 millions tokens) but for that I need a deeper LM or a different one (GPT-2, BERT…) or/and others parameters or/and more GPUs. Any information/advice is welcome. Thanks.

1 Like

Efficient multi-lingual language model fine-tuning Written: 10 Sep 2019 by Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, and Marcin Kadras


Hi @pierreguillou - regarding your question about benefits of SpaCy vs SentencePiece tokenization: the article’s (see Joseph Catanzarite’s post above) authors Eisenschlos, Ruder, Czapla, Kardas, Gugger and Howard come to the conclusion that for their new “MultiFiT” model (with 4 QRNN layers):

Subword tokenization outperforms wordbased tokenization on most languages, while being
faster to train due to the smaller vocabulary size.

The advantage seems more marked for Italian and German than for French. They found a Sentencepiece vocabulary size of 15k to be performing best and trained their multilingual models on 100 million tokens.


[ EDIT 09/23/2019 ] : I did finally trained a French Bidirectional Language Model with the MultiFiT configuration. It overpasses my other LMs (see my post) :slight_smile:

Thanks @jcatanza and @jolackner. I agree with you: this MultiFiT configuration (see last page of the paper) must be tested and should give better results than the ULMFiT one.

If I found time, I will (by the way, I’ve already published my results using the ULMFiT configuration for a French LM and Sentiment Classifier).

1 Like

Hello, folks of this awesome community.
I am following the lectures of the NLP course at the moment. That’s how I learned about fast.ai in the first place. Now that I am learning more about the community I got super interested in going through the other course(Practical Deep Learning, not sure about the name). I am especially excited about starting to learn Swift! Do you have any suggestions on how to best follow the material? Finish the NLP course and the start Part 1 of Deep Learning? Or just straight into Part 2? Or just pick and choose? Thanks for any suggestion.

Update: I am at lesson 11 of the NLP course.

I really enjoyed going through this course over the summer and have been inspired to publish my first Medium post. It’s about using a pre-trained BERT model to solve English exam ‘gap-fill’ exercises. As a sometimes English teacher I was interested to see how the language model performed on a task that students struggle with. Spoiler: BERT did amazingly well, even on very advanced level exams.


Welcome! I would recommend part 1, this will allow you to start using deep learning for your own projects. Save part 2 for later when you want to dig deeper into the code and understand more about how the library works. The Swift developments are exciting if you are keen on code but part 1 is the place to go to learn first of all how to do deep learning.

Congratulations on your progress with the NLP course. I put the audio on my phone and repeatedly listened to the classes while running on the beach over the summer until I felt I understood the whole picture. Code-first away from a computer!


Thanks Alison, I will keep this in mind when I am done with the NLP course.
Meanwhile I developed my own little and cute iOS app to learn Swift :stuck_out_tongue:
I like your innovative approach to consuming the course material, I bet the nice viewing helped!

1 Like

Hi all, i’m studying the notebook 3-logreg-nb-imdb.ipynb in colab but i’ve a strange error in this cell

x_nb = xx.multiply(r)
m = LogisticRegression(dual=True, C=0.1)
m.fit(x_nb, y.items);

val_x_nb = val_x.multiply®
preds = m.predict(val_x_nb)

the scipy version in colab are 1.3.1 and the xx shape are (800, 260374) and r are (260374,).
The error is

ValueError: column index exceeds matrix dimensions

at row x_nb = xx.multiply®

Can someone help me ?

/usr/local/lib/python3.6/dist-packages/scipy/sparse/coo.py in _check(self)
285                 raise ValueError('row index exceeds matrix dimensions')
286             if self.col.max() >= self.shape[1]:

–> 287 raise ValueError(‘column index exceeds matrix dimensions’)
288 if self.row.min() < 0:
289 raise ValueError(‘negative row index found’)

Thanks a lot

Hello @jeremy

I have a quick question on your Text generation algorithms (NLP video 14)
How about taking something of both worlds (beam’s context awareness and sampling’s surprising-ness) by baking sampling inside of each Beam Search Step? Has anyone tried this before?

I tried this while training a language model on my own writing and it was the most realistic among all generation decoders I tried.

Here are the samples from my experiments, input this is about


This is about the problem is to be a lot of statement to a state of the strength and weakness of the problem is to be a lot of statement


This is about speeches naws, because we do pers customer more. doing my better. 1. i could intelligence ⢠be resporl new 10-2019 practical aug 27 - 47 meeting.

Beam Search

This is about this text for? what do i expect to find in the text? is there any major empirical conclusion and reading after reading what i have (or have not) understood? (make your own note of the text? is there any major empirical conclusion and reading after reading what i have (or have not) understood?

Beam Search + Sampling

This is about the problem with the company and the problem and personal information and reading what i have (or have not) understood? (make your own note of the decision making and complexity is a consulting in the text? is there any major empirical conclusion and reading what are the main ideas presented in the company and then and the problem and predictions and problems and discussion of the problem and responsibilities and discussions and then and have the relationship of the company to learn from the problem with the world.

Thanks for all the great lectures.


Tiny comment about Implementing a GRU (NLP video 15).

hidden state = z_t ⊗ n_t + (1 - z_t) ⊗ h_t-1
where n_t is newgate and z_t is updategate

updategate*h + (1-updategate)*newgate

I guess that pragmatically the end up in the same thing but the slide version update the hidden state with the “opposite” of the updategate :slight_smile: