A Code-First Introduction to Natural Language Processing 2019

@rachel @jeremy I just listened to the first lesson in the new Fastai course “A Code-First Introduction to Natural Language Processing”. I’m both grateful and excited to use this course to learn about NLP! Thanks so much for creating the course and making it available online!

Is there a “Category” in the Fastai Forum for this course? If not, could you please create one, or let me know the appropriate category for posts relating to the new course? Again, thank you so much!

edit from Jeremy: let’s use this thread for discussion, and if it’s popular, we’ll create a separate category

10 Likes

I second that suggestion.
I tried executing the first few Jupyter notebooks and there are a few bugs. In the past, the machine learning and deep learning courses had a specific forum to report those bugs and discuss their resolution.

2 Likes

Until we have a Category for the new course, we’ll use this thread for related posts.

Here is the announcement for the new course.

The course videos are on YouTube.

The course notebooks are on GitHub.

6 Likes

Preparing to run the course notebooks

I’ll assume that you have a working installation of fastai v1.0 under the latest Anaconda (version 2019.03 with build channel py37_0), and that you have activated the environment you created for fastai.

Follow these steps to prepare your environment to run the first course notebook:

(1) Install the course materials from github
git clone https://github.com/fastai/course-nlp.git

(2) Install scikit-learn, a Python machine learning library
conda install scikit-learn

(3) Install nltk, the Natural Language Toolkit, a library for Natural Language Processing that is widely used for teaching and research.
conda install -c anaconda nltk

(4) Install spaCy, a library for “Industrial-Strength Natural Language Processing”
conda install -c conda-forge spacy

(5) Download an English language model for spaCy
python -m spacy download en_core_web_sm

(6) Install fbpca, a library for “Fast computations of PCA/SVD/eigendecompositions via randomized methods”
pip install fbpca

After this, you should be able to run the first notebook with code, which is 2-svd-nmf-topic-modeling.ipynb
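As a quick sanity check before launching Jupyter (just a suggestion, using only the packages installed in steps 2–6), you can confirm that the imports resolve and that the spaCy model loads:

import sklearn, nltk, spacy, fbpca
nlp = spacy.load('en_core_web_sm')   # the English model downloaded in step 5
print('environment looks OK')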

I will continue to update this post in case the infrastructure needs to be extended in order to run subsequent notebooks in the course.

16 Likes

Clouderizer template for fastai NLP:

1 Like

I am wondering whether anyone has expertise/steps on how to deploy an NLP model to production (like a website or mobile app)? Thanks in advance!

I’ve had good experiences with Render. Here’s a workshop I did for my class/study group this week on deployment; NLP is in there too: Deployment on Render UWF

2 Likes

Hi Zachary, thank you very much for sharing; this is very valuable to me. I am trying to follow your example for “Cats vs Dogs”: when I run learn.export('pets.pkl') and get a .pkl file, where can I find that “pets.pkl” file? I am running the notebook on Google Colab. Thanks again!

Hey Anthony, the notebook is just there as a general guide to show where to modify the serve.py file (or server.py). I had my students run lesson 1 in the background. Use it as a reference, not something to plug and run in the notebook :slight_smile: However, you should see where I change the learner’s path; I believe you may have skipped that.

Though in the future I’d direct questions like this to a separate thread, so as not to lead away from the thread’s original direction.
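For reference, here is a minimal sketch of how export and reload work in fastai v1 (assuming learn is the lesson-1 learner; the file name is just an example). By default the file is written to learn.path, which on Colab usually points inside the downloaded dataset folder:

from fastai.vision import *

learn.export('pets.pkl')                      # writes the file to learn.path/'pets.pkl'
print(learn.path)                             # shows the directory where pets.pkl landed

# later, e.g. in serve.py, reload the exported learner:
learn = load_learner(learn.path, 'pets.pkl')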

Hi, in the notebook 3-logreg-nb-imdb.ipynb from video 4, Rachel shows that vocab.itos has 6010 tokens and displays the last one as sollett.

However, when I run the notebook, I get 6016 tokens, and the last 6 are all xxfake (see screenshot below).

How is it possible for the same token to appear more than once in vocab.itos? (And what does the token xxfake mean?)

For fp16(), all matrices have to have dimensions divisible by 8. So we add “fake” tokens to get up to 6016, which is divisible by 8.
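A quick back-of-the-envelope check of that (illustrative Python, not the notebook’s code):

vocab_size = 6010
padding = (-vocab_size) % 8            # tokens needed to reach the next multiple of 8
print(padding, vocab_size + padding)   # 6 6016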

5 Likes

Thanks @bfarzin for your great answer.

I found the code behind your explanation in the fastai library, at lines 155 to 157 of the file transform.py:

[155] itos = itos[:max_vocab]
[156] if len(itos) < max_vocab: #Make sure vocab size is a multiple of 8 for fast mixed precision training
[157]       while len(itos)%8 !=0: itos.append('xxfake')
4 Likes

Hi, in the notebook 3-logreg-nb-imdb.ipynb from video 4, Rachel introduces the coefficient b to get the predictions on the validation set in the naïve Bayes sentiment classifier (see screenshot): why?

@pierreguillou Talking completely off the top of my head (I’ve not seen the video):

b is the log of the ratio of the positive and negative class populations (the log prior odds).
And it appears to play the role of a bias term in a linear model.

So I think b might be a correction for bias due to class imbalance.
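In case it helps, here is a toy sketch of those two terms with made-up counts (illustrative only, not the notebook’s exact code): r is the log-count ratio per token, and b is the log of the ratio of class priors, which is why it acts like a bias and corrects for class imbalance.

import numpy as np

# toy term-document matrix: rows = documents, columns = vocabulary tokens
x = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 2, 3],
              [1, 3, 2]])
y = np.array([1, 1, 0, 0])                      # 1 = positive review, 0 = negative

p = x[y == 1].sum(0) + 1                        # smoothed token counts, positive class
q = x[y == 0].sum(0) + 1                        # smoothed token counts, negative class
r = np.log((p / p.sum()) / (q / q.sum()))       # log-count ratio per token
b = np.log((y == 1).mean() / (y == 0).mean())   # log ratio of class priors (the "bias")

preds = (x @ r + b) > 0                         # predict positive when log-odds > 0
print(preds)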

Hello, I’m wondering what platform Jeremy used to train the notebook vietnamese-nn.ipynb with an epoch time under 30 minutes?

Since training an LM from scratch (for languages other than English) requires a lot of resources (because of the corpus size), it would be nice to explain what GPU configuration is needed on AWS, GCP, etc. (i.e., in the fastai GPU tutorials: https://course.fast.ai/gpu_tutorial.html).

If the answer has already been given on the fastai forum, please share the link to the post (cc @jeremy).

I would have trained it on our university computer, which has Titan RTX cards.

2 Likes

I was wondering the same (while trying to apply those great NLP notebooks to a bidirectional German LM). I found that Vietnamese Wikipedia has about a third fewer articles than French or German Wikipedia - but in the end I still downsized my ambition to just 20% of those 2.3 million German Wikipedia entries on a GCP P4… :slight_smile:

1 Like

Hello @jolackner. I think this is an important point if Rachel’s NLP course is to really be open to everyone (I mean: for everyone to be able to run all the notebooks from the course-nlp GitHub repo).

For my part, I have spent a lot of time on the GCP platform trying to get a fast but inexpensive instance to train a language model from scratch (in French and Portuguese, i.e. large corpora). But every time I think I have found the right configuration, a problem occurs during training (most recently: the SSH connection was dropped by GCP).

From experienced fastai users, I would love to get a tutorial on the instance configuration needed to train an LM from scratch on GCP or AWS, for example, with a corpus similar in size to the English one (that is: a huge corpus!).

2 Likes

I totally agree, @pierreguillou. It would be great to get expert info on which instance/memory type is appropriate to successfully train a full Wikipedia LM - I ran into quite a few memory errors on GCP, and yes, my preemptible instance was terminated a couple of times before I managed to finish training.
On a side note, I think the Language Model Zoo should be more populated. Let all those beautiful (and smaller) languages roar and be made amenable to NLP tasks thanks to fastai!

1 Like

Thanks Jeremy. I imagine that using an NVIDIA V100 on GCP would give a similar result, but the problem is the stability of the SSH connection to the cloud instance (using a university network helps with that). Do you have any tips for training an LM from scratch on GCP (with a Wikipedia corpus comparable in size to the English one, such as the French one)?