A walk with fastai2 - Text - Study Group and Online Lectures Megathread

Please use this topic for the NLP portion of A walk with fastai2!

FAQ:

Videos:

Link to the stream:

I will update this link a few hours before each week’s stream. Here is the one for this week:
It will not go live until 5pm CST.

Notebooks we have covered:

01_Intro

Schedule

This schedule is subject to change.

Block 3, Text (April 8th - Early May (TBD)):

  • Lesson 1: Introduction to NLP and the LSTM
  • Lesson 2: The LSTM, More on Tokenizers, Ensembling
  • Lesson 3: Other State-of-the-Art NLP Models
  • Lesson 4: Multi-Lingual Data, DeViSE

Closing notes

This will be my first time live-streaming, so this will be an experiment for everyone, but I have high hopes that with your help this will turn out to be a successful study group! Please use this thread for questions and for starting discussions about the material; we’re all learning fastai (and especially the second version) together! I will update this post with YouTube links to the livestreams and post them on this thread as well. Looking forward to seeing everyone next month!!!

(Also, a minor PSA: this is in no way for any credit whatsoever. I am just an undergraduate student wanting to help others learn how to use this amazing library to its fullest potential. Instead of worrying about credit, try putting what you’ve learned into a project or two and some blog posts; this is evidence that you know the material, often better than a slip of paper can show :slight_smile: )

6 Likes

Okay guys so here is the link to today’s lesson:

We’ll be covering a lot in the first lesson, so we’ll mostly just be looking at the API with a hint at the major architecture fastai uses; next lecture we’ll go fully in-depth on the LSTM :slight_smile:

4 Likes

We’re live, sorry about that!

I am trying to replicate this notebook in a Kaggle Kernel on a different dataset, and the TextDataLoaders generation tends to run on the CPU even when the GPU is enabled. Is CPU the default for the text API?

Additionally, the kernel dies because the TextDataLoaders creation tries to use too much memory. Is there a way to limit memory and core usage in fastai2?
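For reference, a minimal sketch of what I’m running (the path and column names are just placeholders for my setup):

import pandas as pd
from fastai.text.all import *

df = pd.read_csv('../input/my-dataset/train.csv')  # placeholder path
dls = TextDataLoaders.from_df(df, text_col='text', label_col='label', valid_pct=0.2)
# ^ this is the step that pegs the CPU and eventually exhausts memory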

This question would be better suited for this thread: Fastai v2 text, as I do not know the answer, @hahmed988 :slight_smile:

How do you know which learning rates to use in your Full DataSet part?

It’s the default learning rate for fine_tune (which also works great; I believe I’ve spoken about it before).

You can find it here:
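Roughly, it looks like this (a simplified paraphrase of the fastai2 source; check your installed version for the exact code):

def fine_tune(self, epochs, base_lr=2e-3, freeze_epochs=1, lr_mult=100, **kwargs):
    self.freeze()                       # train only the head first
    self.fit_one_cycle(freeze_epochs, slice(base_lr), **kwargs)
    base_lr /= 2
    self.unfreeze()                     # then train the whole model
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), **kwargs)

So the frozen epoch runs at base_lr = 2e-3, and the unfrozen epochs use a discriminative slice ending at half of that.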

Thanks a lot @muellerzr! It was again another awesome lecture!! :100: :books:

Some questions from my side:

What does seq_len mean? I have checked, and the original movie reviews all have different word counts. How does the model handle this?

Where can we specify the vocab size in the code? In this example, data_lm.vocab is 7080.
Nevertheless, I noticed that when running data_lm.o2i.items() there are only 7050 items (not 7080!); the rest of the words are mapped to the unknown token. I find this very weird.

At some point we do learn.load_encoder('fine_tuned_enc'); what exactly are we re-using? Just the learned embeddings for our words, or also some part of the LSTM that was used to predict the next word in the sentence?

Finally… what does the decoder look like? i.e. the model that does the actual classification.

1 Like

Great questions @mgloria :slight_smile: I’ll be going much more in depth on these on Wednesday. I wanted to today, but didn’t have enough time to prepare.

We pad the sequences so they all fit :slight_smile: It is indeed 72; check the raw tensor values :wink:

For instance, from our TextBlock we can specify it as max_vocab, which defaults to 60000 (though the result can be smaller). Not sure about the 30 fewer tokens.

Think of it as all but the last layer in our original language model, exactly the same way we transfer via our resnet34, all but that last layer :slight_smile:
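In code the round trip looks something like this (learn_lm, learn_clas, and dls_clas are illustrative names; 'fine_tuned_enc' is the one from the notebook):

# after fine-tuning the language model:
learn_lm.save_encoder('fine_tuned_enc')    # saves everything but the LM's decoder layer

# later, on the classifier built over the same vocab:
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('fine_tuned_enc')  # reuses the embeddings + AWD-LSTM stack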

I’ll go into that a bit more this week, great question! The classifier is a PoolingLinearClassifier similar to our head from our vision models, but specific for language models :slight_smile:

You can find it here:

And for the “how do we build the classifier itself”, its full code is here:

You can see the encoder is our language model (arch), whose weights we then load in, and the PoolingLinearClassifier is our “head”.
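The key trick in that head is concat pooling. Stripped of the padding-mask handling fastai actually does, the idea is roughly:

import torch

def concat_pool(outputs):
    # outputs: (batch, seq_len, hidden) from the LSTM encoder
    last = outputs[:, -1]                      # hidden state at the final time step
    mx = outputs.max(dim=1).values             # max-pool over the sequence
    avg = outputs.mean(dim=1)                  # mean-pool over the sequence
    return torch.cat([last, mx, avg], dim=1)   # (batch, 3*hidden), fed to the linear layers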

1 Like

I see…! This clarifies a few things. :smile: Thanks a lot @muellerzr!

How can I check the raw tensor values? I was looking for something like dls.train_ds or similar, but I cannot find how to get the actual data back. Moreover, looking at show_batch, it seems to me as if the original reviews have been cut (maybe into sequences of length 72); is this correct?

Regarding max_vocab, I do not see where we are specifying it in the code. Shouldn’t it take the default value then, i.e. 60000 terms?

You can do one_batch() or next(iter(dl))

Yes, a maximum of 60,000 if we can reach it, but here we only need those 7,000 or so. And you specify it in the call to TextBlock.from_df(max_vocab=x).
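Putting both together (batch size and column name here are illustrative):

xb, yb = data_lm.one_batch()
xb.shape  # e.g. torch.Size([64, 72]) -- the 72 is seq_len; yb is xb shifted by one token

block = TextBlock.from_df('text', max_vocab=10000)  # caps the vocab when building the DataBlock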

1 Like

The default here is 2e-3, whereas in the notebook you are using learn.fit_one_cycle(1, 2e-2); where do these come from?

Also, I can’t understand how you arrived at the following learning rates:

learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))

learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

Many thanks for this great course!

Those were based on Jeremy’s notebook, and it’s a general rule of thumb for this type of learning :slight_smile:
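For some intuition on the 2.6**4: in the ULMFiT paper, discriminative fine-tuning gives each earlier layer group the next group’s learning rate divided by 2.6, and with roughly four layer groups below the head that works out to a span of 2.6**4 between the two ends of the slice. A sketch (learn being the text classifier Learner):

base_lr = 1e-2
learn.fit_one_cycle(1, slice(base_lr / 2.6**4, base_lr))
# fastai spreads the learning rates across the layer groups between the two
# slice endpoints, so earlier (more general) layers change more slowly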

1 Like

Re: tonight’s lecture, I’ll post a recording tonight/tomorrow morning. My internet is far too unstable for live streaming :frowning:

3 Likes

My best guess (before I have a chance to dig into the nb and supporting code) at the difference between data_lm.vocab being 7080 and data_lm.o2i.items() being 7050, with the rest of the words mapped to the unk token, is this: typically, when a word occurs fewer than a certain number of times (say 5) in the text, it is mapped to unk. That is probably what is going on here, but I will need to dig through the nb + code to confirm.
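The idea in code would be something like this (a sketch of the concept, not the actual fastai implementation; tokenized_docs is a placeholder):

from collections import Counter

counts = Counter(tok for doc in tokenized_docs for tok in doc)
vocab = [tok for tok, c in counts.most_common(60000) if c >= 3]  # fastai's min_freq defaults to 3
# anything outside vocab numericalizes to xxunk

If I recall correctly, fastai2 also pads the vocab up to a multiple of 8 (with repeated xxfake tokens) for performance reasons, and 7080 is a multiple of 8, so that padding may be part of the 7080 vs 7050 gap as well.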

2 Likes

What is a tokenizer?

Is data_lm.o2i.items() generated with the help of the spaCy tokenizer?
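From what I understand so far (an illustrative example; spaCy is fastai’s default word tokenizer):

import spacy

nlp = spacy.blank('en')  # bare English tokenizer, no pipeline components
print([t.text for t in nlp("The movie was amazing!")])
# -> ['The', 'movie', 'was', 'amazing', '!']
# fastai2 wraps this (WordTokenizer -> SpacyTokenizer) and adds rules that emit
# special tokens like xxbos and xxmaj before numericalization builds the vocab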

@muellerzr Hope things are well at your end! Just wondering when the next lecture is?

2 Likes

@muellerzr just wondering when the next lecture will be?

I’m afraid there won’t be one. After starting this, I realized my knowledge of NLP needs a bit more work before I dive into it, and my time constraints limit this. I’d recommend Rachel’s course instead; perhaps a few of you following this could combine your efforts and try to recreate the notebooks in fastai2! Apologies :slight_smile:

6 Likes

@muellerzr I am looking for an NLP classification model for the Italian language. Any suggestions?