Lesson 2: The LSTM, More on Tokenizers, Ensembling
Lesson 3: Other State-of-the-Art NLP Models
Lesson 4: Multi-Lingual Data, DeViSe
Closing notes
This will be my first time live-streaming, so this will be an experiment for everyone, but I have high hopes that it will turn out to be a successful study group with your help! Please use this thread for any questions and for starting discussions about the material; we’re all learning fastai (and especially the second version) together! I will update this post with YouTube links to the livestream and post them in this thread as well. Looking forward to seeing everyone next month!!!
(Also, a minor PSA: this is in no way for any credit whatsoever. I am just an undergraduate student wanting to help others learn how to use this amazing library to its fullest potential. Instead of worrying about credit, try applying what you’ve learned in a project or two and some blog posts; in many cases this is much better evidence that you know the material than a slip of paper.)
We’ll be covering a lot in the first lesson, so we’ll mostly be looking at the API and a hint of the major architecture fastai uses; next lecture we’ll go fully in-depth on the LSTM.
I am trying to replicate this notebook in a Kaggle kernel on a different dataset, and the TextDataLoaders generation tends to run on the CPU even when the GPU is enabled. Is CPU the default setup for the text API?
Additionally, the kernel dies because the TextDataLoaders generation tries to use too much memory. Is there a way to limit memory and core usage in fastai2?
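One workaround I am experimenting with, in case it is relevant: tokenization always runs on the CPU (the GPU is only used once training starts), and fastai parallelizes it over `defaults.cpus` worker processes, so lowering that might cap core and memory usage. This is just a guess from skimming the source, and the path below is a placeholder:

```python
from fastai.text.all import *

# Tokenization is CPU-only; fastai fans it out over defaults.cpus worker
# processes, so lowering this may reduce peak memory in a small kernel.
defaults.cpus = 2

# Placeholder path -- substitute your own Kaggle dataset.
path = Path('../input/my-text-dataset')

# num_workers is a standard DataLoader argument and caps the workers
# used when serving batches during training.
dls = TextDataLoaders.from_folder(path, valid='test', num_workers=2)
```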
Thanks a lot @muellerzr! It was again another awesome lecture!!
Some questions from my side:
What does seq_len mean? I have checked, and the original movie reviews all have different lengths in words. How does the model handle this?
Where can we specify the vocab size in the code? In this example data_lm.vocab is 7080
Nevertheless, I noticed that when running data_lm.o2i.items() there are only 7050 items (not 7080!); the rest of the words are mapped to the unknown token. I find this very weird.
At some point we do learn.load_encoder('fine_tuned_enc'); what exactly are we re-using? The learned embeddings for our words, or also some part of the LSTM that was used to predict the next word in the sentence?
Finally… what does the decoder look like? I.e., the model that does the actual classification.
Great questions @mgloria, I’ll be going much more in depth on these on Wednesday. I wanted to today but didn’t have enough time to prepare for that.
We pad the sequences so they all fit. It is indeed 72; check the raw tensor values.
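As a rough sketch of that padding in plain Python (pad_id=1 mirrors fastai’s default vocab, where xxpad sits at index 1):

```python
# Minimal sketch of batch padding: every sequence in a batch is padded
# to the length of the longest one so they can stack into one tensor.
# pad_id=1 mirrors fastai's default, where xxpad sits at index 1.
def pad_batch(seqs, pad_id=1):
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]

batch = pad_batch([[2, 10, 11], [2, 10, 11, 12, 13]])
# Both rows now have length 5; the short one ends in two pad ids.
```

In the actual notebook, `x, y = dls.one_batch()` returns the padded tensor, so you can inspect its shape and values directly.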
For instance, from our TextBlock we can specify it as max_vocab, which defaults to 60000 (though it can end up less). Not sure about the 30 missing tokens.
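For reference, a hedged sketch of where that knob lives (the path and splitter here are placeholders, not the lesson’s exact code):

```python
from fastai.text.all import *

path = Path('path/to/texts')  # placeholder

# TextBlock exposes both knobs: max_vocab caps the vocabulary size
# (default 60000) and min_freq drops tokens seen too rarely (default 3).
dblock = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True, max_vocab=60000, min_freq=3),
    get_items=get_text_files,
    splitter=RandomSplitter(valid_pct=0.1),
)
dls = dblock.dataloaders(path, bs=64)
```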
Think of it as all but the last layer in our original language model, exactly the same way we transfer via our resnet34: everything but that last layer.
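A hedged sketch of that transfer, assuming the `dls_lm` and `dls_clas` loaders built earlier in the notebook:

```python
from fastai.text.all import *

# dls_lm / dls_clas are assumed to be the language-model and classifier
# DataLoaders built earlier in the lesson notebook.
learn_lm = language_model_learner(dls_lm, AWD_LSTM)
learn_lm.fine_tune(1)
# save_encoder stores everything except the LM's final decoder layer...
learn_lm.save_encoder('fine_tuned_enc')

# ...and load_encoder drops those weights into the classifier's AWD-LSTM
# body -- the same pattern as reusing a resnet34 body in vision.
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM)
learn_clas.load_encoder('fine_tuned_enc')
```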
I’ll go into that a bit more this week, great question! The classifier is a PoolingLinearClassifier, similar to the head from our vision models but specific to language models.
You can find it here:
And for the “how do we build the classifier itself”, its full code is here:
You can see the encoder is our language model (arch), whose weights we then load in, and the PoolingLinearClassifier is our “head”.
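As a toy illustration of the concat pooling that head starts with (plain Python, not the real implementation): it concatenates the last hidden state with a max pool and an average pool over the whole sequence before the linear layers.

```python
# Rough sketch of "concat pooling": for a sequence of hidden states,
# concatenate the last state, the per-dimension max over time, and the
# per-dimension mean over time, giving a 3 * hidden_size feature vector.
def concat_pool(hidden_states):
    n = len(hidden_states[0])  # hidden size
    last = hidden_states[-1]
    mx = [max(h[i] for h in hidden_states) for i in range(n)]
    avg = [sum(h[i] for h in hidden_states) / len(hidden_states) for i in range(n)]
    return last + mx + avg

feats = concat_pool([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]])
# last=[2.0, 4.0], max=[3.0, 4.0], mean=[2.0, 2.0]
```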
I see…! This clarifies a few things. Thanks a lot @muellerzr!
How can I check the raw tensor values? I was looking for something like dls.train_ds or similar, but I cannot find how to get the actual data back. Moreover, looking at show_batch it seems to me as if the original reviews had been cut (maybe into sequences of length 72), is this correct?
Regarding max_vocab, I do not see where we are specifying it in the code. Shouldn’t it take the default value then, i.e. 60000 terms?
My best guess (before I have a chance to dig into the notebook and supporting code) at the difference between data_lm.vocab being 7080 and data_lm.o2i.items() being 7050, with the rest of the words mapped to the unk token, is this: typically, when a word occurs fewer than a certain number of times (say 5) in the text, it is mapped to unk. That is probably what is going on here, but I will need to dig through the notebook and code to confirm.
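To illustrate the mechanism I’m guessing at, a toy plain-Python version (fastai’s real vocab also reserves other special tokens, and its min_freq default is 3):

```python
from collections import Counter

# Toy version of the min_freq idea: tokens seen fewer than min_freq
# times never make it into the index, so lookups for them fall back to
# the unknown token xxunk at index 0.
def make_vocab(tokens, min_freq=2):
    counts = Counter(tokens)
    itos = ['xxunk'] + [t for t, c in counts.most_common() if c >= min_freq]
    o2i = {t: i for i, t in enumerate(itos)}
    return itos, o2i

itos, o2i = make_vocab(['movie', 'movie', 'great', 'great', 'obscureword'])
o2i.get('obscureword', o2i['xxunk'])  # rare word falls back to xxunk -> 0
```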
I’m afraid there won’t be one; after starting this I realized my knowledge of NLP needs a bit more work before I dive into this, and my time constraints limit it. I’d recommend Rachel’s course instead; perhaps a few of you following this could combine your efforts to try to recreate the notebooks in fastai2! Apologies.