Part 2 Lesson 10 wiki

(Jeremy Howard) #1

Please post your questions about the lesson here. This is a wiki post. Please add any resources or tips that are likely to be helpful to other students.

<<< Wiki: Lesson 9 | Wiki: Lesson 11 >>>

Lesson resources



Code Snippets

Downloading the data:

curl -OL
tar -xzf aclImdb.tgz

Timeline



  • (0:20:30) IMDB with fastai.text
  • (0:23:10) The standard format of text classification dataset
  • (0:28:08) Difference between tokens and words 1 - spaCy
  • (0:29:59) Pandas chunksize to deal with a large corpus
  • (0:32:38) {BOS} (beginning of sentence/stream) and {FLD} (field) tokens
  • (0:33:57) Run spaCy on multi-cores with proc_all_mp()
  • (0:35:40) Difference between tokens and words 2 - capturing the semantics of letter case and more
  • (0:38:05) Numericalise tokens - Python Counter() class
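The numericalization step at (0:38:05) can be sketched with Python's `Counter`. The corpus and the `_unk_` token below are toy examples for illustration, not the notebook's actual code:

```python
from collections import Counter

# Toy tokenized corpus (illustrative, not the IMDb data)
tokenized = [["the", "movie", "was", "great"],
             ["the", "plot", "was", "thin"]]

# Count token frequencies across the whole corpus
freq = Counter(tok for doc in tokenized for tok in doc)

# Vocab: most frequent tokens first, with an unknown token at index 0
itos = ["_unk_"] + [tok for tok, _ in freq.most_common()]
stoi = {tok: i for i, tok in enumerate(itos)}

# Numericalize: each token becomes its vocab index (unknown -> 0)
ids = [[stoi.get(tok, 0) for tok in doc] for doc in tokenized]
```

Keeping the vocab sorted by frequency makes it easy to cap its size later by simply truncating `itos`.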

Pre-trained Language Model - PreTraining

  • (0:42:16) Pre-trained language model
  • (0:47:13) Map IMDb index to wiki text index
  • (0:53:09) fastai documentation project
  • (0:58:24) Difference between pre-trained LM and embeddings 1 - word2vec
  • (1:01:25) The idea behind using average of embeddings for non-equivalent tokens
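The idea at (1:01:25) — initializing embedding rows for tokens the pretrained model never saw with the average of all pretrained rows — can be sketched in NumPy. The vocabs and matrix here are tiny stand-ins, not the real wikitext-103 weights:

```python
import numpy as np

# Toy pretrained embeddings and vocab (stand-ins for wikitext-103)
rng = np.random.default_rng(0)
wiki_itos = ["the", "a", "movie"]
wiki_w = rng.normal(size=(len(wiki_itos), 4))
wiki_stoi = {t: i for i, t in enumerate(wiki_itos)}

# Target (IMDb-like) vocab contains a token the pretrained model never saw
imdb_itos = ["the", "movie", "zzzfilm"]

# Copy rows for known tokens; unseen tokens get the mean of all
# pretrained rows, per the discussion at (1:01:25)
row_mean = wiki_w.mean(axis=0)
new_w = np.stack([wiki_w[wiki_stoi[t]] if t in wiki_stoi else row_mean
                  for t in imdb_itos])
```

The mean row is a neutral starting point: unseen tokens begin near the "center" of the embedding space and get refined during fine-tuning.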

Pre-trained Language Model - Training

  • (1:02:34) Dive into source code of LanguageModelLoader()
  • (1:09:55) Create a custom Learner and ModelData class
  • (1:20:35) Guidance on tuning dropout in the LM
  • (1:21:43) The reason to measure accuracy rather than cross-entropy loss in an LM
  • (1:25:23) Guidance on reading papers vs. coding
  • (1:28:10) Tips on varying dropout for each layer
  • (1:28:44) Difference between pre-trained LM and embeddings 2 - Comparison of NLP and CV
  • (1:31:21) Accuracy vs cross entropy as a loss function
  • (1:33:37) Shuffle documents; Sort-ish to save computation
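The sort-ish trick at (1:33:37) can be sketched as: shuffle the documents first, then sort within large chunks by length, so each batch holds similarly sized sequences (less padding) while batch contents still vary between epochs. A toy sketch, not a copy of fastai's `SortishSampler`:

```python
import random

# Sort-ish ordering: epoch-level shuffle, then length-sort inside
# chunks of bs * chunk_mult documents.
def sortish_order(lengths, bs, chunk_mult=50, seed=42):
    rng = random.Random(seed)
    idxs = list(range(len(lengths)))
    rng.shuffle(idxs)                # shuffle documents each epoch
    chunk = bs * chunk_mult          # sort only within chunks this big
    out = []
    for i in range(0, len(idxs), chunk):
        out.extend(sorted(idxs[i:i + chunk], key=lambda j: lengths[j]))
    return out
```

Consecutive batches drawn from this order contain documents of similar length, so little computation is wasted padding short sequences up to long ones.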

Paper ULMFiT (FitLaM)

  • (1:44:00) Paper: ULMFiT - pre-trained LM
  • (1:49:09) New version of Cyclical Learning Rate
  • (1:51:34) Concat Pooling
  • (1:52:44) RNN encoder and MultiBatchRNN encoder - BPTT for text classification (BPT3C)
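Concat pooling (1:51:34) classifies from the final hidden state concatenated with max- and mean-pooled hidden states over all time steps. A NumPy sketch of the shapes (the actual fastai implementation is PyTorch, and the tensors here are random toys):

```python
import numpy as np

# Fake RNN outputs: one hidden vector per time step, per batch element
seq_len, bs, hid = 10, 2, 6
rng = np.random.default_rng(0)
outputs = rng.normal(size=(seq_len, bs, hid))

last = outputs[-1]            # (bs, hid) final hidden state
maxp = outputs.max(axis=0)    # (bs, hid) max over time
avgp = outputs.mean(axis=0)   # (bs, hid) mean over time

# Concat pooling: classifier sees all three views at once
pooled = np.concatenate([last, maxp, avgp], axis=1)  # (bs, 3 * hid)
```

The classifier head then takes a `3 * hid`-wide input instead of relying on the last state alone, which helps when the decisive words appear early in a long document.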

Tricks to conduct ablation studies

  • (1:58:35) VNC and Google Fire Library
  • (2:05:10) SentencePiece; tokenizing sub-word units
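The Google Fire pattern at (1:58:35): Fire turns any Python function into a command-line script, which makes ablation runs easy to launch with different flags. The `train` function and its parameters below are illustrative, not the course's actual script:

```python
# With Google Fire (pip install fire), any function becomes a CLI:
#
#     import fire
#     if __name__ == "__main__":
#         fire.Fire(train)
#
# so ablation runs can be launched as, e.g.:
#     python train.py --lr 0.01 --wd 1e-6 --use_pretrain False

def train(lr=1e-3, wd=1e-7, use_pretrain=True):
    """Hypothetical training entry point; returns its config for clarity."""
    return f"lr={lr} wd={wd} pretrain={use_pretrain}"
```

Fire maps keyword arguments to command-line flags automatically, so each ablation is just a different invocation rather than a code edit.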


(Phani Srikanth) #16

Could Jeremy speak about use_clr argument in the fit function?

(YangLu) #21

Usually, when we downsample, we increase the number of filters (the depth). When we're downsampling from 7×7 to 4×4 etc., why are we decreasing the number from 512 to 256?

Why not decrease the dimension in the SSD head? (Is it performance related?)

(Kevin Bird) #22

This was a useful post for me to learn what is happening.

(Vikrant Behal) #24

How fast is this IMDB nb?

(Kaitlin Duck Sherwood) #25

Won’t you have to set the seed in between trn_idx and val_idx? Or won’t the random values be different when you start the permutation which creates val_idx?

(Kevin Bird) #26

What do you mean how fast is it?

(blake west) #27

Where do we get the IMDB data? Or where can we find out how to get it?

(WG) #28

Why would you not want to include a “header” in the .csv file for NLP?

(Mandar Deshpande) #29

The indexes mentioned are different for train and val, so they won't affect each other.

(William Horton) #30

The seed is where it starts out. Calling np.random.permutation to get trn_idx changes the state of the random number generator, so val_idx ends up different.
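This can be checked directly with a small NumPy demo: the first permutation advances the generator state, so the second draw differs; re-seeding in between would instead make the draws identical.

```python
import numpy as np

np.random.seed(42)
trn_idx = np.random.permutation(1000)  # this draw advances the RNG state...
val_idx = np.random.permutation(1000)  # ...so this one comes out different

assert not np.array_equal(trn_idx, val_idx)

# Re-seeding between the calls would make the two draws identical:
np.random.seed(42)
a = np.random.permutation(1000)
np.random.seed(42)
b = np.random.permutation(1000)
assert np.array_equal(a, b)
```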

(Maureen Metzger) #31

@blakewest, check Part 1 notebooks, second(?) lesson on NLP has link

(Kevin Bird) #32

Pretty sure there is a link in lesson 4 nb

(Emil) #33
cd ~/fastai/courses/dl2/data
curl -OL
tar -xzf aclImdb.tgz

(WG) #34

For language modeling, why is there a “labels” column? It isn’t even used in language modeling.

(Gerardo Garcia) #35

Where's the data for aclImdb?

(YangLu) #36

When I'm working with NLP data, many times I come across data with foreign text/characters.

In your infinite experience, is it better to discard them or to keep them? (Are they worth keeping?)

(Erin Pangilinan) #37

Where is the GitHub repo this is in? I lost my place for some reason (sorry about this). A resource link at the top would be helpful.

(YangLu) #38

… and is chunksize in lines? Bits? Bytes?