Part 2 Lesson 10 wiki

Please post your questions about the lesson here. This is a wiki post. Please add any resources or tips that are likely to be helpful to other students.

<<< Wiki: Lesson 9 | Wiki: Lesson 11 >>>

Lesson resources



Code Snippets

Downloading the data:

curl -OL
tar -xzf aclImdb.tgz

Timeline



  • (0:20:30) IMDB with fastai.text
  • (0:23:10) The standard format of text classification dataset
  • (0:28:08) Difference between tokens and words 1 - spaCy
  • (0:29:59) Pandas chunksize to deal with a large corpus
  • (0:32:38) {BOS} (beginning of sentence/stream) and {FLD} (field) tokens
  • (0:33:57) Run spaCy on multi-cores with proc_all_mp()
  • (0:35:40) Difference between tokens and words 2 - capturing the semantics of letter case and more
  • (0:38:05) Numericalise tokens - Python Counter() class
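The numericalisation step (0:38:05) builds a vocabulary from token frequencies with Python's `Counter`. A minimal sketch — the `max_vocab`/`min_freq` cutoffs mirror the notebook's defaults, but the function and token names here (`build_vocab`, `_unk_`, `_pad_`) are illustrative:

```python
from collections import Counter

def build_vocab(tokenized_docs, max_vocab=60000, min_freq=2):
    """Map the most frequent tokens to integer ids; everything else -> _unk_."""
    freq = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    itos.insert(0, '_pad_')
    itos.insert(0, '_unk_')  # id 0 = unknown token
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

docs = [['the', 'movie', 'was', 'great'], ['the', 'movie', 'was', 'bad']]
itos, stoi = build_vocab(docs, min_freq=1)
ids = [stoi.get(tok, 0) for tok in docs[0]]  # unknown tokens map to id 0
```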

Pre-trained Language Model - PreTraining

  • (0:42:16) Pre-trained language model
  • (0:47:13) Map IMDb index to wiki text index
  • (0:53:09) fastai documentation project
  • (0:58:24) Difference between pre-trained LM and embeddings 1 - word2vec
  • (1:01:25) The idea behind using average of embeddings for non-equivalent tokens
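The idea at (1:01:25): when a token in the target vocab has no counterpart in the pretrained (wikitext103) vocabulary, its embedding row is initialised to the mean of all pretrained rows rather than random noise. A sketch with numpy — the function name and shapes are illustrative:

```python
import numpy as np

def adapt_embeddings(pretrained_wgts, old_stoi, new_itos):
    """Build an embedding matrix for the new vocab, reusing a pretrained row
    where the token exists and the mean row where it does not."""
    row_mean = pretrained_wgts.mean(axis=0)
    emb_dim = pretrained_wgts.shape[1]
    new_wgts = np.zeros((len(new_itos), emb_dim), dtype=pretrained_wgts.dtype)
    for i, tok in enumerate(new_itos):
        j = old_stoi.get(tok, -1)
        new_wgts[i] = pretrained_wgts[j] if j >= 0 else row_mean
    return new_wgts
```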

Pre-trained Language Model - Training

  • (1:02:34) Dive into source code of LanguageModelLoader()
  • (1:09:55) Create a custom Learner and ModelData class
  • (1:20:35) Guidance to tune dropout in LM
  • (1:21:43) The reason to measure accuracy rather than cross-entropy loss in LM
  • (1:25:23) Guidance on reading papers vs. coding
  • (1:28:10) Tips to vary dropout for each layer
  • (1:28:44) Difference between pre-trained LM and embeddings 2 - Comparison of NLP and CV
  • (1:31:21) Accuracy vs cross entropy as a loss function
  • (1:33:37) Shuffle documents; Sort-ish to save computation
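The "sort-ish" trick at (1:33:37): shuffle the documents for randomness, then sort within large chunks by length so each batch pads to a similar size and wastes less computation. A sketch under those assumptions (the function name and the `bs * chunks` megabatch size are illustrative):

```python
import random

def sortish_order(lengths, bs=64, chunks=50):
    """Shuffle document indices, then sort each megabatch of bs*chunks docs
    by length, so batches have similar lengths but order stays random-ish."""
    idxs = list(range(len(lengths)))
    random.shuffle(idxs)
    megabatch = bs * chunks
    out = []
    for i in range(0, len(idxs), megabatch):
        chunk = idxs[i:i + megabatch]
        out.extend(sorted(chunk, key=lambda j: lengths[j], reverse=True))
    return out
```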

Paper: ULMFiT (FitLaM)

  • (1:44:00) Paper: ULMFiT - pre-trained LM
  • (1:49:09) New version of Cyclical Learning Rate
  • (1:51:34) Concat Pooling
  • (1:52:44) RNN encoder and MultiBatchRNN encoder - BPTT for text classification (BPT3C)
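Concat pooling (1:51:34) feeds the classifier head the last hidden state concatenated with the max-pool and mean-pool over all time steps. A numpy sketch for a single document (shapes and the function name are illustrative; the real model does this on batched tensors):

```python
import numpy as np

def concat_pool(hidden_states):
    """hidden_states: (seq_len, hidden_dim) RNN outputs for one document.
    Returns a (3 * hidden_dim,) vector: [last step, max-pool, mean-pool]."""
    last = hidden_states[-1]
    max_pool = hidden_states.max(axis=0)
    mean_pool = hidden_states.mean(axis=0)
    return np.concatenate([last, max_pool, mean_pool])
```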

Tricks to conduct ablation studies

  • (1:58:35) VNC and Google Fire Library
  • (2:05:10) SentencePiece; tokenizing into sub-word units

Could Jeremy speak about the use_clr argument in the fit function?


Usually, when we downsample, we increase the number of filters (the depth). When we're downsampling from 7×7 to 4×4 etc., why are we decreasing the number from 512 to 256?

Why not decrease the dimension in the SSD head? (Performance related?)


This was a useful post for me to learn what is happening.


How fast is this IMDB nb?

Won’t you have to set the seed in between trn_idx and val_idx? Or won’t the random values be different when you start the permutation which creates val_idx?

What do you mean how fast is it?

Where do we get the IMDB data? Or where can we find out how to get it?

Why would you not want to include a “header” in the .csv file for NLP?


The mentioned indices are different for train and val, so they won't affect each other.

The seed only sets where the generator starts. Calling np.random.permutation to get trn_idx advances the state of the random number generator, so val_idx ends up different.
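The point about the generator state advancing is easy to see directly: after a single seed, two permutation calls give different orderings, while re-seeding reproduces the first draw exactly.

```python
import numpy as np

np.random.seed(42)
trn_idx = np.random.permutation(10)   # first draw
val_idx = np.random.permutation(10)   # state has advanced: a different order

np.random.seed(42)
again = np.random.permutation(10)     # re-seeding reproduces trn_idx
```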

@blakewest, check the Part 1 notebooks; the second(?) lesson on NLP has the link.

Pretty sure there is a link in the lesson 4 notebook.

cd ~/fastai/courses/dl2/data
curl -OL
tar -xzf aclImdb.tgz

For language modeling, why is there a “labels” column? It isn’t even used in language modeling.


Where's the data for aclImdb?

When I'm working with NLP data, many times I come across data with foreign text/characters.

In your infinite experience, is it better to discard them or to keep them? (Are they worth keeping?)


Where is the GH repo where this notebook lives? For some reason I lost my place (sorry about this). It would be nice to have a resource link at the top.

… and chunksize is in lines? bits? bytes?
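To answer the last question: pandas' `chunksize` (for `pd.read_csv`) is counted in rows, not bits or bytes — each iteration yields a DataFrame of at most that many rows. A stdlib sketch of the same row-chunked iteration pattern (the helper name `read_in_chunks` is illustrative; pandas yields DataFrames rather than lists of rows):

```python
import csv
import io
from itertools import islice

def read_in_chunks(fileobj, chunksize):
    """Yield lists of rows, at most `chunksize` rows each — the same row-based
    chunking that pandas' chunksize parameter performs."""
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        yield chunk

data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
chunks = list(read_in_chunks(data, chunksize=2))  # 5 rows -> chunks of 2, 2, 1
```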