Wiki: Lesson 4

I am following the exact steps of lesson4-imdb.ipynb using the dataset from here:

The first time I do pickle dump:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))
it returns this error:
PicklingError: Can’t pickle <cyfunction load.. at 0x7f1c88c7c048>: it’s not found as spacy.vocab.lambda

I deleted TEXT.pkl and tried again, but the kernel died

The machine used at paperspace: P5000 Ubuntu 16.04
Python version 3.6.4
spaCy version 1.9.0
Location /home/paperspace/anaconda3/envs/fastai/lib/python3.6/site-packages/spacy
Platform Linux-4.4.0-104-generic-x86_64-with-debian-stretch-sid
Installed models en

Updating spacy to version 2.0.11 and reinstalling the english model fixed it
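
For anyone wondering why the error mentions a lambda: pickle stores functions by their importable name, and a lambda has none. The sketch below reproduces that class of failure with the standard library only. It is an illustration under assumptions, not the actual spaCy code path: the dictionaries are made up, and the idea that TEXT held a reference to such a lambda through the spaCy vocab is only inferred from the error message above.

```python
import pickle

# A lambda has no importable name, so pickle cannot record where to find it.
unpicklable = {"measure": lambda s: len(s)}

try:
    pickle.dumps(unpicklable)
    pickled_ok = True
except (pickle.PicklingError, AttributeError) as err:
    pickled_ok = False
    print(f"pickle failed: {err}")

# A named function (here the builtin len) is stored by reference and pickles fine.
restored = pickle.loads(pickle.dumps({"measure": len}))
print(restored["measure"]("abc"))
```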

In the video at 41:29, he uses the rule of logs, log(y_pred/y) = log(y_pred) - log(y), so if we predict the log of the values, the RMSE (used in net training) becomes the RMSPE. But the RMSPE is based on (y_pred - y)/y: there is a -y in the numerator, which cannot simply be dropped when working with logarithms. Why do we take the log of our target then?

It seems like lesson 4-IMDB doesn’t work properly anymore.

I understand that spaCy has been updated, so instead of passing spacy_tok as our tokenizer when creating the TEXT object, we use TEXT = data.Field(lower=True, tokenize="spacy")

Later on in the Test section though, the following code no longer works.
ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
s = [spacy_tok(ss)]
' '.join(s[0])

The way I fixed this is…

#s = [spacy_tok(ss)] replaced with
s = [TEXT.tokenize(ss)]

However, I am unfamiliar with the various libraries and not confident that this is the correct fix.

Check your understanding of lesson 4

(original post in Portuguese)

Hi guys,

I watched the lesson 4 (part 1) video again to get the whole picture, and I took notes on the vocabulary used by @jeremy.

Let's play! OK? :wink:
Can you give a definition, a URL or an explanation for all of the following terms and expressions?

If yes, you are done with the 4th lesson!!! :sunglasses: :sunglasses: :sunglasses:

PS: if you do not want to test yourself, or you want to check your answers, go to the blog post "Deep Learning 2: Part 1 Lesson 4" by @hiromi: "great work!!! :slight_smile:"

  • blog posts from fastai students mainly about how to get the best learning rate
    ** learning rate
    ** Stochastic Gradient Descent with Restart (SGDR)
    ** Differential Learning Rate
    ** (from @bushaev) the difficulty in minimizing the loss arises from saddle points rather than poor local minima
  • computer vision
  • lesson 4 is about 3 concepts :
    ** structured (data) learning (financial, …)
    ** NLP : Natural Language Processing
    ** Collaborative filtering
  • jupyter notebook : we used dropout in DogBreeds Kaggle competition
  • an activation is a number which has been calculated
  • kernel or filter : same thing
  • a fully connected layer (Linear layer) does a simple matrix multiplication
  • type 'learn' in your jupyter notebook to see the last added layers of your learn object (your new model built from the pretrained model, which is here a ConvNet) : by default, the fastai library adds 2 linear layers (fully connected layers), each with the following structure : BatchNormalization, Dropout, Linear, ReLU (and the last layer outputs LogSoftmax)
  • dropout
    ** ps = 0.5 means delete each activation with a probability of 50% (at random)
    ** therefore, at each mini batch, we throw away different activations
    ** what does dropout do ? it forces the network to generalize better (it forces the network to learn the main characteristics) and thus avoids overfitting (over-specialization of your network)
    ** dropout was created around 2013 by Geoffrey Hinton and his team (inspired by how the brain works) : it solves the problem of generalization, but why was dropout inspired by the brain ? (answer :
    ** a high ps value helps the model generalize well but decreases your training accuracy
    ** a low ps value generalizes less well but gives you a better training accuracy
    ** why, in early training, is my validation loss better (lower) than my training loss ? The reason is that dropout is turned off when we evaluate on the validation set.
    ** when we set ps=0.5, pytorch deletes each activation with a probability of 50% and doubles all the others : thus, the average activation does not change
    ** ps=0.5 adds dropout to all the added last layers, but not to the first layers (the pretrained layers)
    ** ps=0 removes dropout : in the first epoch, my validation accuracy is better (higher) than with a non-zero ps, but by the third epoch it is the opposite : dropout led to better generalization
    ** ps=0 : massive overfitting from the third epoch !
    ** if you don't pass a ps value to the learn object, there is a default ps value for each last layer !
    ** ps=[p1,p2] lets you set a different dropout for each last layer (idea : the last dropout, the one before the last Linear layer, can be higher than the others)
    ** in effect, dropout adds noise
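
The "delete with probability ps and double the others" point can be checked numerically. Below is a minimal sketch in plain Python (not the PyTorch implementation; the function name `dropout` and the toy all-ones activations are assumptions made for illustration). Survivors are scaled by 1/(1-p), which is a doubling when p=0.5, so the mean activation stays roughly the same.

```python
import random

def dropout(activations, p, rng):
    """Zero each activation with probability p and scale the survivors by 1/(1-p)."""
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 100_000          # toy layer output, all ones
dropped = dropout(activations, p=0.5, rng=rng)

survivors = [a for a in dropped if a != 0.0]
mean_before = sum(activations) / len(activations)
mean_after = sum(dropped) / len(dropped)

print(f"kept {len(survivors) / len(dropped):.1%} of activations")
print(f"each survivor is now {survivors[0]}")
print(f"mean before {mean_before:.3f}, mean after {mean_after:.3f}")
```

At validation time nothing is dropped and nothing is rescaled, which is why the validation loss can look better than the training loss early on.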
  • add layers
    ** by default, the fastai learn object adds 2 Linear layers at the end of the network, but you can change the number of last Linear layers and the number of activations of each one by using xtra_fc=[x1,x2,...]
    ** therefore, an empty list xtra_fc=[] creates just one last Linear layer
  • overfitting
    ** bigger (deeper) models need bigger dropouts, as deeper models tend to overfit more (specialization)
    ** overfitting ? the training loss is much lower than the validation loss
    ** a little overfitting is not a problem, but you need to watch your validation loss and get it as low as possible
  • Jupyter notebook : Rossmann Store
    ** 2 kinds of columns : categorical (with a number of levels) or continuous (a numeric value)
    ** no feature engineering
    ** a categorical variable tells the neural network that each level means something totally different (the number of levels is the cardinality)
    ** a continuous variable comes with a kind of mathematical function
    ** a continuous variable with a low number of levels (low cardinality) : better to treat it as a categorical variable
    ** DayOfWeek is a number, but it is better to treat it as a categorical variable, as each day has a different meaning depending on the context
    ** so a categorical variable stays categorical, but for a continuous variable you have to decide
    ** a continuous variable holds floating-point numbers
    ** one remark : you can group the values of a continuous variable into intervals and thus turn it into a categorical variable (better or not ? there is a paper on that)
  • if pytorch finds in the dataset a new category of a categorical variable (category in pytorch) that was not defined, it is automatically added as an unknown category
  • pandas + Machine Learning course
  • clean the data and do some feature engineering
  • then, convert columns to categorical or continuous variables (float32)
  • idxs = get_cv_idxs() to get a sample (validation set)
  • df, y, nas, mapper = proc_df(joined_samp, 'Sales', do_scale=True) : separates the dependent variable (Sales) from the inputs and normalizes the continuous columns (do_scale=True)
    ** Neural Nets like to have input data with mean = 0 and standard deviation = 1
    ** mapper stores the normalization parameters, so that the same transformation is applied to the test set later (in the same way, the category codes must stay the same, for example 2014 mapped to 2, a mapped to 1, etc.)
  • validation set : read the fastai blog post about that
  • what is the metric ? here, RMSPE (Root Mean Square Percentage Error)
  • check the log rules : log (a/b) = log(a) - log(b)
  • md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl, cat_flds=cat_vars, bs=128) : it is important to pass cat_flds to tell our model which variables are categorical
  • m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars) (number of continuous variables), dropout of the embedding matrix, prediction of a single number (sales), how many activations in each last layer, dropout at last layers, y_range=y_range)
  • rank 1 tensor = vector
  • embedding matrix
    ** Jeremy draws a NN with a vector of continuous data as input, but what about categorical variables ? in fact, we create an embedding matrix for each categorical variable
    ** It means that this embedding matrix is the first layer of our NN : it is just another bunch of weights of our NN
    ** the embedding weights mean something, but it is the training that will reveal it
    ** size of each embedding vector ? take the cardinality, divide by 2, and cap it at 50
    ** 1 hot encoding : the problem is that a categorical variable is not a linear concept
    ** an embedding vector is a distributed representation (rich representation, high-dimensional concept)
  • when a categorical variable has a high cardinality, you should package it
  • all modern DL libraries take the index and return the embedding vector, but this is equivalent to a matrix product between the 1 hot encoded vector and the embedding matrix
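
Two of the embedding points above can be sketched in plain Python: the size rule of thumb, and the equivalence between an index lookup and a 1 hot matrix product. This is an illustration with made-up random weights, not fastai or PyTorch code. The helper names are hypothetical, and rounding the half-cardinality up with (cardinality + 1) // 2 is an assumption about how "divide by 2" is applied.

```python
import random

def embedding_size(cardinality):
    """Rule of thumb from the notes: half the cardinality, capped at 50."""
    return min(50, (cardinality + 1) // 2)

rng = random.Random(0)
cardinality = 7                       # e.g. DayOfWeek has 7 levels
dim = embedding_size(cardinality)
# One row of weights per category level (randomly initialised, then trained).
emb = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(cardinality)]

def lookup(index):
    """What DL libraries do: return the row at `index`."""
    return emb[index]

def one_hot_matmul(index):
    """Equivalent but wasteful: 1 hot row vector times the embedding matrix."""
    one_hot = [1.0 if i == index else 0.0 for i in range(cardinality)]
    return [sum(one_hot[i] * emb[i][j] for i in range(cardinality)) for j in range(dim)]

sunday = 6
print(dim, lookup(sunday) == one_hot_matmul(sunday))
```

The lookup avoids building the 1 hot vector and multiplying by all those zeros, which matters once the cardinality is in the thousands.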
  • add_datepart() : creates time series columns from the name of the date column (DayOfWeek, etc.)
  • after that, you can run your NN like the DogsCats one
  • Pinterest : when they switched from gradient boosting machines to Deep Learning, they did much less feature engineering : it is one of the biggest benefits of using DL
  • data augmentation in structured data : jeremy does not know of a technique
  • what is the downside of using Deep Learning on structured data (structured Deep Learning model) ? few people work on it and before fastai, there was no easy way to code it (this class is the first time)
    ** Pinterest has an O'Reilly video on that
    ** 2 academic papers : one from Yoshua Bengio (taxi destination forecast) and one on the kaggle Rossmann competition
  • Natural Language Processing
    ** NLP (Natural Language Processing) is the most up-and-coming area
    ** it is 2 or 3 years behind computer vision in Deep Learning
    ** it is kind of the second area in which DL started to be popular
    ** DL in computer vision became state-of-the-art in 2014/2015
    ** NLP is still in a state where the concepts and the use of DL are less mature than in computer vision
    ** in the last few months, models and ideas used in computer vision have started to be tested in NLP. Therefore, it is quite new
  • Language modeling (or language model)
    ** We are going to talk about a particular problem in NLP : language modeling
    ** language modeling : build a model that, given a few words of a sentence, can predict what the next word is going to be (SwiftKey on mobile, for example)
    ** : very popular site on DL papers
    ** Jeremy downloaded from this site 18 months of academic papers (topics, titles and summaries)
    ** Then, he built an algorithm to predict the rest of a phrase from the first words
    ** the algorithm did not know English : it starts with an embedding vector for each word
    ** csn = computer science networking
    ** creating a language model has many consequences : it learns to write correct English, it learns to open and close parentheses
    ** we are creating a Language Model because we want a pretrained model that we can then use on IMDB movie reviews to detect positive and negative sentiment (it does not work well to directly create a model to detect a positive or negative review)
    ** Why ? for several reasons :
    ** 1) fine-tuning a pretrained network works very well : the network has detected many characteristics of a "world" (it "understood" the world) and we can then use it for another related task
    ** 2) IMDB movie reviews are long (1000 words), so reading 1000 words while knowing nothing about how English is structured, or even what a word or punctuation is, and then trying to output a positive or negative sentiment (1 or 0) is just too much to expect
  • Usually, a Language Model works at the word level
    ** you can create your American Novel generator :slight_smile:
    ** we focus here on text classification, which is very powerful (a hedge fund classifying news that has an impact, customer service detecting people who will cancel their contracts in the next month)
  • lesson4-imdb.ipynb
    ** new things we import : torchtext (NLP library)
    ** after library importation and PATH setup, we tokenize the training dataset (words, punctuations) using spacy (the best tokenizer)
    ** ' '.join(spacy_tok(review[0])) : puts a space between 2 tokens
    ** with torchtext, we are going to create a torchtext field, which describes how to pre-process a piece of text, like making everything lowercase and tokenizing it : TEXT = data.Field(lower=True, tokenize=spacy_tok)
  • We can now create our model :
    ** FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
    ** md = LanguageModelData(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)
  • min_freq : we will replace all words by indexes (each word has a unique index), but if a word occurs fewer than 10 times, just call it unknown
    ** bptt = backprop through time : it defines how many tokens of a sequence are put on the GPU at once (we break sentences into pieces of bptt tokens or fewer)
    ** Then, we add a very important attribute to our torchtext field : TEXT.vocab
    ** this is the vocabulary : the list of unique words in the corpus sorted by frequency, or the mapping to their indexes
    ** context is very important in NLP !
    ** note : bag of words is no longer useful because it does not keep the order of the words
  • bptt
    ** we concatenated all reviews into one file. Then, we split it into batches (64 for example). Then, we lay these 64 segments out as columns. Then, we grab bptt rows (70 tokens = 70 lines) of this matrix, which we give to our GPU.
    ** a batch in NLP is a batch of 64 segments, and each segment is a sequence of length bptt (70 here)
    ** next(iter(md.trn_dl)) gives you a batch like the ones used by your GPU
    ** warning : randomly, torchtext changes the bptt size around 70 for each batch
    ** the first column represents the first 75 words of the first review (bptt = 75)
    ** The second column represents the first 75 words of the second of the 64 segments (you have to go in like 10 millions words to find that one)
    ** The last column is the first 75 words of the last of those 64 segments
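
The column layout described above can be sketched in plain Python. This is a simplified illustration, not torchtext's or fastai's code: the helper names `batchify` and `get_batch` are hypothetical, the sizes are toy values (4 segments and bptt=3 instead of 64 and about 70), and the random variation of bptt is left out.

```python
def batchify(tokens, bs):
    """Split one long token stream into `bs` contiguous segments, one per column."""
    seg_len = len(tokens) // bs                  # drop the remainder
    segments = [tokens[i * seg_len:(i + 1) * seg_len] for i in range(bs)]
    # Row r holds the r-th token of every segment.
    return [[seg[r] for seg in segments] for r in range(seg_len)]

def get_batch(matrix, start, bptt):
    """Take `bptt` consecutive rows; the target is the same rows shifted by one token."""
    seq_len = min(bptt, len(matrix) - 1 - start)
    x = matrix[start:start + seq_len]
    y = matrix[start + 1:start + 1 + seq_len]
    return x, y

tokens = list(range(40))          # stand-in for the numericalized corpus
matrix = batchify(tokens, bs=4)   # 10 rows x 4 columns
x, y = get_batch(matrix, start=0, bptt=3)
print(x)   # 3 rows x 4 columns: the first 3 tokens of each segment
print(y)   # the same positions shifted by one: the "next word" targets
```

Each column is a contiguous stretch of the corpus, so the first column starts at the first review while the later columns start deep inside the concatenated text.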
  • Now, we have to create an embedding matrix with one row for each word of our corpus (1 word is a categorical variable)
    ** len(md.trn_dl) : number of blocks
    ** md.nt : number of tokens
    ** len(md.trn_ds) : length of dataset (1 here, because we put all reviews in 1 file which is our corpus)
    ** len(md.trn_ds[0].text) : number of words in our corpus
    ** we define em_sz (size of our embedding matrix : 200), 3 layers with 500 activations by layer
  • put Adam optimizer with opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
  • recent Language model : we use AWD LSTM from Stephen Merity (important about dropout)
  • another way to avoid overfitting : learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
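  • Gradient clipping : learner.clip=0.3 (caps the size of the gradients at 0.3, so that a single update cannot be too large; it does not limit the learning rate)
  • a technique for using a pretrained model for sentiment analysis that is more powerful than word2vec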
  • Here, we use a Recurrent Neural Network using LSTM (long short term memory)
  • NLP researchers :
    ** James Bradbury
    ** Sebastian Ruder
  • collaborative filtering (see lesson 5)

Is there any reasoning behind not creating one single embedding matrix for all the categorical variables? I thought this would help us find more correlations between the different variables as the network trains, and the weights of everything that is related would end up close to each other.

With ColumnarModelData, is there a way to also include a categorical variable where each observation can have multiple categories? For instance, if we have a dataset of sold items of clothing, aside from all the typical continuous and categorical variables, what if we have a column called, say, Material? In this column, each observation (row) can contain more than one category. For instance, if a customer purchased a pair of pants, that row would have a list under Material: ["cotton", "polyester", "spandex", "wool", ...].

We can use some sort of word embeddings for these, too, right? A bag-of-words type of embedding, where the order of the words doesn't matter, just the words in that observation. I guess this is similar to the LanguageModelData. So, is it possible to combine these two?

I would appreciate any responses. Just trying to wrap my head around it.

Yes, it would be possible to do it. One thing that you might have to take into consideration is the sequence length and padding. Probably find the maximum list length and append it as a part of your input tensor and if the item is not present in the list then you can substitute with zeros.
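
A minimal sketch of that padding idea in plain Python, assuming a made-up vocabulary and a toy 2-dimensional embedding table (none of this is fastai API; `PAD`, `vocab`, `emb` and `pooled` are hypothetical names). Each row is padded to the maximum list length, and the embeddings of the real items are averaged so that the order does not matter. It assumes every row has at least one item.

```python
PAD = 0   # index reserved for "no item"; its embedding is ignored below

vocab = {"cotton": 1, "polyester": 2, "spandex": 3, "wool": 4}   # hypothetical
rows = [["cotton", "spandex"], ["wool"], ["cotton", "polyester", "spandex"]]

max_len = max(len(r) for r in rows)
# Every row becomes a fixed-length list of indices, padded with PAD.
padded = [[vocab[m] for m in r] + [PAD] * (max_len - len(r)) for r in rows]

# Toy 2-dimensional embedding table; row 0 belongs to PAD.
emb = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

def pooled(indices):
    """Average the embeddings of the real (non-PAD) items, ignoring order."""
    real = [emb[i] for i in indices if i != PAD]
    return [sum(col) / len(real) for col in zip(*real)]

features = [pooled(p) for p in padded]
print(padded)
print(features)
```

The pooled vector has a fixed size per row, so it could be concatenated with the other embeddings and the continuous variables before the first Linear layer.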

Hey guys,

I have a question concerning the “lesson3-rossman.ipynb” notebook:

In the end i get an error at following line:


TypeError Traceback (most recent call last)
in ()
----> 1 pred_test=m.predict(True)

~/ml/fastai/courses/dl1/fastai/ in predict(self, is_test, use_swa)
355 dl = if is_test else
356 m = self.swa_model if use_swa else self.model
–> 357 return predict(m, dl)
359 def predict_with_targs(self, is_test=False, use_swa=False):

~/ml/fastai/courses/dl1/fastai/ in predict(m, dl)
221 def predict(m, dl):
–> 222 preda,_ = predict_with_targs_(m, dl)
223 return to_np(

~/ml/fastai/courses/dl1/fastai/ in predict_with_targs_(m, dl)
232 if hasattr(m, ‘reset’): m.reset()
233 res = []
–> 234 for *x,y in iter(dl): res.append([get_prediction(m(*VV(x))),y])
235 return zip(*res)

TypeError: ‘NoneType’ object is not iterable

Does anyone know how to fix this?

Wish you a great sunday :slightly_smiling_face:


I am also facing the same error in the Test code. I am trying to run the notebook on Colab with my own dataset. I get the error at the TEXT.numericalize(s) statement. Any help would be highly appreciated. @jeremy @hiromi, did you face any such problem earlier?

Below is the error in detail:

AssertionError Traceback (most recent call last)
in ()
2 ss="""duniya se hat ke ik nai duniya bana saken"""
3 s = [spacy_tok(ss)]
----> 4 t=TEXT.numericalize(s)
5 ' '.join(s[0])

/usr/local/lib/python3.6/dist-packages/torchtext/data/ in numericalize(self, arr, device, train)
315 arr = arr.contiguous()
316 else:
–> 317 arr = arr.cuda(device)
318 if self.include_lengths:
319 lengths = lengths.cuda(device)

/usr/local/lib/python3.6/dist-packages/torch/ in cuda(self, device, async)
67 else:
68 new_type = getattr(torch.cuda,
---> 69 return new_type(self.size()).copy_(self, async)

/usr/local/lib/python3.6/dist-packages/torch/cuda/ in _lazy_new(cls, *args, **kwargs)
356 @staticmethod
357 def _lazy_new(cls, *args, **kwargs):
–> 358 _lazy_init()
359 # We need this method only for lazy init, so we can remove it
360 del

/usr/local/lib/python3.6/dist-packages/torch/cuda/ in _lazy_init()
118 raise RuntimeError(
119 "Cannot re-initialize CUDA in forked subprocess. " + msg)
–> 120 _check_driver()
121 torch._C._cuda_init()
122 torch._C._cuda_sparse_init()

/usr/local/lib/python3.6/dist-packages/torch/cuda/ in _check_driver()
60 Found no NVIDIA driver on your system. Please check that you
61 have an NVIDIA GPU and installed a driver from
—> 62""")
63 else:
64 # TODO: directly link to the alternative bin that needs install

Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from

In the Rossmann model, as I understand it, this is not really a time-series prediction. The model does not take into account the sales on the previous days. It only looks at the status of the one day in question and makes a prediction from that data.

However, in Lesson 6 Jeremy mentions that the third place winners deleted data where the stores were closed. Why would this matter if the model only looks at data from one day at a time?

I can't find this dataset (from arXiv) either.


IMO, this is true to some extent.
But we embed categorical variables, and some of them are related to TIME.

cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'Day', 'StateHoliday', 'CompetitionMonthsOpen',
    'Promo2Weeks', 'StoreType', 'Assortment', 'PromoInterval', 'CompetitionOpenSinceYear', 'Promo2SinceYear',
    'State', 'Week', 'Events', 'Promo_fw', 'Promo_bw', 'StateHoliday_fw', 'StateHoliday_bw',
    'SchoolHoliday_fw', 'SchoolHoliday_bw']

Then, intuitively, the neural network would try to learn something from the pairs of sales and embeddings corresponding to time (yy/mm/dd).

I'm not sure this is enough, but I hope this helps.


AFAIK, Jeremy said he had gathered the dataset on his own and that the dataset was not public.
Additionally, he said we would use a publicly available dataset in part 2.

According to this paper, it is a good approximation of RMSPE so long as the expected errors are pretty small. See page 3 and the appendix.

But I still don’t really get it.
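
One way to see it: log(y_pred) - log(y) = log(y_pred/y) = log(1 + (y_pred - y)/y), and log(1 + e) is approximately e when e is small. So the RMSE computed on logs is close to the RMSPE only while the relative errors stay small. The sketch below checks this numerically in plain Python; the sales figures and error factors are made up for illustration.

```python
import math

def rmspe(y, y_pred):
    """Root mean square of the relative errors (y_pred - y) / y."""
    return math.sqrt(sum(((p - t) / t) ** 2 for t, p in zip(y, y_pred)) / len(y))

def rmse_of_logs(y, y_pred):
    """Root mean square of log(y_pred) - log(y), i.e. RMSE on log targets."""
    return math.sqrt(sum((math.log(p) - math.log(t)) ** 2 for t, p in zip(y, y_pred)) / len(y))

y = [5000.0, 7200.0, 9100.0, 12000.0]                          # made-up sales
small = [t * f for t, f in zip(y, [1.02, 0.97, 1.05, 0.99])]   # errors of a few percent
large = [t * f for t, f in zip(y, [1.80, 0.40, 2.50, 0.60])]   # errors of 40% and more

for name, pred in [("small errors", small), ("large errors", large)]:
    print(f"{name}: RMSPE={rmspe(y, pred):.4f}  RMSE of logs={rmse_of_logs(y, pred):.4f}")
```

With errors of a few percent the two numbers nearly coincide; with large errors they drift apart, which matches the caveat in the paper.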


Hello, I'd like to do something similar. Did you have any luck handling a sequence? I tried extending MixedInputModel, ColumnarDataset and ColumnarModelData to handle a sequence that would have an embedding applied, then be flattened and concatenated with the categorical and continuous data in the forward pass, but I couldn't get it to work.

Hello all,
Can you refresh my memory…

What are the parameters is_reg and is_multi in ColumnarModelData?

How do I figure this out for myself, from the documentation, without having to ask these questions?

??ColumnarModelData.from_data_frames did not tell me much

They control the activation function of the output layer in your model. is_reg=True means that this is a regression task and there will be no activation function used in the last layer. is_multi means multilabel classification task. If set to True then the activation function will be a sigmoid, otherwise softmax will be used. Here is the code:

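x = self.outp(x)
if not self.is_reg:
    if self.is_multi:
        x = F.sigmoid(x)        # multilabel classification
    else:
        x = F.log_softmax(x)    # single-label classification
elif self.y_range:
    # regression with a known target range: squash into [y_range[0], y_range[1]]
    x = F.sigmoid(x)
    x = x*(self.y_range[1] - self.y_range[0])
    x = x+self.y_range[0]
return x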

You should set them based on your task.

I see that invalid combinations such as is_reg=True and (is_multi=True or is_multi=False) will still be processed. Should we add some warning here?