Numericalization when creating the language model DataBunch

In lesson 4, Jeremy creates a language model DataBunch using the code below; he then trains the language model on the IMDB dataset by transfer learning from the pretrained WikiText-103 model.

Why don’t the DataBunch creation methods expect the vocab of WikiText-103? He does pass the vocab of data_lm to the methods used to create data_clas for the final classification task, but wouldn’t the pretrained WikiText-103 model expect its input to be numericalized with an equivalent vocab, just like using a ResNet required us to pass imagenet_stats to the learner we were transfer learning with? (There is a toy sketch of the mismatch I’m worried about after the code below.)

bs=48
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test']) 
           #We may have other temp folders that contain text files so we only keep what's in train and test
            .random_split_by_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))

data_lm.save('tmp_lm')

learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3)

data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .filter_missing_y()
             #remove docs with labels not in above list (i.e. 'unsup')
             .databunch(bs=50))
data_clas.save('tmp_clas')
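
To make the worry concrete, here is a toy sketch (plain Python with made-up vocabs, not fastai code) of how the same tokens can get different ids under two different vocabs:

# Two hypothetical vocabs built from different corpora: the same token
# can end up with a different integer id in each.
wiki_itos = ['xxunk', 'xxpad', 'the', 'movie', 'of']     # pretrained model's vocab
imdb_itos = ['xxunk', 'xxpad', 'movie', 'great', 'the']  # vocab built from the IMDB texts

wiki_stoi = {tok: i for i, tok in enumerate(wiki_itos)}
imdb_stoi = {tok: i for i, tok in enumerate(imdb_itos)}

tokens = ['the', 'movie']
print([wiki_stoi[t] for t in tokens])  # [2, 3] under the WikiText-103 vocab
print([imdb_stoi[t] for t in tokens])  # [4, 2] under the IMDB vocab

# Row 2 of the pretrained embedding matrix means 'the', but the IMDB data
# would feed id 2 for 'movie', so the ids no longer line up with the weights.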

If I remember correctly, when creating a language model DataBunch like this it will generate the vocab for you if you don’t pass one. It looks like it’s generated in the definition of TextList.
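
Roughly, that vocab-building step looks like this (a plain-Python stand-in for what fastai’s Vocab.create does; the max_vocab and min_freq values here are illustrative, not checked against the library defaults):

from collections import Counter

def build_vocab(token_lists, max_vocab=60000, min_freq=2):
    # Keep the most frequent tokens seen in the corpus, capped at max_vocab,
    # dropping anything rarer than min_freq.
    freq = Counter(tok for toks in token_lists for tok in toks)
    itos = [tok for tok, c in freq.most_common(max_vocab) if c >= min_freq]
    # fastai also reserves special tokens (xxunk, xxpad, ...) at the front.
    return ['xxunk', 'xxpad'] + itos

texts = [['this', 'movie', 'was', 'great'],
         ['this', 'movie', 'was', 'terrible']]
itos = build_vocab(texts)
stoi = {tok: i for i, tok in enumerate(itos)}
print([stoi.get(tok, 0) for tok in texts[0]])  # numericalize; rare/unseen tokens map to xxunk (id 0)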

But since we are fine-tuning a pretrained model whose weights correspond to its own numericalization, shouldn’t the numericalization of the DataBunch we train on be the same? We aren’t passing WikiText-103’s vocab when creating data_lm; we only pass data_lm’s vocab when creating data_clas.

Ah sorry, I misunderstood; we’re fine-tuning a pretrained model here. In that case it looks like the pretrained model’s vocab is loaded alongside the pretrained weights, and the weights are then adapted with convert_weights using both the old and new vocab.
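
For anyone curious, the remapping amounts to something like this (a conceptual numpy sketch of the idea, not the actual fastai implementation, which operates on the model’s state dict):

import numpy as np

def remap_embeddings(old_emb, old_stoi, new_itos):
    # For each token in the new vocab, copy its pretrained row if the token
    # existed in the old vocab; otherwise fall back to the mean embedding.
    mean_row = old_emb.mean(axis=0)
    new_emb = np.zeros((len(new_itos), old_emb.shape[1]))
    for i, tok in enumerate(new_itos):
        idx = old_stoi.get(tok, -1)
        new_emb[i] = old_emb[idx] if idx >= 0 else mean_row
    return new_emb

# Hypothetical example: a 4-token pretrained vocab remapped onto a 3-token new vocab.
old_itos = ['xxunk', 'the', 'movie', 'of']
old_stoi = {tok: i for i, tok in enumerate(old_itos)}
old_emb = np.random.randn(len(old_itos), 8)   # stand-in for the pretrained embedding matrix
new_itos = ['xxunk', 'movie', 'brilliant']    # 'brilliant' is new, so it gets the mean row
print(remap_embeddings(old_emb, old_stoi, new_itos).shape)  # (3, 8)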


Oh okay, I get it now. Thanks a lot 🙂
