In lesson 4, Jeremy creates a language model DataBunch using the code below, on which he trains a language model for the IMDB dataset by transfer learning from the pretrained WikiText-103 model. Why don't the DataBunch creation methods expect the vocab of WikiText-103? He does pass the vocab of data_lm to the methods used to create data_clas for the final classification task, but wouldn't the pretrained WikiText-103 model expect the data to be numericalized with an equivalent vocab, just like using a ResNet required us to pass imagenet_stats to the learner we were doing transfer learning with?
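To make the concern concrete, here is a small illustrative sketch (plain Python, not fastai code — `build_vocab` and `numericalize` are hypothetical helpers): numericalization is just a token-to-integer lookup, so the ids a text gets depend entirely on which vocab is used.

```python
# Illustrative sketch (not fastai code): numericalization is a
# token -> integer lookup, so ids depend entirely on the vocab used.

def build_vocab(tokens):
    """Map each unique token to an integer id, in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def numericalize(tokens, vocab, unk=-1):
    """Replace tokens with their ids; unknown tokens get `unk`."""
    return [vocab.get(tok, unk) for tok in tokens]

wiki_vocab = build_vocab("the movie of a".split())     # stand-in for WT103's vocab
imdb_vocab = build_vocab("movie the great a".split())  # stand-in for IMDB's vocab

sample = "the movie".split()
print(numericalize(sample, wiki_vocab))  # [0, 1]
print(numericalize(sample, imdb_vocab))  # [1, 0]
```

Same tokens, different ids under each vocab — which is exactly why it seems like the pretrained model's vocab should matter somewhere.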
bs=48
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
           .filter_by_folder(include=['train', 'test'])
           #We may have other temp folders that contain text files so we only keep what's in train and test
           .random_split_by_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
           .label_for_lm()
           #We want to do a language model so we label accordingly
           .databunch(bs=bs))
data_lm.save('tmp_lm')
learn = language_model_learner(data_lm, pretrained_model=URLs.WT103, drop_mult=0.3)
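My understanding is that the vocab alignment happens inside `language_model_learner` itself: when it loads the WikiText-103 weights, fastai remaps the pretrained embedding rows onto the new corpus's vocab by matching token strings, so the DataBunch never needs WT103's vocab. A rough sketch of that remapping (`remap_embeddings` is a hypothetical helper, not fastai's actual code):

```python
# Hypothetical sketch of remapping pretrained embeddings to a new vocab
# by matching token strings (fastai does something similar internally
# when loading WT103 weights; this is not its actual implementation).

def remap_embeddings(old_itos, old_emb, new_itos):
    """Build an embedding matrix for new_itos, copying rows from old_emb
    where the token exists in the old vocab, else using the mean row."""
    old_stoi = {tok: i for i, tok in enumerate(old_itos)}
    mean_row = [sum(col) / len(old_emb) for col in zip(*old_emb)]
    new_emb = []
    for tok in new_itos:
        if tok in old_stoi:
            new_emb.append(list(old_emb[old_stoi[tok]]))
        else:
            new_emb.append(list(mean_row))
    return new_emb

wt_itos = ['the', 'movie', 'of']          # pretend WT103 vocab
wt_emb  = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
imdb_itos = ['movie', 'the', 'awesome']   # pretend IMDB vocab

emb = remap_embeddings(wt_itos, wt_emb, imdb_itos)
# 'movie' and 'the' keep their pretrained rows (in the new order),
# while 'awesome' falls back to the mean row.
```

So each token that both corpora share ends up with its pretrained embedding, regardless of where it sits in either vocab.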
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
           #grab all the text files in path
           .split_by_folder(valid='test')
           #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
           .label_from_folder(classes=['neg', 'pos'])
           #label them all with their folders
           .filter_missing_y()
           #remove docs with labels not in above list (i.e. 'unsup')
           .databunch(bs=50))
data_clas.save('tmp_clas')
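By contrast, passing `vocab=data_lm.vocab` here is essential: the fine-tuned encoder learned its token-to-id mapping from data_lm, so the classification data must numericalize the same way. A sketch of what goes wrong otherwise (`build_vocab` is a hypothetical helper, not fastai code):

```python
# Sketch (hypothetical helper, not fastai code): why data_clas must
# reuse data_lm.vocab. The encoder was fine-tuned with data_lm's
# token -> id mapping, so the classifier data must match it.

def build_vocab(corpus_tokens):
    """Assign each unique token an id in first-seen order."""
    vocab = {}
    for tok in corpus_tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

lm_vocab = build_vocab("the movie was great the plot".split())

review = "the movie was great".split()
shared_ids = [lm_vocab[t] for t in review]

# Rebuilding a vocab from the classification corpus alone would assign
# different ids to the same tokens, silently misaligning the embeddings.
clas_vocab = build_vocab("great movie the was".split())
fresh_ids = [clas_vocab[t] for t in review]

print(shared_ids == fresh_ids)  # False: same text, different ids
```

With the shared vocab, every id fed to the classifier points at the same embedding row the language model was fine-tuned with.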