NLP transfer learning: does the transfer-learned model have the vocabulary from the pretrained model?

In the Lesson 8 video tutorial, Jeremy said we can use the pretrained Wiki model to train our own model.
I also remember him saying that, after transfer learning, your own language model will have not only the vocabulary from the pretrained Wiki model but also the words from your own corpus.

But when I tried it myself, I found that after training my language model, it does not know the words "liked" and "movie", and it only has 4400 words in its vocabulary.

Here is the code:

import pandas as pd
from fastai.text.all import *

word_path = 'words_oversampled.csv'
words_df = pd.read_csv(word_path)

def get_questions(path):
    # the DataBlock passes the source through here; we just return the raw texts
    return words_df['text'].tolist()

dls_lm = DataBlock(
    blocks=TextBlock.from_df(words_df, is_lm=True),
    get_items=get_questions,
    splitter=RandomSplitter(0.2)
).dataloaders(word_path, bs=80)

# the resulting vocab only has 4400 tokens
lm_vocab = dls_lm.vocab
len(lm_vocab), lm_vocab[-20:]

The output of len(lm_vocab) is 4400.
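A quick way to check whether individual words made it into the vocab, for example:

# membership check on the learned vocab
print('liked' in lm_vocab, 'movie' in lm_vocab)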

After I trained my language model, I tried next-word prediction:

# lm_learner is the language model learner trained on dls_lm above
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [lm_learner.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]
print("\n".join(preds))

Here is the output:

i xxunk this xxunk because of covid wil covid man made how should medical centers respond to a covid patient what is the realistim wellness impact fi covid what happens when works Do get pay ca nt pay the covid
i xxunk this xxunk because of covid what is the best way to deal with stress during lockdown can antibiotics kill covid é covid a bio weapon why covid is worse than flu want are the descriptive statitics for the

As you can see from the output, my language model does not know the words "liked" and "movie".
I am pretty sure that a language model trained on Wiki will have far more than 4400 words, and that "liked" and "movie" should be included in the vocabulary of the fine-tuned model.

So, what did I miss?

You can replace my CSV file with pretty much any dataset to give it a try.

Yes and no ;). To use your pretrained model, you'll have to save the model's encoder and the vocab. Then you have to:

  1. pass the vocab to the dataloaders
  2. create the language_model_learner (for fine-tuning) and load the pretrained weights / vocab by passing them to `language_model_learner`:
import pickle
from fastai.text.all import *

# model_base_path, direction, lang, backwards, df, bs and num_workers
# come from the surrounding pretraining setup
lm_fns = [(model_base_path/'lm'/direction/f'{lang}_wikitext_model').absolute(), 
          (model_base_path/'lm'/direction/f'{lang}_wikitext_vocab').absolute()]

# 1. load the vocab of the pretrained model and hand it to the TextBlock
with open(f'{lm_fns[1]}.pkl', 'rb') as f:
    vocab = pickle.load(f)

dblocks = DataBlock(blocks=TextBlock.from_df('text', is_lm=True, vocab=vocab, backwards=backwards),
                    get_x=ColReader('text'),
                    splitter=ColSplitter())
dls = dblocks.dataloaders(df, bs=bs, num_workers=num_workers)

# 2. create the learner and load the pretrained weights / vocab via pretrained_fnames
learn = language_model_learner(dls, AWD_LSTM, drop_mult=0.5, pretrained=True, pretrained_fnames=lm_fns, 
                               metrics=[accuracy, Perplexity()]).to_fp16()
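For reference, the two files in lm_fns are produced at the end of pretraining, roughly like this (just a sketch with the same naming as above; `learn` is assumed to be the pretraining learner):

# at the end of pretraining: save the full LM weights and the vocab so they
# can later be handed to language_model_learner via pretrained_fnames
learn.to_fp32()                          # convert fp16 weights back to fp32
learn.save(lm_fns[0], with_opt=False)    # writes ..._wikitext_model.pth
with open(f'{lm_fns[1]}.pkl', 'wb') as f:
    pickle.dump(learn.dls.vocab, f)      # writes ..._wikitext_vocab.pkl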

In your code a new tokenizer / vocab will be created that doesn’t know about the vocab of the pretrained model.
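Applied to your snippet, that means loading your saved vocab and passing it to the TextBlock, roughly like this (a sketch; 'wiki_pretrained_vocab.pkl' is a placeholder for wherever you saved the pretrained vocab):

import pickle
import pandas as pd
from fastai.text.all import *

# placeholder path: wherever you saved the vocab of your pretrained Wiki model
with open('wiki_pretrained_vocab.pkl', 'rb') as f:
    pretrained_vocab = pickle.load(f)

words_df = pd.read_csv('words_oversampled.csv')

# passing vocab= makes the TextBlock numericalize with the pretrained vocabulary
# instead of building a fresh one from your 4400-word corpus
dls_lm = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True, vocab=pretrained_vocab),
    get_x=ColReader('text'),
    splitter=RandomSplitter(0.2)
).dataloaders(words_df, bs=80)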

I created a repo for the whole pretraining / fine-tuning / classifier process with SentencePiece. It might be helpful for understanding how everything fits together:

If you have any questions, let me know.