In the Lesson 8 video tutorial, Jeremy said we can use the pretrained Wiki language model to train our own model.
I also remember him saying that after transfer learning, your own language model would have not only the knowledge from the pretrained Wiki corpus but also the vocabulary from your own dataset.
But when I tried it myself, I found that after training my language model, it does not know common words like "movie" and has only 4,400 words in its vocabulary.
Here is the code:
```python
import pandas as pd
from fastai.text.all import *

def get_questions(path):
    return words_df['text'].tolist()

word_path = 'words_oversampled.csv'
words_df = pd.read_csv(word_path)

dls_lm = DataBlock(
    blocks=TextBlock.from_df(words_df, is_lm=True),
    get_items=get_questions,
    splitter=RandomSplitter(0.2)
).dataloaders(word_path, bs=80)

# We get a 4,400-token vocabulary
lm_vocab = dls_lm.vocab
len(lm_vocab), lm_vocab[-20:]
```
The output of `len(lm_vocab)` is 4400.
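For what it's worth, my understanding is that fastai's `Numericalize` builds the vocab only from the tokens in the `DataLoaders`' own corpus, keeping tokens that appear at least `min_freq` times (up to `max_vocab`) and prepending the special `xx*` tokens. A rough sketch of that idea, where the toy corpus and thresholds are made up for illustration:

```python
from collections import Counter

# fastai's special tokens (defaults.text_spec_tok), listed here by hand
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos", "xxfld",
            "xxrep", "xxwrep", "xxup", "xxmaj"]

def make_vocab(token_lists, min_freq=3, max_vocab=60000):
    # Count every token in the corpus, keep only the frequent ones,
    # and prepend the special tokens -- roughly what fastai's
    # Numericalize does during setup, as far as I can tell.
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = [tok for tok, c in counts.most_common(max_vocab)
            if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + kept

# Toy corpus: "movie" never appears, so it cannot end up in the vocab
corpus = [["what", "is", "covid"],
          ["is", "covid", "a", "flu"],
          ["what", "is", "stress"]]
vocab = make_vocab(corpus, min_freq=2)
print(vocab)
```

If that is right, the vocab size depends entirely on my own csv, which would explain the 4,400 number.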
After training my language model, I tried next-word prediction:
```python
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [lm_learner.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
```
Here is the output:

```
i xxunk this xxunk because of covid wil covid man made how should medical centers respond to a covid patient what is the realistim wellness impact fi covid what happens when works Do get pay ca nt pay the covid i xxunk this xxunk because of covid what is the best way to deal with stress during lockdown can antibiotics kill covid é covid a bio weapon why covid is worse than flu want are the descriptive statitics for the
```
As you can see from the output, my language model replaced "liked" and "movie" with xxunk, so it does not know those words.
I am pretty sure that a language model trained on Wikipedia has far more than 4,400 words, and "movie" should certainly be in the vocabulary of the trained model.
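To show what I mean by "does not know": my understanding is that any token outside the vocab gets numericalized to index 0, i.e. xxunk, so it comes back as xxunk on decode. A minimal sketch (the tiny vocab here is hypothetical; real ones are much larger):

```python
# Minimal sketch of numericalize/decode with out-of-vocab tokens.
# The vocab below is made up for illustration.
vocab = ["xxunk", "xxbos", "i", "this", "because", "covid"]
tok2idx = {tok: i for i, tok in enumerate(vocab)}

def numericalize(tokens):
    # Unknown tokens fall back to index 0 (xxunk)
    return [tok2idx.get(tok, 0) for tok in tokens]

def decode(ids):
    return [vocab[i] for i in ids]

tokens = ["i", "liked", "this", "movie", "because"]
print(decode(numericalize(tokens)))
# "liked" and "movie" are not in the vocab, so they come back as xxunk
```

That matches the `i xxunk this xxunk because` pattern I see in my predictions.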
So, what did I miss?
You can replace my csv file with pretty much any dataset to try it yourself.