In the Lesson 8 video tutorial, Jeremy said we can use the pretrained Wiki language model to train our own model. I also remember him saying that after transfer learning, your language model will have not only the vocabulary from the pretrained Wiki model but also the vocabulary from your own corpus.
But when I tried it myself, I found that after training my language model, it does not know the words `liked` and `movie`, and it only has 4400 words in its vocabulary.
Here is the code:

```python
import pandas as pd
from fastai.text.all import *

word_path = 'words_oversampled.csv'
words_df = pd.read_csv(word_path)

def get_questions(path):
    return words_df['text'].tolist()

dls_lm = DataBlock(
    blocks=TextBlock.from_df(words_df, is_lm=True),
    get_items=get_questions,
    splitter=RandomSplitter(0.2)
).dataloaders(word_path, bs=80)

# We get a vocabulary of 4400 tokens
lm_vocab = dls_lm.vocab
len(lm_vocab), lm_vocab[-20:]
```
The output of `len(lm_vocab)` is 4400.
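My guess is that the vocabulary is built from my own corpus with some frequency cutoff (I believe fastai's numericalization defaults are something like `min_freq=3` and `max_vocab=60000`, but I'm not certain), so rare words get dropped and mapped to `xxunk`. Here is a toy sketch of what I think is happening; this is my own simplified version, not fastai's actual implementation:

```python
from collections import Counter

def make_vocab_sketch(tokens, min_freq=3, max_vocab=60000):
    # Special tokens fastai always includes in the vocab
    specials = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld',
                'xxrep', 'xxwrep', 'xxup', 'xxmaj']
    counts = Counter(tokens)
    # Keep only tokens seen at least min_freq times, most frequent first
    kept = [t for t, c in counts.most_common(max_vocab) if c >= min_freq]
    return specials + [t for t in kept if t not in specials]

corpus = ('covid covid covid lockdown lockdown lockdown '
          'liked movie stress stress stress').split()
vocab = make_vocab_sketch(corpus)
print('liked' in vocab, 'movie' in vocab)  # → False False
```

If that is right, it would explain why words that appear only once or twice in my csv never make it into the 4400-word vocabulary.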
After I trained my language model, I tried next-word prediction:
```python
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [lm_learner.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
print("\n".join(preds))
```
Here is the output:

```
i xxunk this xxunk because of covid wil covid man made how should medical centers respond to a covid patient what is the realistim wellness impact fi covid what happens when works Do get pay ca nt pay the covid
i xxunk this xxunk because of covid what is the best way to deal with stress during lockdown can antibiotics kill covid é covid a bio weapon why covid is worse than flu want are the descriptive statitics for the
```
As you can see from the output, my language model does not know the words `liked` and `movie`. I am pretty sure that a language model trained on Wiki will have far more than 4400 words, and `liked` and `movie` should be included in the vocabulary of the trained model.
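To confirm which prompt words fall outside the vocabulary, I checked them against `lm_vocab` directly. The snippet below uses a tiny stand-in list so it runs on its own; in my real run `lm_vocab` has 4400 entries:

```python
# Tiny stand-in for dls_lm.vocab; the real lm_vocab has 4400 entries
lm_vocab = ['xxunk', 'xxbos', 'i', 'this', 'because', 'covid', 'what']

TEXT = "I liked this movie because"
# Words not in the vocab get tokenized as xxunk at prediction time
missing = [w for w in TEXT.lower().split() if w not in lm_vocab]
print(missing)  # → ['liked', 'movie']
```

That matches the `i xxunk this xxunk because` pattern in the predictions above.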
So, what did I miss?
You can replace my csv file with pretty much any dataset to give it a try.