How to extract word embeddings from fastai language model?

(Charin) #1

Sorry for the rather obvious question, but I couldn’t find an answer on the forum.

For context, I’m trying to train word vectors for the Thai language using a Wikipedia data dump and the fastai language model introduced in Lesson 4. (Yes, several libraries, such as fastText, already provide word vectors, but I want to benchmark against them, and I hope our methods, such as SGDR, could give better performance.)

Now, if I understand correctly, the fastai language model transforms words into vectors and then uses them to predict the next word given the context. But how do we extract these vectors?
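To make the setup concrete, here is a minimal plain-Python sketch of the two ideas in the question: the embedding layer as a lookup table with one row per word, and next-word prediction as inputs paired with the same tokens shifted by one. The vocabulary, the `embed` helper, and the random values are all made up for illustration; this is not how fastai stores things internally.

```python
import random

# Toy vocabulary (hypothetical); a real one would be built from the tokenized corpus.
vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

# An embedding layer is a (vocab_size x emb_dim) table of numbers.
# Training adjusts these values; here they are random placeholders.
random.seed(0)
emb_dim = 4
embedding_matrix = [
    [random.uniform(-0.1, 0.1) for _ in range(emb_dim)] for _ in vocab
]

def embed(word):
    """Return the table row for `word`, falling back to the <unk> row."""
    return embedding_matrix[word_to_idx.get(word, word_to_idx["<unk>"])]

# Next-word prediction: the targets are the input tokens shifted by one position.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, targets = tokens[:-1], tokens[1:]
pairs = list(zip(inputs, targets))  # (current word, word to predict)
```

The "word vectors" being asked about are simply the rows of that table once training has finished.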

Thanks in advance!

3 Likes

(Jeremy Howard (Admin)) #2

The main thing we do is not to create word vectors, but to create a language model. So I’m not sure that extracting the word vectors is what you want.

However, if you do want to do this, you can use the same approach we used to grab the embeddings for the movielens dataset in the last lesson.
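As a rough sketch of what that approach looks like for a language model: the embedding is an ordinary layer whose weight matrix can be read out after training. The example below uses plain PyTorch rather than fastai's own classes; `TinyLanguageModel`, the attribute name `encoder`, and the toy vocabulary are assumptions for illustration, so the attribute path in a real fastai model will differ.

```python
import torch
import torch.nn as nn

class TinyLanguageModel(nn.Module):
    """Hypothetical stand-in for a language model: embedding -> LSTM -> vocab logits."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, emb_dim)   # the word vectors live here
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        hidden, _ = self.rnn(self.encoder(token_ids))
        return self.decoder(hidden)  # one score per vocabulary word, per position

# Toy vocabulary (hypothetical).
vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
word_to_idx = {word: i for i, word in enumerate(vocab)}

model = TinyLanguageModel(vocab_size=len(vocab), emb_dim=8, hidden_dim=16)
# ... training on next-word prediction would happen here ...

# Copy the embedding matrix out of the model: one row per vocabulary word.
word_vectors = model.encoder.weight.detach().cpu().clone()

def vector_for(word):
    """Look up the vector for `word`, falling back to the <unk> row."""
    return word_vectors[word_to_idx.get(word, word_to_idx["<unk>"])]

# Sanity check of the forward pass on a batch of one 3-token sequence.
logits = model(torch.tensor([[1, 2, 3]]))
```

Calling `word_vectors.numpy()` afterwards gives an array that can be saved or compared against other pretrained vectors.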

2 Likes

(Charin) #3

Thank you for the reply, Jeremy. I reread the notebooks and now understand what I had confused. One question about extracting the embeddings as we did with movielens: in lessons 5-6, we trained the embeddings for a collaborative filtering model, so our model predicted the ratings of user-movie pairs. But if I want to extract word embeddings, what would my model be predicting? Would it be the words themselves, meaning the input equals the output?

1 Like