ULMFiT - Corpus

How does one find/download the ULMFiT corpus, or its set of word embeddings?

Also, if you wished to add a word to an existing corpus, would this be possible? I assume not.

Check the notes for that particular lecture; a link to download the pretrained Wikipedia LM is there.

Do you want to add the word to your own personal model? In that case, yes, you can do that, and it's demonstrated in the lecture, though we don't explicitly create word embeddings anymore in Part 2. There's really no reason to “add” a specific word to the base LM, because any specifics you need for your application can be added by you. The point of transfer learning is to keep the base general (like a model trained on Wikipedia articles), so adding specific words to a generalized model isn't really the purpose.
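If you do want a new token in your own fine-tuned model, the usual trick is simply to grow the embedding matrix by one row; when ULMFiT adapts the pretrained vocab to a new corpus, tokens that weren't in the pretrained vocab get initialised to the mean of the existing embeddings. A minimal PyTorch sketch (the variable names are hypothetical placeholders, not the fastai API):

```python
import torch

# Hypothetical stand-ins for your pretrained embedding matrix and vocab;
# the real objects live inside whatever model/vocab classes you're using.
emb = torch.randn(30000, 400)              # (vocab_size, emb_dim); AWD-LSTM uses 400-dim embeddings
itos = [f"tok{i}" for i in range(30000)]   # index -> token string
stoi = {s: i for i, s in enumerate(itos)}  # token string -> index

def add_word(word, emb, itos, stoi):
    """Append one row for `word`, initialised to the mean of the existing
    embeddings (how ULMFiT handles tokens missing from the pretrained vocab)."""
    new_row = emb.mean(dim=0, keepdim=True)
    emb = torch.cat([emb, new_row], dim=0)
    itos.append(word)
    stoi[word] = len(itos) - 1
    return emb, itos, stoi

emb, itos, stoi = add_word("myword", emb, itos, stoi)
print(emb.shape)  # torch.Size([30001, 400])
```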

What’s the best way to query the LanguageModel? I’m looking for similar vectors, or cosine distances.

If you’re just looking for vectors to compare similarity, then you can use Stanford’s GloVe word embeddings: https://nlp.stanford.edu/projects/glove/
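As a concrete example, assuming you’ve downloaded and unzipped `glove.6B.100d.txt` from that page, nearest neighbours by cosine similarity only take a few lines of NumPy (the file name and query word below are placeholders):

```python
import numpy as np

# Load GloVe vectors: each line is "word v1 v2 ... vd" in plain text.
words, vecs = [], []
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vecs.append(np.array(parts[1:], dtype=np.float32))
vecs = np.stack(vecs)
stoi = {w: i for i, w in enumerate(words)}

def nearest(word, k=10):
    """Return the k words most cosine-similar to `word`."""
    v = vecs[stoi[word]]
    sims = vecs @ v / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(v) + 1e-8)
    return [words[i] for i in np.argsort(-sims)[1:k + 1]]  # skip the word itself

print(nearest("language"))
```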


That, or fastText’s language-specific embeddings trained on Wikipedia. You’d also use word embeddings to do the actual transfer learning.
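If you go the fastText route, the `.vec` text files it distributes use almost the same format as GloVe’s, except for a header line giving the vocab size and dimension, so a loader only needs one extra `readline()`. A small sketch (the `wiki.en.vec` file name and the word cap are placeholders):

```python
import numpy as np

def load_fasttext_vec(path, max_words=50000):
    """Load a fastText .vec file: a "<vocab_size> <dim>" header line,
    then "word v1 v2 ... vd" per line, same as GloVe."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        n, d = map(int, f.readline().split())  # skip/parse the header line
        for i, line in enumerate(f):
            if i >= max_words:                 # cap memory for quick experiments
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.array(parts[1:], dtype=np.float32))
    return words, np.stack(vecs)

words, vecs = load_fasttext_vec("wiki.en.vec")
```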