Fine-tuning word2vec using ULMFiT

Hi everyone,

I’m trying to train a word2vec model in a specific business domain. I’ve collected a relatively large corpus of domain-specific text, trained a model using gensim’s basic functionality, and got decent results (in the sense that words that are semantically similar in my domain, but not in general English, are indeed embedded more closely than they are in the Google News model).

I’m trying to improve my model, and what I had in mind is some sort of transfer learning. I thought about fine-tuning Google’s word2vec - extending its vocabulary and re-training it on my domain-specific corpus.

My question is - can I somehow use the method introduced in the ULMFiT paper to fine-tune a word2vec model? The two problems seem very similar to me, but I’m not sure how to do this… I would appreciate any thoughts/ideas/references on this (the most similar question I’ve found on the forum is this).

Thanks!

You can update the vocab of a gensim model like this:

from gensim.models import Word2Vec

# load the pre-trained file (this must be a full Word2Vec model, not just KeyedVectors)
model = Word2Vec.load(word2vec_file)

# update the vocabulary with new domain terms
model.vocabulary.min_count = min_word_freq  # gensim 3.x; in gensim 4.x this is model.min_count
model.build_vocab(sentences=new_sentences, update=True)

Then train word2vec again with new_sentences.
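
A minimal sketch of that continued-training call, assuming gensim 3.x (new_sentences is the same iterable of tokenised sentences passed to build_vocab above):

model.train(
    new_sentences,
    total_examples=model.corpus_count,  # refreshed by build_vocab(update=True)
    epochs=model.epochs,                # reuse the epoch count stored on the model
)

The pre-trained vectors stay as the starting point, so the new domain terms get trained into the same embedding space.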

If you wanted to transfer the method from ULMFiT, you would first train word2vec on a giant English corpus (like WikiText-103), then fine-tune that word2vec model on your domain-specific corpus. It’s different because ULMFiT’s last step lets the encoder change with end-to-end training on a classification task, whereas word2vec is learned separately from any classification task. I think Jeremy talks about this difference in lesson 8.
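
For concreteness, a rough sketch of that two-stage recipe in gensim 3.x (general_sentences and domain_sentences are hypothetical iterables of tokenised sentences, and the hyperparameters are placeholders):

from gensim.models import Word2Vec

# stage 1: train from scratch on a large general-English corpus (e.g. WikiText-103)
base_model = Word2Vec(sentences=general_sentences, size=300, min_count=5, workers=4)
base_model.save('word2vec_general.model')

# stage 2: reload and fine-tune on the domain corpus, as in the snippet above
model = Word2Vec.load('word2vec_general.model')
model.build_vocab(sentences=domain_sentences, update=True)
model.train(domain_sentences, total_examples=model.corpus_count, epochs=model.epochs)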

Thanks for the reply.

I tried the update-vocab-and-keep-training approach you suggested, but ran into a technical difficulty: the Google News model is distributed in the KeyedVectors format (at least the one I’ve found here), which is not trainable (see here). I’d be happy to hear if there’s any way around this…
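
To illustrate the problem (the file name below is just the usual name of the Google News download; substitute your own path):

from gensim.models import KeyedVectors

# the Google News release is a vectors-only file, so it loads as KeyedVectors:
# lookups and similarity queries work, but there is no build_vocab()/train(),
# because the training state (context weights, vocab counts) isn't included
kv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(kv.most_similar('finance', topn=5))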

Regarding training word2vec on a giant English corpus - that’s exactly what I’m trying to avoid, as (I think) I don’t have the necessary compute. Instead, I want to use a pre-trained model and fine-tune it for my needs. This is what led me to ULMFiT…

I don’t know the answer to your question, but I’m also interested in how it could be done.
Please let me know if you come up with something. I will do the same.
Cheers!

@adamh: Why don’t you use the ULMFiT language model’s hidden states as word embeddings, similar to what ELMo did? You could take those hidden states as contextual word embeddings and build your model on top of them.
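
A minimal sketch of that idea, assuming you have already exported the fine-tuned ULMFiT encoder as a plain PyTorch module (the names encoder and token_ids are placeholders; the exact fastai calls to get the encoder depend on your version):

import torch

def contextual_embeddings(encoder, token_ids):
    # token_ids: LongTensor of shape [batch, seq_len], built with the same
    # tokenizer/vocab the language model was fine-tuned with
    encoder.eval()
    with torch.no_grad():
        hidden = encoder(token_ids)  # assumed output shape: [batch, seq_len, hidden_dim]
    return hidden

# each token position gets a context-dependent vector; pool them (e.g. mean over
# seq_len) or feed them directly to a downstream model in place of static word2vec vectors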