Out Of Vocab Words

OmarAmin · December 4, 2018, 5:22pm

I’m dealing with a text processing problem that involves many technical terms and very little data: about 20k queries and responses.

Whenever I try to use a pretrained embedding, most of my technical terms don’t appear in its vocabulary, and I don’t have a large enough corpus to train new embeddings from scratch.

Even most of the newer approaches only handle out-of-vocabulary words relative to the embedding’s own vocabulary (as far as I currently understand).

Fine-tuning ELMo or GUEncoder doesn’t produce good results, as it only fine-tunes parts that don’t relate much to the task output.

The only way I’ve found to incorporate out-of-vocabulary words is to build a vocabulary list from the corpus, create a new embedding matrix, copy in the pretrained vectors for the words the embedding covers, and randomly initialize new vectors for the words it doesn’t. Intuitively, though, this doesn’t work well either.
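A minimal sketch of the embedding-matrix approach described above, assuming the pretrained vectors are available as a plain dict from word to NumPy array. The names build_embedding_matrix, pretrained, and corpus_vocab are hypothetical, and the toy vectors are made up for illustration:

```python
import numpy as np

def build_embedding_matrix(corpus_vocab, pretrained, dim, seed=0):
    """Copy pretrained vectors where available; randomly initialize the rest."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(corpus_vocab), dim), dtype=np.float32)
    oov_words = []
    for idx, word in enumerate(corpus_vocab):
        vec = pretrained.get(word)
        if vec is not None:
            # Word is covered by the pretrained embedding: reuse its vector.
            matrix[idx] = vec
        else:
            # Out-of-vocabulary word: small random vector, to be trained later.
            matrix[idx] = rng.normal(0.0, 0.1, size=dim)
            oov_words.append(word)
    return matrix, oov_words

# Toy stand-in for a real pretrained embedding (hypothetical values).
pretrained = {
    "the": np.ones(4, dtype=np.float32),
    "server": np.full(4, 2.0, dtype=np.float32),
}
corpus_vocab = ["the", "server", "kubelet"]
matrix, oov_words = build_embedding_matrix(corpus_vocab, pretrained, dim=4)
```

The matrix can then be loaded into a trainable embedding layer, so that the randomly initialized rows are updated during training.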

What do you suggest in these cases?

lordzuko · January 22, 2019, 5:46am

Have you tried a model based on subword embeddings, or byte pair encoding (BPE)?
http://www.aclweb.org/anthology/P16-1162
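To give a rough idea of how byte pair encoding handles unseen words, here is a minimal sketch in the spirit of the linked paper: repeatedly merge the most frequent adjacent symbol pair, then apply the learned merges to a new word. The function names and the toy word counts are assumptions for illustration, not a production tokenizer:

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    merged = "".join(pair)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from word frequencies."""
    # Each word starts as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def segment(word, merges):
    """Split a (possibly unseen) word into subwords using the learned merges."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy word counts (hypothetical); "lowest" is not in the training data.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=3)
pieces = segment("lowest", merges)
```

Because every word can be broken down into learned subwords or, at worst, single characters, a rare technical term always gets some representation instead of falling out of the vocabulary.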