I am currently working on creating a NER model for extracting information from resumes, using a Bi-LSTM-CRF model which takes word vectors as input. I have created the word vectors on my own using word2vec.
I am not able to figure out what input I should send to the model when a word is not present in the embedding matrix vocabulary. Currently I am sending a vector of zeros, and it is performing very badly, as expected.
Thanks in advance!
If you are creating your own vectors, how often are you running into out-of-vocabulary words?
The vocabulary that I am using is about 1.3 million words in size, so it is pretty exhaustive.
I usually run into out-of-vocabulary words when a word is misspelled. But that is our main goal for this project: to be able to identify entities based on context!
There was a long thread on this here.
It seems like, if you want to recognize words that aren’t in your training vocabulary, you might want to consider character-level embeddings like fasttext?
Thanks for the link!
Yes, I have tried those embeddings, but they didn’t work well on a corpus made up of resumes. In my opinion, that was mainly because fasttext considers both the semantics and the morphology of a word, and in my case the corpus was filled with entities like organization names, abbreviations, educational degrees and locations, where the morphology of a word doesn’t really matter.
I guess fasttext would work well on a scientific text corpus, where the morphology of a word is much more relevant.
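(For anyone following along: fasttext represents a word as a bag of character n-grams plus the word itself, so even an unseen or misspelled word still gets a vector by summing its n-gram vectors. A rough sketch of the n-gram extraction — the 3-to-6 window and boundary markers here mirror fasttext's defaults, but this is just an illustration:)

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, fasttext-style."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# A misspelled word still decomposes into mostly-known subword units,
# so it shares most of its n-grams (and hence its vector) with the
# correctly spelled form:
grams = char_ngrams("Engneer")  # misspelling of "Engineer"
```

This is also why fasttext helps less when the entities are names and abbreviations: their subword units carry little meaning.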
Are you trying to simply identify documents that have NER information in it (e.g., found a PERSON in this document), or are you also trying to actually locate the identified entities (e.g., found a PERSON at this particular start/end character index in this document)?
If the latter, what are your target values for train/validation?
I am trying to locate the entities in a document. By target values do you mean the entities?
I mean what are you trying to predict when you train your model.
Your input is a numericalized document … what is the expected output?
I am trying to tag every word with an entity label. For example,
if my input is:
I am working as a Software Engineer at Google.
my output will look like:
[OTH, OTH, OTH, OTH, OTH, B-TITLE, I-TITLE, OTH, B-COMPANY]
So as you can see, every word has a corresponding entity, and the input is tokenised on spaces.
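(To make the alignment concrete, the example above can be sketched as — tag names taken straight from the post:)

```python
sentence = "I am working as a Software Engineer at Google."
tokens = sentence.split()  # tokenised on spaces, punctuation stays attached
tags = ["OTH", "OTH", "OTH", "OTH", "OTH",
        "B-TITLE", "I-TITLE", "OTH", "B-COMPANY"]

# One tag per token; B-/I- prefixes mark the beginning and the
# continuation of a multi-word entity (BIO-style tagging).
pairs = list(zip(tokens, tags))
```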
Ok cool …
So the problem is at inference time, when you are trying to make a prediction as to the type of entity for a given word but the model has never seen that word, correct?
If so, one idea would be to substitute all the unknown words with similar known embeddings. There are some ideas on how this could be done in the Lesson 11 (and/or 12) notebooks, I believe.
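(One simple version of this, since the OOV words here are mostly misspellings: fall back to the embedding of the closest-spelled in-vocabulary word. The vocabulary and embedding sizes below are toy values, not the poster's actual setup:)

```python
import numpy as np

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# Hypothetical toy vocabulary and embedding matrix
vocab = ["engineer", "google", "software"]
emb = np.random.rand(len(vocab), 50)

def lookup(word):
    """Return a known embedding instead of a zero vector for OOV words."""
    if word in vocab:
        return emb[vocab.index(word)]
    nearest = min(vocab, key=lambda w: edit_distance(word, w))
    return emb[vocab.index(nearest)]

vec = lookup("engneer")  # falls back to "engineer"'s vector
```

For a 1.3M-word vocabulary you'd want something faster than a linear scan (e.g. a BK-tree or restricting candidates by length), but the idea is the same.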
Ok. Thanks for the suggestion. I’ll look into the lectures you mentioned!
If you want to post a gist of your work as you get things moving, I’d be glad to take a look at it when I have time. Good luck!
Sure! I’ll just write a summary of what I have done till now and what problems I have been facing.
My goal is to identify the entities present in a resume. The entities are :
Company, Company Designation, Educational Organization, Educational Degree, Educational Major
For this I have created a Named Entity Recognition Model in tensorflow using Bi-LSTM for context encoding and CRF for determining labelling patterns.
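(For readers unfamiliar with the CRF part: at inference time the CRF decodes the best tag sequence from the Bi-LSTM's per-word scores plus a learned tag-to-tag transition matrix, via the Viterbi algorithm. A minimal NumPy sketch of the decoding step — not the poster's TensorFlow code:)

```python
import numpy as np

def viterbi(emissions, transitions):
    """Decode the highest-scoring tag sequence.

    emissions:   (n_words, n_tags) per-word scores, e.g. Bi-LSTM outputs
    transitions: (n_tags, n_tags) score of moving from tag i to tag j
    """
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # total[i, j] = best score ending in tag i, then moving to tag j
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Backtrack from the best final tag
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The transition matrix is what lets the model learn labelling patterns like "I-TITLE can only follow B-TITLE".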
I created my input word vectors over a large corpus comprising resumes and some scraped LinkedIn data.
I’m running into a problem similar to the one you described. Did you end up solving this? What logic did you use for unseen words?
Hey! I stuck with zero-vector initialisation for representing unseen words. I read up on LSTMs, and it turns out they can handle zero-initialised vectors by inferring meaning from the other words in the sentence.
I also added one-hot encoded handcrafted spelling features to the vectors in order to give the model some more information about the words. For example, features like whether the word starts with a capital letter, whether it contains punctuation, etc.
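Something like this (the exact feature set here is illustrative, not my full list):

```python
import string

def spelling_features(word):
    """Hand-crafted binary spelling features, concatenated to the word vector."""
    return [
        int(word[:1].isupper()),                          # starts with a capital
        int(word.isupper()),                              # all caps (abbreviations)
        int(any(c in string.punctuation for c in word)),  # contains punctuation
        int(any(c.isdigit() for c in word)),              # contains a digit
    ]

feats = spelling_features("B.Tech")  # [1, 0, 1, 0]
```

These help a lot for entities like degrees and abbreviations, where spelling shape carries signal even when the word itself is unseen.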
You could also try adding a CNN to your model, in order to capture character level information about a word.
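The idea being: convolve small filters over character embeddings and max-pool, so every word (seen or unseen) gets a fixed-size character-level feature vector. A toy NumPy sketch with made-up sizes (random, untrained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 128 chars, 16-dim char embeddings, 30 filters of width 3
char_emb = rng.normal(size=(128, 16))
filters = rng.normal(size=(30, 3 * 16))

def char_cnn(word):
    """Slide width-3 filters over character embeddings, then max-pool
    over positions, giving a fixed-size feature for any word."""
    chars = [ord(c) % 128 for c in "<" + word + ">"]  # boundary markers
    E = char_emb[chars]                               # (len, 16)
    windows = np.stack([E[i:i + 3].ravel()            # (len - 2, 48)
                        for i in range(len(chars) - 2)])
    conv = windows @ filters.T                        # (len - 2, 30)
    return conv.max(axis=0)                           # max-pool -> (30,)

feat = char_cnn("Engneer")  # still well-defined for unseen/misspelled words
```

In the real model these weights would of course be trained end-to-end alongside the Bi-LSTM.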
Did you end up using the fastai library for your task? I’m planning to do something similar on medical records for my master’s thesis, so I’m a little curious.
Hey! I used tensorflow to build the model as I was more familiar with its functioning.
Oh, ok. But did you use the embeddings from the wiki LM or something similar?
Hi @ankit0110, have you tried implementing this in fastai?