Named Entity Recognition Using Bi-LSTM-CRF


(Ankit Gupta) #1

Hi All,
I am currently working on an NER model for extracting information from resumes, using a Bi-LSTM-CRF model that takes word vectors as input. I trained the word vectors myself using word2vec.
I cannot figure out what input I should send to the model when a word is not present in the embedding matrix vocabulary. Currently I am sending a vector of zeros and, as expected, it is performing very badly.
Please help!

Thanks in advance!


(Sam) #2

If you are creating your own vectors, how often are you running into out-of-vocabulary words?


(Ankit Gupta) #3

The vocabulary that I am using is about 1.3 million words in size, so it is pretty exhaustive.
I usually hit out-of-vocabulary words when a word is misspelled. But that is our main goal for this project: to be able to identify entities based on context!


(Sam) #4

There was a long thread on this here.

It seems like, if you want to recognize words that aren’t in your training vocabulary, you might want to consider character-level embeddings like fasttext?
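
To make the subword idea concrete, here is a minimal sketch of fastText-style character n-grams. The function name `char_ngrams` is made up for illustration, and the sketch only shows how a word is split (boundary markers `<` and `>`, n-gram lengths 3 to 6). It leaves out the hashing and vector averaging that the real library does.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Split a word into character n-grams, fastText-style (illustrative helper)."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]

# A misspelled word still shares many n-grams with the correctly spelled one,
# which is why a subword model can build a vector for it.
known = set(char_ngrams("engineer"))
misspelled = set(char_ngrams("enginer"))
shared = known & misspelled
print(sorted(shared))
```

Because the vector of an unseen word is built from the n-grams it shares with known words, a typo like "enginer" ends up close to "engineer" instead of mapping to zeros.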


(Ankit Gupta) #5

Thanks for the link!
Yes, I have tried those embeddings, but they didn’t work well on a corpus made up of resumes. In my opinion, this was mainly because fasttext considers both the semantics and the morphology of a word, and in my case the corpus was filled with entities like organization names, abbreviations, education degrees and locations, where the morphology of a word doesn’t really matter.
I guess fasttext would work well on a scientific text corpus, where the morphology of a word is much more relevant.


(WG) #6

Are you trying to simply identify documents that have NER information in them (e.g., found a PERSON in this document), or are you also trying to actually locate the identified entities (e.g., found a PERSON at this particular start/end character index in this document)?

If the latter, what are your target values for train/validation?


(Ankit Gupta) #7

I am trying to locate the entities in a document. By target values do you mean the entities?


(WG) #8

I mean, what are you trying to predict when you train your model?

Your input is a numericalized document … what is the expected output?


(Ankit Gupta) #9

I am trying to tag every word with an entity label. For example,
if my input is:
I am working as a Software Engineer at Google.
my output will look like:
[OTH, OTH, OTH, OTH, OTH, B-TITLE, I-TITLE, OTH, B-COMPANY]

So as you can see, every word has a corresponding entity, and the input is tokenised on spaces.
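
As a rough sketch of how such a tag sequence maps back to entities, the snippet below groups B-/I- tags into spans. The helper name `extract_entities` is hypothetical and not part of my model code. It assumes the same space tokenisation and the OTH label for non-entities.

```python
def extract_entities(tokens, tags):
    """Group B-/I- tagged tokens into (label, text) pairs (illustrative helper)."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    entities = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if label is not None:
                entities.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # An I- tag continues the currently open entity of the same type.
            words.append(token)
        else:
            # OTH (or an inconsistent I- tag) closes any open entity.
            if label is not None:
                entities.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        entities.append((label, " ".join(words)))
    return entities

tokens = "I am working as a Software Engineer at Google.".split()
tags = ["OTH", "OTH", "OTH", "OTH", "OTH", "B-TITLE", "I-TITLE", "OTH", "B-COMPANY"]
print(extract_entities(tokens, tags))
```

Note that with tokenisation on spaces the trailing full stop stays attached to the last token, so the company comes out as "Google." rather than "Google".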


(WG) #10

Ok cool …

So the problem is at inference time, when you are trying to make a prediction as to the type of entity for a given word but the model has never seen that word, correct?

If so, one idea would be to substitute all the unknown words with similar known embeddings. There are some ideas on how this could be done in the Lesson 11 (and/or 12) notebooks, I believe.
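
A minimal sketch of one way to do that substitution for misspellings, assuming plain string similarity is good enough to find the intended word. This is not necessarily what the notebooks do: the function `lookup_vector`, the toy `embeddings` dict, and the `unk_vector` fallback are all made up for illustration.

```python
import difflib

def lookup_vector(word, embeddings, unk_vector, cutoff=0.8):
    """Return the word's vector, or that of the closest known spelling (illustrative)."""
    if word in embeddings:
        return embeddings[word]
    # Find the most similar known word by character overlap.
    matches = difflib.get_close_matches(word, list(embeddings), n=1, cutoff=cutoff)
    if matches:
        return embeddings[matches[0]]
    # Nothing similar enough: fall back to a dedicated unknown-word vector.
    return unk_vector

# Toy data standing in for a real word2vec vocabulary.
embeddings = {"engineer": [0.1, 0.2], "google": [0.3, 0.4]}
unk_vector = [0.0, 0.0]

print(lookup_vector("enginer", embeddings, unk_vector))  # reuses the vector of "engineer"
print(lookup_vector("xyzzy", embeddings, unk_vector))    # no close match, falls back
```

Scanning a 1.3 million word vocabulary this way for every unknown token would be slow, so in practice the matches would need to be cached or looked up through some index.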


(Ankit Gupta) #11

Ok. Thanks for the suggestion. I’ll look into the lectures you mentioned!


(WG) #12

If you want to post a gist of your work as you get things moving, I’d be glad to take a look at it when I have time. Good luck!


(Ankit Gupta) #13

Sure! I’ll just write a summary of what I have done so far and what problems I have been facing.
My goal is to identify the entities present in a resume. The entities are:
Company, Company Designation, Educational Organization, Educational Degree, Educational Major
For this I have created a Named Entity Recognition model in TensorFlow, using a Bi-LSTM for context encoding and a CRF for determining labelling patterns.
I created my input word vectors over a large corpus comprising resumes and some scraped LinkedIn data.
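
To illustrate what the CRF layer adds on top of the Bi-LSTM scores, here is a simplified pure-Python sketch of Viterbi decoding. It is not the TensorFlow code from my model. The function name `viterbi_decode`, the three-tag set, and all the scores are made-up values for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from the Bi-LSTM)
    transitions[i][j] : score of moving from tag i to tag j (learned by the CRF)
    """
    num_tags = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_scores, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for arriving at tag j.
            best_i = max(range(num_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emission[j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Walk back from the best final tag.
    path = [max(range(num_tags), key=lambda i: scores[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

tag_names = ["OTH", "B-TITLE", "I-TITLE"]
# Hypothetical transition scores: OTH -> I-TITLE is heavily penalised.
transitions = [
    [0.0, 0.0, -10000.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
# Hypothetical per-token scores; token 2 on its own slightly prefers I-TITLE.
emissions = [
    [2.0, 0.5, 0.0],
    [0.1, 1.0, 1.2],
    [0.2, 0.1, 1.5],
]
decoded = [tag_names[i] for i in viterbi_decode(emissions, transitions)]
print(decoded)
```

Picking the best tag per token independently would give OTH, I-TITLE, I-TITLE here, which is not a valid labelling pattern. The transition scores steer the decoder to OTH, B-TITLE, I-TITLE instead.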