I am currently working on creating a NER model for extracting information from resumes, using a Bi-LSTM-CRF model which takes word vectors as input. I have created the word vectors on my own using word2vec.
I am not able to figure out what input I should send to the model when a word is not present in the embedding matrix vocabulary. Currently I am sending a vector of zeros, and it is performing very badly, as expected.
Thanks in advance!
If you are creating your own vectors, how often are you running into out-of-vocabulary words?
The vocabulary that I am using is about 1.3 million words in size, so it is pretty exhaustive.
I usually run into out-of-vocabulary words when a word is misspelled. But that is our main goal for this project: to be able to identify entities based on context!
There was a long thread on this here.
It seems like, if you want to recognize words that aren’t in your training vocabulary, you might want to consider character-level embeddings like fasttext?
Thanks for the link!
Yes, I have tried those embeddings, but they didn’t work well on a corpus made up of resumes. In my opinion, that was mainly because fasttext considers both the semantics and the morphology of a word, and in my case the corpus was filled with entities like organization names, abbreviations, educational degrees and locations, where the morphology of a word doesn’t really matter.
I guess fasttext would work well on a scientific text corpus, where the morphology of a word is much more relevant.
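(For anyone following along: fasttext represents a word as a bag of character n-grams plus the word itself, so even an unseen or misspelled word still gets a vector by summing its n-gram vectors. A rough sketch of the n-gram extraction — the 3-to-6 window and boundary markers here mirror fasttext's defaults, but this is just an illustration:)

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, fasttext-style."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# A misspelled word still decomposes into mostly-known subword units,
# so it shares most of its n-grams (and hence its vector) with the
# correctly spelled form:
grams = char_ngrams("Engneer")  # misspelling of "Engineer"
```

This is also why fasttext helps less when the entities are names and abbreviations: their subword units carry little meaning.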
Are you trying to simply identify documents that have NER information in it (e.g., found a PERSON in this document), or are you also trying to actually locate the identified entities (e.g., found a PERSON at this particular start/end character index in this document)?
If the latter, what are your target values for train/validation?
I am trying to locate the entities in a document. By target values do you mean the entities?
I mean what are you trying to predict when you train your model.
Your input is a numericalized document … what is the expected output?
I am trying to tag every word with an entity label. For example,
if my input is:
I am working as a Software Engineer at Google.
my output will look like:
[OTH, OTH, OTH, OTH, OTH, B-TITLE, I-TITLE, OTH, B-COMPANY]
So as you can see, every word has a corresponding entity, and the input is tokenised on spaces.
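(To make the alignment concrete, the example above can be sketched as — tag names taken straight from the post:)

```python
sentence = "I am working as a Software Engineer at Google."
tokens = sentence.split()  # tokenised on spaces, punctuation stays attached
tags = ["OTH", "OTH", "OTH", "OTH", "OTH",
        "B-TITLE", "I-TITLE", "OTH", "B-COMPANY"]

# One tag per token; B-/I- prefixes mark the beginning and the
# continuation of a multi-word entity (BIO-style tagging).
pairs = list(zip(tokens, tags))
```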
Ok cool …
So the problem is at inference time, when you are trying to make a prediction as to the type of entity for a given word but the model has never seen that word, correct?
If so, one idea would be to substitute all the unknown words with similar known embeddings. There are some ideas on how this could be done in the Lesson 11 (and/or 12) notebooks, I believe.
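(One simple version of this, since the OOV words here are mostly misspellings: fall back to the embedding of the closest-spelled in-vocabulary word. The vocabulary and embedding sizes below are toy values, not the poster's actual setup:)

```python
import numpy as np

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

# Hypothetical toy vocabulary and embedding matrix
vocab = ["engineer", "google", "software"]
emb = np.random.rand(len(vocab), 50)

def lookup(word):
    """Return a known embedding instead of a zero vector for OOV words."""
    if word in vocab:
        return emb[vocab.index(word)]
    nearest = min(vocab, key=lambda w: edit_distance(word, w))
    return emb[vocab.index(nearest)]

vec = lookup("engneer")  # falls back to "engineer"'s vector
```

For a 1.3M-word vocabulary you'd want something faster than a linear scan (e.g. a BK-tree or restricting candidates by length), but the idea is the same.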
Ok. Thanks for the suggestion. I’ll look into the lectures you mentioned!
If you want to post a gist of your work as you get things moving, I’d be glad to take a look at it when I have time. Good luck!
Sure! I’ll just write a summary of what I have done till now and what problems I have been facing.
My goal is to identify the entities present in a resume. The entities are :
Company, Company Designation, Educational Organization, Educational Degree, Educational Major
For this I have created a Named Entity Recognition Model in tensorflow using Bi-LSTM for context encoding and CRF for determining labelling patterns.
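(For readers unfamiliar with the CRF part: at inference time the CRF decodes the best tag sequence from the Bi-LSTM's per-word scores plus a learned tag-to-tag transition matrix, via the Viterbi algorithm. A minimal NumPy sketch of the decoding step — not the poster's TensorFlow code:)

```python
import numpy as np

def viterbi(emissions, transitions):
    """Decode the highest-scoring tag sequence.

    emissions:   (n_words, n_tags) per-word scores, e.g. Bi-LSTM outputs
    transitions: (n_tags, n_tags) score of moving from tag i to tag j
    """
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # total[i, j] = best score ending in tag i, then moving to tag j
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Backtrack from the best final tag
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

The transition matrix is what lets the model learn labelling patterns like "I-TITLE can only follow B-TITLE".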
I created my input word vectors over a large corpus comprising resumes and some scraped LinkedIn data.
I’m running into a problem similar to the one you described. Did you end up solving this? What logic did you use for unseen words?
Hey! I stuck with zero-vector initialisation for representing unseen words. I read up on LSTMs, and it turns out they can handle zero-initialised vectors by inferring meaning from the other words in the sentence.
I also added one-hot encoded handcrafted spelling features to the vectors in order to give the model some more information about the words. For example, features like whether the word starts with a capital letter, whether it contains punctuation, etc.
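Something like this (the exact feature set here is illustrative, not my full list):

```python
import string

def spelling_features(word):
    """Hand-crafted binary spelling features, concatenated to the word vector."""
    return [
        int(word[:1].isupper()),                          # starts with a capital
        int(word.isupper()),                              # all caps (abbreviations)
        int(any(c in string.punctuation for c in word)),  # contains punctuation
        int(any(c.isdigit() for c in word)),              # contains a digit
    ]

feats = spelling_features("B.Tech")  # [1, 0, 1, 0]
```

These help a lot for entities like degrees and abbreviations, where spelling shape carries signal even when the word itself is unseen.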
You could also try adding a CNN to your model, in order to capture character level information about a word.
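The idea being: convolve small filters over character embeddings and max-pool, so every word (seen or unseen) gets a fixed-size character-level feature vector. A toy NumPy sketch with made-up sizes (random, untrained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 128 chars, 16-dim char embeddings, 30 filters of width 3
char_emb = rng.normal(size=(128, 16))
filters = rng.normal(size=(30, 3 * 16))

def char_cnn(word):
    """Slide width-3 filters over character embeddings, then max-pool
    over positions, giving a fixed-size feature for any word."""
    chars = [ord(c) % 128 for c in "<" + word + ">"]  # boundary markers
    E = char_emb[chars]                               # (len, 16)
    windows = np.stack([E[i:i + 3].ravel()            # (len - 2, 48)
                        for i in range(len(chars) - 2)])
    conv = windows @ filters.T                        # (len - 2, 30)
    return conv.max(axis=0)                           # max-pool -> (30,)

feat = char_cnn("Engneer")  # still well-defined for unseen/misspelled words
```

In the real model these weights would of course be trained end-to-end alongside the Bi-LSTM.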
Did you end up using the fastai library for your task? I’m planning to do something similar on medical records for my master’s thesis, so I’m a little curious.
Hey! I used tensorflow to build the model as I was more familiar with its functioning.
Oh, ok. But did you use the embeddings from the wiki LM or something similar?
Hi @ankit0110, have you tried implementing this in fastai?