NLP Information Extraction from Text

Does anyone have guidance on how to use fast.ai, or another deep learning library or technique, to extract specific information from text?

I have a clinical note (text written by a medical professional after a procedure) and want to extract some specific elements from it to create a structured version of the report to be stored in a database.

In the courses (at least the parts I’ve completed), the examples working with text are all some sort of classification task.

As an example of a note, here is one from http://www.mtsamplereports.com/colonoscopy-medical-transcription-sample-reports/ - part of which is quoted here:

… The visualized mucosa in the cecum appeared grossly normal. In the proximal ascending colon, a 3 mm polyp was noted and removed in entirety with cold forceps biopsy. Remainder of the visualized mucosa in the ascending colon appeared grossly normal. In the mid transverse colon, a 2 mm polyp was noted and removed in entirety with cold forceps biopsy. In the distal transverse colon, a 7 mm polyp was noted and removed with a cold snare technique. The polyp specimen was easily retrieved. In the splenic flexure, a 1 cm lipoma was noted.

POSTOPERATIVE RECOMMENDATIONS: We will follow up on the biopsy results. If the colon polyps return as adenomatous, the patient will need a repeat colonoscopy in approximately three years.

In my case, I want to retrieve information such as number of polyps, size of polyps, recommendation for follow-up. Some clinical notes also have demographic information such as gender, age, reason for exam (family history, test results, etc) that I want to extract.
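To make the target output concrete, here is a minimal rule-based sketch (plain regular expressions, not deep learning) that pulls polyp count and sizes out of the excerpt above. The pattern, the `extract_polyps` helper, and the `record` field names are all assumptions made for illustration; real notes will phrase findings in many ways this pattern misses, which is the reason for looking at learned models in the first place.

```python
import re

# Part of the sample note quoted above.
NOTE = (
    "In the proximal ascending colon, a 3 mm polyp was noted and removed in "
    "entirety with cold forceps biopsy. In the mid transverse colon, a 2 mm "
    "polyp was noted and removed in entirety with cold forceps biopsy. In the "
    "distal transverse colon, a 7 mm polyp was noted and removed with a cold "
    "snare technique. In the splenic flexure, a 1 cm lipoma was noted."
)

# Assumed pattern: a number, a unit (mm or cm), then the word "polyp".
POLYP_SIZE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)\s+polyps?\b", re.IGNORECASE)


def extract_polyps(text):
    """Return the polyp sizes found in the text, converted to millimetres."""
    sizes_mm = []
    for value, unit in POLYP_SIZE.findall(text):
        size = float(value)
        if unit.lower() == "cm":
            size *= 10  # normalise centimetres to millimetres
        sizes_mm.append(size)
    return sizes_mm


polyps = extract_polyps(NOTE)

# Hypothetical structured record that could be stored in a database.
record = {"polyp_count": len(polyps), "polyp_sizes_mm": polyps}
print(record)
```

The 1 cm lipoma is skipped because the pattern requires the word "polyp" right after the measurement.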

Edit:

@srmsoumya posted a similar question about a different NLP domain back in 2018 - Extracting specific information from documents: NLP - however, the only reply was guidance on how to get text from a document, not on how to actually extract the desired information.

I’ve been thinking about this more, and perhaps it could be treated as a classification problem. I’m not sure of the best way to go about this, but I can imagine at least a few classification phases:

  • Segment text into sentences.
  • Label sentences as important or not
  • Train using ULMFit approach to recognize important sentences
  • For important sentences, label phrases of interest
  • Train using ULMFit approach to recognize phrases of interest
  • Label phrases of interest according to information category
  • Train a model to label phrases according to desired category (polyp, follow-up, exam reason, etc.)

There are a lot of questions here and many of the above ideas I’m not sure how to implement. For example, these items in particular seem difficult:

  • How to generate possible phrases?
  • How to get model to recognize a phrase from a body of text?

The real goal would be to just categorize phrases of interest. These are n-grams that may be anywhere from just 1 word to an entire sentence; I could also imagine a situation where the text of interest crosses sentence boundaries.

Creating a set of possible phrases from a sentence seems to be a combinatorial explosion problem - for example create all uni-grams, all bi-grams, all tri-grams, etc. up to n-grams where n is the length of the sentence. To reduce that list, it seems sensible to first limit the n-gram method to sentences containing text of interest.
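As a rough sketch of the candidate-generation step, the snippet below enumerates every contiguous n-gram of a whitespace-tokenized sentence. The `contiguous_ngrams` helper and its `max_n` cap are assumptions for illustration. If only contiguous phrases count, a sentence of n tokens yields n(n+1)/2 candidates, so growth is quadratic in sentence length; it gets much worse only if phrases may skip words or span many sentences.

```python
def contiguous_ngrams(tokens, max_n=None):
    """Return (start, end, text) for every contiguous n-gram up to max_n tokens.

    `end` is exclusive, so tokens[start:end] reproduces the phrase.
    """
    n_tokens = len(tokens)
    max_n = n_tokens if max_n is None else min(max_n, n_tokens)
    spans = []
    for n in range(1, max_n + 1):
        for start in range(n_tokens - n + 1):
            end = start + n
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


tokens = "a 3 mm polyp was noted".split()  # naive whitespace tokenization

all_spans = contiguous_ngrams(tokens)             # every length from 1 to 6
short_spans = contiguous_ngrams(tokens, max_n=3)  # cap the phrase length

print(len(all_spans), len(short_spans))
```

Capping `max_n` is one simple way to shrink the candidate list further, in addition to restricting it to sentences already flagged as containing text of interest.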

I would approach this problem by selecting a small subset of documents, manually labelling all data I’d like to extract (for example by adding special tokens before and after each phrase of interest), and then treating this as a seq2seq problem. I haven’t done a seq2seq implementation with fastai yet, but just checked that it was covered in Part 2 of the course. Once you make it work on a small dataset, then the question of scaling it will come up - I’m not sure if there is any way around doing the manual labelling work first on a sufficiently large dataset. Maybe a tool like Snorkel could help you with automating the dataset creation.
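For what it’s worth, the special-token labelling described above can be sketched as follows. The `<LABEL>...</LABEL>` markup, the label names, and the `to_bio` helper are all made up for illustration. Note that this converts the markup into per-token BIO tags, which is a sequence-labelling framing rather than seq2seq; the same annotated text could equally be used directly as the target side of a seq2seq pair.

```python
import re

# Assumed markup: <LABEL>phrase</LABEL> around each phrase of interest,
# separated from neighbouring words by whitespace.
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>|(\S+)")


def to_bio(annotated):
    """Convert marked-up text into parallel lists of tokens and BIO tags."""
    tokens, tags = [], []
    for label, phrase, plain in TAGGED.findall(annotated):
        if label:
            # First word of a labelled phrase gets B-, the rest get I-.
            for i, word in enumerate(phrase.split()):
                tokens.append(word)
                tags.append(("B-" if i == 0 else "I-") + label)
        else:
            # Anything outside a marked phrase is tagged O (outside).
            tokens.append(plain)
            tags.append("O")
    return tokens, tags


annotated = (
    "a <SIZE>3 mm</SIZE> <FINDING>polyp</FINDING> was removed "
    "with <METHOD>cold forceps biopsy</METHOD>"
)
tokens, tags = to_bio(annotated)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```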

Thanks for the guidance - that sounds like it might work. Now that you’ve pointed me in the right direction, I was able to find a notebook with an example for translation:

I haven’t found the lesson discussing this in the 2019 part 2, but there is one in the 2018 version. Will give this a try and report back!

You may also want to look at the notebooks for the NLP course in the fastai/course-nlp repo.

Ah - thanks! I do see that there is a section at https://www.fast.ai/2019/07/08/fastai-nlp/ called “Seq2Seq translation and the Transformer” that could be applied to my problem space.

Lots more videos to watch and code to review. Thanks!

Have you tried NER (named entity recognition)? With NER you can extract entities from your document.

@abhimanyu100 thanks for the suggestion. Traditional named entity recognition using libraries such as NLTK or spaCy is a technique I’m already familiar with. One problem I’ve seen in past projects is that most NER systems are trained on general English and thus have poor precision and recall in the medical domain. There are of course medical-specific NER systems, but most of them are clunky and don’t have a good Python API (many seem to be written in Java).

I’m not as familiar with how to do NER with deep learning, hence my question on this forum thread. From the research I’ve read, deep learning techniques seem to do as well as or better at NER than parse-tree-based methods.

So far, thanks to recommendations from others on this thread, I’ve been investigating sequence-to-sequence and related techniques. They seem promising, but I’ve yet to get a good annotated medical dataset to verify that they will work for my use case. I’m thinking the medical NLP datasets from the i2b2/n2c2 challenges will be a good option for that. If others are interested, you can request access to them at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/ - I’ve put in a request but haven’t yet heard back, so I’m not sure how difficult it is to get access.

Thanks for your reply. Yes, resources for medical NER are quite scarce.
Have you tried or heard about “BioBERT”? It’s a pre-trained biomedical language representation model for biomedical text mining.
I’ll also be working on biomedical named entity recognition in a while (probably in two weeks to a month). Can we connect on some platform to discuss the project?

https://arxiv.org/pdf/1901.08746v3.pdf

@abhimanyu100 - Thanks for the BioBERT link. I’ve heard of it, but haven’t yet read the arXiv paper you linked to.

@magiclantern Have you figured out how to do the extraction? I have a similar problem and am stuck on extracting specific information from clinical notes.

Hi, have you finished that project? I’m working on a similar project and have some questions.