Hello Folks,
Trying to come up with a clinical NLP solution (English language) where we need to fill in form fields based on the contents of a document. The fields include person names (three different people: sender, receiver, and the subject under discussion), date and time, and a few fields describing the subject and his condition.
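To make the target concrete, here's a minimal sketch of the kind of structured output we're after. Everything here is hypothetical (the form class, the label names, the mock NER output); a real pipeline would define its own tag set:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical form mirroring the fields described above.
@dataclass
class CaseForm:
    sender: Optional[str] = None
    receiver: Optional[str] = None
    subject: Optional[str] = None
    date_time: Optional[str] = None
    condition_notes: list = field(default_factory=list)

def fill_form(entities):
    """Map (label, text) pairs from an upstream NER step into the form.
    The labels are made up for illustration."""
    form = CaseForm()
    for label, text in entities:
        if label == "SENDER":
            form.sender = text
        elif label == "RECEIVER":
            form.receiver = text
        elif label == "SUBJECT":
            form.subject = text
        elif label == "DATETIME":
            form.date_time = text
        elif label == "CONDITION":
            form.condition_notes.append(text)
    return form

# Example with mock NER output:
form = fill_form([
    ("SENDER", "Dr. A. Smith"),
    ("SUBJECT", "John Doe"),
    ("DATETIME", "2019-03-14 10:30"),
    ("CONDITION", "type 2 diabetes"),
])
```

Framing the problem this way (extraction first, then slot filling) also makes it easier to measure accuracy per field rather than as one opaque number.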
We’ve already tried this with an open-source tool called cTAKES (based on Apache UIMA – https://uima.apache.org/), which ships with models pre-trained on a specific corpus and makes extensive use of medical dictionaries such as SNOMED, ICD, etc.
Currently exploring other options, including building such a pipeline on our own from scratch (to be decided based on feasibility and timelines). I’m here for suggestions on a solution approach, some wisdom based on your experience, comments on achievable overall accuracy, and a ballpark timeline (assume 2 good developers with limited experience in building models, plus 1 person who can double up as an ok-ish developer – me), in case any of you have worked on similar projects before, maybe in a different domain.
Also, does it necessarily have to be a pipeline? Is a monolithic architecture possible? What should be the considerations in this regard?
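On the pipeline question: one reason pipelines dominate in NLP is that each stage (text extraction, sentence splitting, NER, form filling) can be swapped and tested independently. A rough sketch of the idea, with the stage names purely illustrative toy stand-ins:

```python
from functools import reduce

def compose(*stages):
    """Chain processing stages so the output of one feeds the next."""
    return lambda doc: reduce(lambda acc, stage: stage(acc), stages, doc)

# Toy stages standing in for real components (e.g. OCR, tokenization, NER).
def extract_text(doc):
    return doc["raw"].strip()

def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def tag_entities(sentences):
    # Placeholder: a real stage would call a model or dictionary lookup.
    return [(s, []) for s in sentences]

pipeline = compose(extract_text, split_sentences, tag_entities)
result = pipeline({"raw": " Patient seen today. Follow-up in two weeks. "})
```

A monolithic service can still wrap the same stages behind one API; keeping them separable internally mainly helps with debugging which step lost information.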
Bit of a surprise: it turns out the data won’t necessarily be free text. It may contain tables, embedded images, and checkboxes, or it could just as well be plain text. I’m uploading a couple of samples from the internet in case someone wishes to take a look.
And the template isn’t fixed; it varies on a case-by-case basis.
Questions…
1. How feasible is an NLP solution in such a case?
2. What kind of preprocessing would be required to make this suitable for NLP? Given that we already have an NLP pipeline in the form of cTAKES, we can run the output of this preprocessing through cTAKES to check if it works as a POC. Is OCR (off-the-shelf libraries are available, or we can try building our own CNNs) a good way of converting the above template into something consumable by an NLP pipeline, provided we don’t plan to support handwritten text?
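On the OCR question, a common off-the-shelf route is Tesseract via the pytesseract wrapper, followed by text cleanup before handing anything to cTAKES. A minimal sketch, assuming printed (not handwritten) scans; the cleanup rules below are illustrative examples of the post-OCR normalization typically needed, not an exhaustive set:

```python
import re

def ocr_page(image_path):
    """OCR a scanned page to text. pytesseract wraps the Tesseract engine;
    the imports are local so the cleanup below can run standalone."""
    import pytesseract  # pip install pytesseract (needs Tesseract installed)
    from PIL import Image
    return pytesseract.image_to_string(Image.open(image_path))

def normalize_ocr_text(text):
    """Cleanup typically needed before feeding OCR output to an NLP pipeline."""
    text = text.replace("\u2610", "[ ]").replace("\u2611", "[x]")  # checkbox glyphs
    text = re.sub(r"-\n(\w)", r"\1", text)   # rejoin words hyphenated at line ends
    text = re.sub(r"\n(?!\n)", " ", text)    # unwrap single line breaks
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "Patient was admit-\nted on 12 Mar.\nCondition stable.\n\u2611 Follow-up"
cleaned = normalize_ocr_text(sample)
# cleaned → "Patient was admitted on 12 Mar. Condition stable. [x] Follow-up"
```

Tables and embedded images are the harder part; plain OCR flattens table structure, so a layout-analysis step (or at least per-cell OCR) may be needed before the text is meaningful to a sentence-based pipeline.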
Take a look at this Kaggle notebook. I had a go at reproducing the claimed 94% accuracy but only managed 67%; it could probably be improved without too much effort.
Hi, thanks for responding, and sorry about the delay. We’re not absolutely sure of the input yet, but it seems it’ll be mostly printed text rather than handwritten. It will, however, be a scanned document that we’ll have to make parsable so it can be fed to the NLP pipeline.