Extract text from Images and analyze

spaturu · August 21, 2019, 6:38pm

Is there a way to extract only the text from scanned images, ignoring table-like structures, so that it can be used for further analysis?

OCR converts everything to text, including the tables.

I’m currently working on extracting issues from documents. I’m chunking the whole document into sentences, getting a sentiment score for each one with the VADER sentiment analyzer or BERT, and running key phrase extraction on the negative sentences.
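For reference, the sentence-level part of this pipeline can be sketched roughly as follows. This is a minimal sketch, not my actual code: `TOY_LEXICON`, `toy_score`, and the sample text are made up so that it runs without any dependencies, and the regex splitter is much cruder than a real sentence tokenizer. With VADER, `score_fn` would instead return `SentimentIntensityAnalyzer().polarity_scores(sentence)["compound"]`, and the `-0.05` cutoff follows the threshold the VADER README suggests for negative text.

```python
import re

# Hypothetical toy lexicon standing in for a real sentiment model.
TOY_LEXICON = {"fails": -0.6, "broken": -0.7, "slow": -0.4, "great": 0.7, "works": 0.4}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def toy_score(sentence):
    """Sum the lexicon scores of the words and clamp the result to [-1, 1]."""
    words = re.findall(r"[a-z']+", sentence.lower())
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)
    return max(-1.0, min(1.0, total))


def negative_sentences(text, score_fn=toy_score, threshold=-0.05):
    """Return (sentence, score) pairs whose score is at or below the threshold."""
    scored = [(s, score_fn(s)) for s in split_sentences(text)]
    return [(s, score) for s, score in scored if score <= threshold]


doc = "The login page works. Export fails on large files! Is the report slow?"
issues = negative_sentences(doc)
for sentence, score in issues:
    print(f"{score:+.2f}  {sentence}")
```

The negative sentences returned here are the ones that would go on to key phrase extraction.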

Is this the right approach, or is there a better way to do it?

thebenedict · August 21, 2019, 6:55pm

I’m not sure I understand – do you want to segment your source image into “table” and “not table” so you can only pass the “not table” segments to OCR?

I’m working on something similar, and it seems like folks have had success using U-nets. Here are two papers I’ve found especially useful. The first does segmentation based on the image alone (for historical documents but the idea seems more general). The second uses text features as well. Both include (non-fastai) code.

spaturu · August 21, 2019, 8:21pm

do you want to segment your source image into “table” and “not a table” so you can only pass the “not table” segments to OCR?

Exactly, Michael. I will pass the non-table segments to OCR to get clean text, so that I can use that text for sentiment analysis, etc.
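To make the idea concrete, here is a rough sketch of the masking step I have in mind, assuming a table detector that returns bounding boxes. The `mask_regions` helper, the toy page, and the box coordinates are all hypothetical. In practice the page would be a NumPy or PIL image rather than nested lists, and the masked result would be handed to the OCR engine.

```python
def mask_regions(page, boxes, fill=255):
    """Return a copy of `page` with every box painted over with `fill`.

    `page` is a list of rows of grayscale values. Each box is
    (left, top, right, bottom) in pixels, with right and bottom exclusive.
    """
    height = len(page)
    width = len(page[0]) if page else 0
    masked = [row[:] for row in page]  # copy so the input is left untouched
    for left, top, right, bottom in boxes:
        # Clip the box to the page so out-of-range detections are harmless.
        for y in range(max(0, top), min(height, bottom)):
            for x in range(max(0, left), min(width, right)):
                masked[y][x] = fill
    return masked


# Toy 6x4 "page" of black pixels and one hypothetical table detection.
page = [[0] * 6 for _ in range(4)]
table_boxes = [(2, 1, 5, 3)]

clean = mask_regions(page, table_boxes)
for row in clean:
    print(row)
```

Painting the table regions white instead of cropping them out keeps the page layout intact, which may help the OCR engine with reading order.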

spaturu · August 23, 2019, 1:23pm

Can you also point me to the code?

Thanks in advance

thebenedict · August 23, 2019, 9:08pm

Code and docs for dhSegment are at https://dhsegment.readthedocs.io/en/latest/, and the project page (background/dataset/code) for the semantic structure paper is at http://personal.psu.edu/xuy111/projects/cvpr2017_doc.html.

Neither of these will solve your problem as-is, but they’re a good start towards segmenting document images with U-nets. I was able to get the page extraction experiment from dhSegment working with fastai’s unet_learner this morning. It would be fun to try and extend that to table detection.

If you’re not looking for a project to learn on and just want to find tables, consider something like Amazon Textract or Google’s Cloud Vision API. They do basic document segmentation, and you can run the segments through a classifier if you need to.
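As a rough illustration of the Textract route, here is a sketch that keeps only the text lines outside tables. It assumes the block layout I understand Textract's table analysis to return: `LINE` and `CELL` blocks that each list their `WORD` children under a `CHILD` relationship. The `sample_blocks` data and both helper names are made up, so check the layout against a real response before relying on it.

```python
def child_ids(block):
    """Collect the ids listed under a block's CHILD relationships."""
    return [
        block_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for block_id in rel["Ids"]
    ]


def non_table_lines(blocks):
    """Return the text of LINE blocks that share no words with any table CELL."""
    table_word_ids = set()
    for block in blocks:
        if block["BlockType"] == "CELL":
            table_word_ids.update(child_ids(block))

    lines = []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        if table_word_ids.intersection(child_ids(block)):
            continue  # at least one word of this line sits inside a table cell
        lines.append(block["Text"])
    return lines


# Hand-written stand-in for the "Blocks" list of a response (not real output).
sample_blocks = [
    {"BlockType": "LINE", "Id": "l1", "Text": "Printer jams daily",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2", "w3"]}]},
    {"BlockType": "LINE", "Id": "l2", "Text": "Qty 12",
     "Relationships": [{"Type": "CHILD", "Ids": ["w4", "w5"]}]},
    {"BlockType": "TABLE", "Id": "t1",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2"]}]},
    {"BlockType": "CELL", "Id": "c1",
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"BlockType": "CELL", "Id": "c2",
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
]

print(non_table_lines(sample_blocks))
```

This drops a whole line if any of its words falls in a table cell, which is a deliberately conservative choice for getting clean prose.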

Also, I haven’t tried this table detection approach, but it may be worth a look.