Is there a way to extract only the text from scanned images, ignoring table-like structures, so that it can be used for further analysis?
OCR converts everything to text, including the tables.
I'm currently working on extracting issues from documents: I chunk the whole document into sentences, get a sentiment score using the VADER sentiment analyzer / BERT, and run key phrase extraction on the negative sentences.
Is this the right approach, or is there a better way to do it?
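For concreteness, here is a minimal sketch of that pipeline. Everything here is an assumption standing in for the real components: `score_sentiment` is a toy lexicon average in place of VADER/BERT (swap in `vaderSentiment.SentimentIntensityAnalyzer` if you have it installed), the sentence splitter is a naive regex, and the "key phrase" step is just a frequency heuristic, not a real extractor.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter; a real pipeline would use nltk or spaCy here.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def score_sentiment(sentence, lexicon):
    # Stub standing in for VADER/BERT: mean of per-word lexicon scores.
    words = re.findall(r'[a-z]+', sentence.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

def extract_issues(text, lexicon, threshold=-0.05):
    issues = []
    for sent in split_sentences(text):
        score = score_sentiment(sent, lexicon)
        if score < threshold:  # keep only the negative sentences
            # Crude "key phrases": most frequent non-stopword tokens.
            words = [w for w in re.findall(r'[a-z]+', sent.lower())
                     if w not in {'the', 'a', 'is', 'was', 'it'}]
            phrases = [w for w, _ in Counter(words).most_common(3)]
            issues.append({'sentence': sent, 'score': score, 'phrases': phrases})
    return issues

# Tiny hand-made lexicon for illustration only; VADER ships its own.
LEXICON = {'broken': -1.0, 'failed': -1.0, 'great': 1.0, 'slow': -0.5}
```

The shape is the useful part: sentence-level scoring plus a negativity threshold, then phrase extraction only on what survives the filter.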
I’m not sure I understand – do you want to segment your source image into “table” and “not table” regions so you can pass only the “not table” segments to OCR?
I’m working on something similar, and it seems like folks have had success using U-nets. Here are two papers I’ve found especially useful. The first does segmentation based on the image alone (for historical documents but the idea seems more general). The second uses text features as well. Both include (non-fastai) code.
Neither of these will solve your problem as-is, but they’re a good start towards segmenting document images with U-nets. I was able to get the page extraction experiment from dhSegment working with fastai’s unet_learner this morning. It would be fun to try and extend that to table detection.
If you’re not looking for a project to learn on and just want to find tables, consider something like Amazon Textract or Google’s Cloud Vision API. They do basic document segmentation, and you can run the segments through a classifier if you need to.
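If you go the Textract route, you can also drop table text after the fact: the response is a list of `Blocks`, where `CELL` blocks reference their child `WORD` ids, so any `LINE` whose words belong to a cell can be filtered out. A sketch of that filtering, run against a synthetic mini-response I made up to mirror Textract's `Blocks` shape (a real call would be something like `boto3.client('textract').analyze_document(Document=..., FeatureTypes=['TABLES'])`):

```python
def non_table_text(blocks):
    # Collect the ids of WORDs that live inside table cells.
    table_word_ids = set()
    for block in blocks:
        if block.get('BlockType') == 'CELL':
            for rel in block.get('Relationships', []):
                if rel['Type'] == 'CHILD':
                    table_word_ids.update(rel['Ids'])
    # Keep LINE blocks none of whose child words fall inside a table.
    lines = []
    for block in blocks:
        if block.get('BlockType') != 'LINE':
            continue
        child_ids = [i for rel in block.get('Relationships', [])
                     if rel['Type'] == 'CHILD' for i in rel['Ids']]
        if not any(i in table_word_ids for i in child_ids):
            lines.append(block['Text'])
    return '\n'.join(lines)

# Synthetic mini-response (hand-written; shape follows Textract's Blocks).
SAMPLE_BLOCKS = [
    {'BlockType': 'LINE', 'Text': 'Quarterly summary',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w1', 'w2']}]},
    {'BlockType': 'LINE', 'Text': 'Q1 42',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3', 'w4']}]},
    {'BlockType': 'CELL',
     'Relationships': [{'Type': 'CHILD', 'Ids': ['w3', 'w4']}]},
]
```

Here `non_table_text(SAMPLE_BLOCKS)` keeps only the line outside the table, which is exactly the "text without tables" the original question asks for.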