I am working on a project that involves finding dozens of bounding boxes in scanned, handwritten documents. I want to obtain the areas of the document that contain specific expressions, kinda like how Google Translate highlights the places where it sees text in an image. Once I have those areas, I want to run OCR on them to get the text.
My question is, how should I go about building a model (thinking about CNNs here) capable of predicting those boxes? I have thought of the following:
- The usual (x,y,w,h) tuple won't work because I have a variable (usually very high) number of boxes per document.
- Things like Faster R-CNN and YOLO perform classification on the boxes, which I don't want.
- Segmentation could find the areas of interest, albeit with pixel precision. I could then use some algorithm to get bounding boxes from the segmentation mask, but that added step could prove to be quite difficult.
- Dividing the document into a grid and predicting, for each cell, whether it is inside or outside a box could also work, but it could again produce non-rectangular regions that I'd need to fix afterwards. That added step looks easier than the one in segmentation though, so this is currently my best candidate.
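For what it's worth, the mask-to-boxes step (whether the mask comes from pixel-level segmentation or from a coarse grid of inside/outside predictions) may be less painful than I feared: a connected-components pass over the binary mask, taking the min/max extent of each component, already yields axis-aligned boxes. Here's a minimal pure-Python sketch of that idea (in practice something like OpenCV's `cv2.connectedComponentsWithStats` or `scipy.ndimage.label` would do this faster; the `min_area` threshold is just a guess at filtering noise):

```python
from collections import deque

def mask_to_boxes(mask, min_area=1):
    """Turn a binary mask (2D list of 0/1) into (x, y, w, h) boxes.

    Each 8-connected component of 1s becomes one axis-aligned box;
    components with fewer than min_area pixels are dropped as noise.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if mask[y0][x0] and not seen[y0][x0]:
                # BFS over this component, tracking its bounding extent.
                q = deque([(y0, x0)])
                seen[y0][x0] = True
                xmin = xmax = x0
                ymin = ymax = y0
                area = 0
                while q:
                    y, x = q.popleft()
                    area += 1
                    xmin, xmax = min(xmin, x), max(xmax, x)
                    ymin, ymax = min(ymin, y), max(ymax, y)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                q.append((ny, nx))
                if area >= min_area:
                    boxes.append((xmin, ymin, xmax - xmin + 1, ymax - ymin + 1))
    return boxes

mask = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(mask_to_boxes(mask))  # [(1, 1, 2, 2), (5, 2, 2, 2)]
```

The same routine works on the grid idea by treating each grid cell as one "pixel" and scaling the resulting boxes back up to image coordinates. Overlapping expressions would merge into one box, though, which is a limitation of any purely mask-based scheme.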
Any suggestions on how to approach it are welcome. Links to code that does something like this would be very useful as well.