Bounding box prediction in scanned documents

I am working on a project that involves finding dozens of bounding boxes in scanned, handwritten documents. I want to obtain the areas of the document that contain specific expressions, somewhat like what Google Translate does on images when it highlights the places where it sees text. Once I have those areas, I want to run OCR on them to get the text.

My question is, how should I go about building a model (thinking about CNNs here) capable of predicting those boxes? I have thought of the following:

  • The usual (x,y,w,h) tuple won’t work because I have a variable (usually very high) number of boxes per document.
  • Things like Faster RCNN and YOLO perform classification on the boxes, which I don’t want.
  • Segmentation could help find the areas of interest, albeit with pixel precision. I could then use some algorithm to get the bounding boxes given a segmentation mask. That added step could prove to be quite difficult though.
  • Dividing the document into a grid and predicting for each grid cell whether it is inside or outside of a box could also work, but could again lead to non-rectangular boxes, which I would need to fix afterwards. This added step looks like it would be easier than the one in segmentation though. This is currently my best candidate.
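
A minimal sketch of the post-processing step mentioned in the last two bullets, assuming the model output has already been thresholded into a 2D grid of 0/1 values (either pixels or grid cells). It groups touching cells into blobs with a flood fill and returns the tight rectangle around each blob. The function name `mask_to_boxes` and the `min_cells` parameter are made up for illustration.

```python
from collections import deque


def mask_to_boxes(mask, min_cells=1):
    """Return one (x, y, w, h) box per 4-connected blob of truthy cells."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over one blob, tracking its extent.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0 = r1 = r
            c0 = c1 = c
            count = 0
            while queue:
                cr, cc = queue.popleft()
                count += 1
                r0, r1 = min(r0, cr), max(r1, cr)
                c0, c1 = min(c0, cc), max(c1, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Drop blobs that are too small to be a real expression (noise).
            if count >= min_cells:
                boxes.append((c0, r0, c1 - c0 + 1, r1 - r0 + 1))
    return boxes


# Toy 3x6 grid with two separate blobs.
grid = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
boxes = mask_to_boxes(grid)
```

Because each box is the tight rectangle around a blob, a non-rectangular prediction (like the L-shaped blob in the toy grid) still comes out as a rectangle. Blobs that touch each other get merged into one box, so this only works when the expressions are separated by at least one empty cell.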

Any suggestions on how to approach it are welcome. Links to code that does something like this would be very useful as well.


I will just put some of my thoughts next to yours here.

  • First of all what kind of data do you have? Do you even have segmentation masks?
    If you have, making bounding boxes out of this is actually pretty easy, you will find lots of code online.

  • About the classification: ultimately you will want to do classification. However, you are looking at a binary text / no-text classification rather than at many classes. (You might want to look at focal loss to deal with the class imbalance that the dominant “no text” label is likely to cause.)

  • Using a large number of anchor boxes as in SSD or YOLO should also be able to detect a moderately high number of boxes… However, if the number is too high you will have problems. I recently listened to some Swiss researchers from ZHAW who do music symbol detection. They were dealing with the problem of very many, very tiny objects (in contrast to what the prominent detectors handle on natural images, which contain only a few objects). Depending on how many objects you have and how tiny they are, their research might be well worth a look for you:

  • I assume that for the OCR step you can just use common state-of-the-art models and feed them the contents of your bounding boxes. Packing all of this into one network would definitely be an interesting experiment as well (no idea if other people have done that), but it is probably a lot harder to start with.
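
To make the focal loss suggestion above concrete, here is a minimal sketch of the binary version for a single prediction, written in plain Python rather than as a fastai or PyTorch loss. The values `gamma=2.0` and `alpha=0.25` are commonly used defaults, not something tuned for this problem, and treating "text" as label 1 is an assumption.

```python
import math


def binary_focal_loss(p, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one prediction.

    p      -- predicted probability of the "text" class (label 1)
    target -- 1 for text, 0 for no text
    """
    p = min(max(p, eps), 1.0 - eps)             # avoid log(0)
    p_t = p if target == 1 else 1.0 - p         # probability given to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy "no text" cell (model says 5% text) vs. a hard one (model says 95% text).
easy_negative = binary_focal_loss(0.05, 0)
hard_negative = binary_focal_loss(0.95, 0)
```

With `gamma=0` the modulating factor disappears and this reduces to alpha-weighted cross entropy. The point of the extra factor is that the many easy "no text" cells contribute very little, so they do not drown out the few cells that contain text.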

Thanks for your reply!

  • They are expressions forming a Turing machine. Something like (q0, X) = (q1, Y, L), though many variations exist. There are dozens (20 to 40) of these per document, all handwritten. I have no data other than the documents, so I need to label them manually.

  • That’s an interesting perspective that would allow me to use some models that use classification. Thanks!

  • Yes, SSD and YOLO don’t seem to be a good fit for me. They are also quite big and I would prefer to start small. I also don’t have that much data to begin with. Thanks a lot for the research references, definitely useful.

  • I plan to try out some pre-built SOTA models, but ultimately I’d love to build my own tiny OCR model. I’m much less worried about this step though, there seems to be much more literature and code available for this.

I am not sure that 20-40 objects already exceeds what the anchor box methods can comfortably handle. You can choose how many anchor boxes there are and at which sizes, and make that match your case.
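
As a rough illustration of that point, the sketch below generates one set of anchors per cell of a feature map grid and counts them. All the numbers (a 16x16 grid, one scale, three wide aspect ratios chosen because lines of text are wider than tall) are hypothetical and would need to be matched to the actual documents.

```python
def make_anchors(grid_rows, grid_cols, sizes, aspect_ratios):
    """Return (cx, cy, w, h) anchors in coordinates relative to the image (0..1)."""
    anchors = []
    for r in range(grid_rows):
        for c in range(grid_cols):
            # Anchor centre sits in the middle of the grid cell.
            cx = (c + 0.5) / grid_cols
            cy = (r + 0.5) / grid_rows
            for s in sizes:
                for ar in aspect_ratios:
                    # Keep the area at s**2 while setting width / height = ar.
                    w = s * ar ** 0.5
                    h = s / ar ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors


# Hypothetical setup: 16x16 grid, one scale, three wide aspect ratios.
anchors = make_anchors(16, 16, sizes=[0.1], aspect_ratios=[2.0, 4.0, 8.0])
num_anchors = len(anchors)
```

Even this small setup gives 768 candidate boxes, well above 20 to 40 expressions per page. What matters more is whether the grid is fine enough that two neighbouring expressions do not compete for the same anchors.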

If you need to label them manually you will have a lot of work, and you probably want to draw boxes around things rather than segment them. However, if those documents are basically black or dark ink on white paper, you can use standard computer vision techniques to segment your text pretty easily and then just clean up your labels a bit afterwards. That way you at least would not have to label everything yourself.
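
One such standard technique is Otsu thresholding, which picks the grey level that best separates dark ink from light paper. Below is a minimal pure Python sketch, assuming the scan is available as rows of 8-bit grey values (0 = black, 255 = white). In practice an image library would do this much faster. The function names are made up for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of grey values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def ink_mask(image):
    """Return a 0/1 mask with 1 where the pixel is classified as ink."""
    t = otsu_threshold([v for row in image for v in row])
    return [[1 if v <= t else 0 for v in row] for row in image]


# Tiny fake scan: three dark ink pixels on a light background.
page = [
    [250, 248, 30, 252],
    [249, 25, 20, 251],
    [247, 250, 253, 250],
]
mask = ink_mask(page)
```

The resulting mask marks individual pen strokes, not whole expressions, so it is only a starting point for labels: strokes belonging to one expression still have to be grouped and the result checked by hand.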

I think the benefit of using SSD or similar, especially in fastai, is that you already have an implementation lying around from the course which you can adapt to your needs relatively easily. You also have nice explanations from Jeremy. And you can just use a small base network so that those networks are not super large either.

Do you have an example you can share?

I am also working on a similar problem. So far I have thought of doing the following:

  • create masks on the intended text to be found
  • use Unet from Lesson 3 camvid to predict the masks with areas of interest, use code from here
  • create bounding boxes over the mask, using this code
  • extract the bounding boxes as image
  • pass it on to pytesseract or any online image to text API.
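
The last two steps of this list can be sketched as follows, assuming the boxes are already available as (x, y, w, h) tuples and the image is a list of pixel rows. The OCR engine is passed in as a function so that the sketch runs without one installed. `crop`, `read_regions` and `fake_ocr` are names made up for illustration.

```python
def crop(image, box):
    """Cut the (x, y, w, h) region out of an image given as rows of pixels."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def read_regions(image, boxes, ocr_fn):
    """Run ocr_fn on every box, in reading order (top to bottom, then left to right)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return [(box, ocr_fn(crop(image, box))) for box in ordered]


def fake_ocr(region):
    """Stand-in for a real OCR call: just reports the crop size."""
    return f"{len(region[0])}x{len(region)}"


# Toy 4x6 image whose pixel value encodes its position (10 * row + column).
image = [[c + 10 * r for c in range(6)] for r in range(4)]
boxes = [(3, 2, 2, 2), (0, 0, 3, 1)]
results = read_regions(image, boxes, fake_ocr)
```

With real data the crops would more likely come from PIL's `Image.crop((left, upper, right, lower))`, and `ocr_fn` could be `pytesseract.image_to_string`, which accepts a PIL image. Sorting by the top-left corner is only a rough reading order and will mix up boxes that sit on the same line at slightly different heights.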

Let me know if this helps you!


Hi, sorry for ignoring this thread for so long.
Have you been able to make some progress with your project? Your ideas seemed plausible and are close to what I was looking for, so I am curious about your results!

Hi, I am working on a similar project.
We can probably use data from

Scroll all the way down until you see ‘palio data set’.
They have bounding boxes for a few PDF documents.

A little more explanation of what I would like to do:
Draw bounding boxes around the elements of a PDF document: paragraphs, images, tables, headers, footers, etc.

If anybody can point me to a starter notebook, that would be awesome.

Hi, I am trying to do something similar with resumes. I am trying to train a Mask R-CNN model to detect fields like name, skills, education, etc. in a resume. I have just 1000 training examples. Can you point me somewhere to get started?