Dataset request - document segmentation

ecatkins · January 17, 2019, 4:09pm

Is anyone aware of a dataset for document segmentation? I would be happy with either image segmentation or object detection (e.g. bounding boxes).

This dataset is along the lines of what I am looking for; however, I also want to train a deep learning model to identify text chunks / paragraphs, and these are not labelled in this case.

marcmuc · January 17, 2019, 7:25pm

I don’t know of a dataset like this, but I have seen a talk by people with a similar task (detecting the corners of a piece of paper in a picture of a receipt). They created their own dataset from only 50 actual images; the rest was generated using image augmentation methods. That was enough to train a model to very good accuracy. So that would be a possibility if you don’t find a set.
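To make the augmentation idea concrete, here is a minimal sketch of the key point: whenever the image is transformed, the labels (here, corner coordinates) have to be transformed the same way. This is not the code from the talk. The function names `hflip`, `vflip` and `augment` are made up for illustration, images are plain nested lists so the example has no dependencies, and only flips are shown. A real pipeline would more likely use NumPy or an augmentation library and add rotations, crops and lighting changes.

```python
import random

def hflip(image, corners):
    """Mirror an image left-to-right and move the (x, y) labels with it."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    moved = [(width - 1 - x, y) for x, y in corners]
    return flipped, moved

def vflip(image, corners):
    """Mirror an image top-to-bottom and move the (x, y) labels with it."""
    height = len(image)
    flipped = [row[:] for row in image[::-1]]
    moved = [(x, height - 1 - y) for x, y in corners]
    return flipped, moved

def augment(image, corners, n, seed=0):
    """Return n randomly flipped (image, corners) pairs from one labelled image."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        img, pts = image, corners
        if rng.random() < 0.5:
            img, pts = hflip(img, pts)
        if rng.random() < 0.5:
            img, pts = vflip(img, pts)
        samples.append((img, pts))
    return samples

# Tiny 3x4 "image" with a single labelled pixel at x=1, y=0.
image = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
corners = [(1, 0)]
flipped, moved = hflip(image, corners)
```

Note that a flip changes which corner is "top-left", so if the model expects corners in a fixed order, the label list would need to be reordered after each flip.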

The other option would be to use classical computer vision techniques (outside of DL) to create such a set, e.g. with a modified version of this:

http://www.danvk.org/2015/01/07/finding-blocks-of-text-in-an-image-using-python-opencv-and-numpy.html
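As a rough illustration of how non-DL labelling could work, below is a minimal sketch of a horizontal projection profile: rows containing ink are grouped into bands, and bands separated by enough blank rows become separate boxes. This is a different and much simpler technique than the one in the linked post, and it assumes a clean, deskewed, single-column page that has already been binarized (1 = ink, 0 = background). The names `find_text_bands`, `text_boxes` and the `min_gap` parameter are made up for this example.

```python
def find_text_bands(binary, min_gap=2):
    """Return (top, bottom) row ranges that contain ink.

    Bands separated by fewer than min_gap blank rows are merged, so
    line spacing inside a paragraph does not split it.
    """
    bands = []
    start = None      # first inked row of the current band
    last_ink = None   # most recent inked row seen
    blank_run = 0     # consecutive blank rows since last_ink
    for y, row in enumerate(binary):
        if any(row):
            if start is None:
                start = y
            last_ink = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                bands.append((start, last_ink))
                start = None
    if start is not None:
        bands.append((start, last_ink))
    return bands

def text_boxes(binary, min_gap=2):
    """Return (left, top, right, bottom) boxes, one per band of text."""
    boxes = []
    for top, bottom in find_text_bands(binary, min_gap):
        cols = [x for row in binary[top:bottom + 1]
                for x, value in enumerate(row) if value]
        boxes.append((min(cols), top, max(cols), bottom))
    return boxes

# Tiny binarized "page": two blocks of ink separated by two blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
boxes = text_boxes(page, min_gap=2)
```

Boxes produced this way would still need a manual check before being used as training labels, since multi-column layouts, figures and skewed scans will break the blank-row assumption.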