Starting a thread specifically to collate information on datasets and code repositories that can help with “document layout analysis”.
As per Wikipedia: document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement of those zones in their correct reading order.
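To make the two steps in that definition concrete (separating text zones from non-text ones, then putting them in reading order), here is a minimal sketch. It assumes some detector has already produced labelled bounding boxes; the `Region` class, the label names, and the `line_tolerance` parameter are all hypothetical and not taken from any particular library. It also assumes a simple single-column page, so real multi-column layouts would need something smarter.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str  # hypothetical labels, e.g. "text", "image", "table"
    x: int      # left edge in pixels
    y: int      # top edge in pixels
    w: int      # width in pixels
    h: int      # height in pixels


def text_regions_in_reading_order(regions, line_tolerance=10):
    """Keep only text regions and order them top-to-bottom, left-to-right.

    Regions whose top edges lie within `line_tolerance` pixels of the first
    region in a row are treated as being on the same row.
    Assumes a single-column layout.
    """
    # Step 1: segmentation result -> keep text zones only.
    text = [r for r in regions if r.label == "text"]

    # Step 2: group into rows by top edge, then sort each row by left edge.
    text.sort(key=lambda r: (r.y, r.x))
    rows = []
    for r in text:
        if rows and abs(r.y - rows[-1][0].y) <= line_tolerance:
            rows[-1].append(r)
        else:
            rows.append([r])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda r: r.x))
    return ordered


if __name__ == "__main__":
    page = [
        Region("text", 300, 12, 200, 40),
        Region("image", 0, 100, 500, 80),
        Region("text", 10, 10, 200, 40),
        Region("text", 10, 200, 490, 60),
    ]
    for region in text_regions_in_reading_order(page):
        print(region)
```

The hard part in practice is producing the labelled regions in the first place, which is where the datasets and repositories requested in this thread come in.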
Digitization of documents, document layout analysis, etc. are major real-world problems in the banking domain.
Please let me know if you know of any good dataset / code repository for document layout analysis.
PRImA Research provides a few layout datasets that you can log in and request from their website: https://www.primaresearch.org/datasets
I will update the thread as and when I get more information.