Document Layout Analysis datasets and code repos

Hello all,

Starting a thread specifically to collate information on datasets and code repositories that can help with “document layout analysis”.

As per Wikipedia: document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.
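
To make the two steps in that definition concrete (separating text zones from non-text zones, then arranging them in reading order), here is a minimal sketch. It assumes the regions have already been detected and labelled by some model; the `Region` class, the label names, and the `row_tolerance` parameter are all hypothetical, and the row-major ordering is a simple heuristic that will not handle multi-column pages correctly.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str  # hypothetical class names, e.g. "text", "figure", "table"
    x: int      # left edge in pixels
    y: int      # top edge in pixels
    w: int      # width in pixels
    h: int      # height in pixels


def reading_order(regions, row_tolerance=10):
    """Keep only text regions and order them top-to-bottom, then left-to-right."""
    text = sorted((r for r in regions if r.label == "text"),
                  key=lambda r: (r.y, r.x))
    rows = []
    for r in text:
        # Join the current row if this region's top edge is within the
        # tolerance of the row's first region; otherwise start a new row.
        if rows and abs(r.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(r)
        else:
            rows.append([r])
    # Within each row, read left to right.
    return [r for row in rows for r in sorted(row, key=lambda r: r.x)]


regions = [
    Region("text", 300, 48, 200, 40),    # right block, slightly higher
    Region("figure", 50, 120, 450, 200),  # non-text zone, filtered out
    Region("text", 50, 50, 200, 40),      # left block on the same row
    Region("text", 50, 340, 450, 80),     # paragraph below the figure
]
ordered = reading_order(regions)
print([(r.x, r.y) for r in ordered])
```

The tolerance is what lets the left block come first even though the right block starts two pixels higher. Real layouts usually need something stronger (column detection, XY-cut, or a learned ordering), so treat this only as an illustration of the problem statement.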

Digitization of documents, document layout analysis, etc. are major real-world problems in the banking domain.

Please let me know if you know of any good datasets or code repositories for document layout analysis.

PRImA Research provides a few layout datasets that you can request from their website after logging in.

I will update the thread as and when I get more information.



Another dataset for document layout analysis


I am also studying the layout problem (but more on the generative side).
I am looking for any thought-provoking discussion here!
This dataset is really useful. They even release a pretrained model for it.