Ideas on document processing using deep learning

Any idea on how to implement deep learning model like ? I have tried object detections model on a few samples but the network does not seems to learn anything. I am trying to process invoice and extract a few key-valued pairs in it . Any idea or help is appreciated.


HI Neeb hope you are well!

I was building a classifier which required a lot of text to be recognized. I found it rather difficult just using the classification techniques from lessons 1 and 2.

I am working on something else at the moment but will come back to it at some point, however in my travels, I discovered a library that would be ideal for extracting the characters from the images.

Then you could build your model on the text extracted.

Have a jolly day.
mrfabulous1 :smiley::smiley:

1 Like

Hi mrfabulous1, appreciated on your comments, I did build a template matching model using output from tesseract, but this method required a lot of rules and template pre-defined, which would not be a ideal case for me, since we have more than 100 different types of PDF, I am looking for a more generalised method to extract the text, but not sure the correct deep learning method to do this.

Hi Neeb, I’m working on a similar problem for a different type of document, and have been experimenting with the chargrid representation introduced by SAP Research last year. Here’s the paper (it’s been posted in a few other threads):

You may have seen that Rossum published a paper on the basics of their method, a different approach from SAP’s:

There may be heuristics, post processing, etc. that isn’t included in the papers but is important for making a production system work well. I think this type of domain-specific information extraction from documents is still an active area of research, and I don’t know of an open source tool that will do what you’re looking for end-to-end. I also don’t think there are open labeled datasets for invoice understanding, so training a DL model will involve some labeling drudgery.

That said, I’ve implemented part of the chargrid paper and have had some promising early results, so if you decide to work through it and have questions feel free to get in touch.


Hi @thebenedict, thanks for posting the link to both of the papers , I will start to look into them, infact I didn’t know that Rossum has published a paper, I am quite new to the CV community, can I know how you found the paper? Thanks for the help. :smiley:

@Neeb I find most papers in this space on, and Andrej Karpathy’s tool is useful for efficient searching.

1 Like

Wow thebenedict
Thats a lot of Computer Science papers!

Arxiv Sanity Preserver

Built in spare time by ]( to accelerate research.
Serving last 84779 papers from cs.[CV|CL|LG|AI|NE]/stat.ML

mrfabulous1 :smiley::smiley:

Hi, @thebenedict, I am trying to implement the network in table understanding in structured documents but I am stucked on the implementation part, anyone can explain to me about the convolution over sequence mentioned in the paper?

I am guessing they feed the network by sequence consist of wordboxes , but to do this we actually have to pad the input with certain size, but to do padding, we need masking for the network to ignore those pad value, and it seems like masking is not supported in conv2d which is presented in the paper. I am also guessing that ‘?’ in the network diagram is the sequence length?

Could anyone enlighten me on this?

Thank you.

Hi Neeb,

I can’t help there, unfortunately. I’ve made some progress with chargrid-like document representations but I haven’t worked with the approach Rossum uses. If I come across anything that might be useful I’ll update here. Sorry I can’t offer more, good luck.


1 Like

I have been working on a similar problem (document understanding and structured text extraction) and here are some resources/datasets I found useful:

Hope these resources help. @thebenedict @Neeb


Hi @thebenedict,
I am implementing the chargrid paper, nice to talk with you. Have you achieved acceptable accuracy (>80%) with chargrid model?


How was your experience implementing chargrid paper ?. Can you share how the results are like.

Hi @harikrishnanrajeev @phucnsp,

I’ve implemented the chargrid document representation (or something like it), but I’m not using fastai to train models. My goal is to segment documents that look very different from invoices. That said, I have recall > 85% and precision >90% on my dataset, and here’s what a typical document looks like in case it helps:

I’m hoping to get back to fastai/PyTorch soon – it’ll be cheaper and more interesting than the commercial object detection I’m using now.

A few observations:

  1. The chargrid representation appears to be only a little better than training the same model on raw document images, and in some cases it’s slightly worse. As a sanity check it’s worth trying to train a model on your source images in parallel with chargrid.

  2. I tried RGB chargrids instead of grayscale, assigning each of the top 50 most common characters in my dataset a value between [0,0,0] and [255,255,255]. I thought this larger embedding space would help the model be more specific, but the results were significantly worse than with the grayscale style above. This was a surprise to me. They look pretty though:


Also worth checking out this undergraduate thesis by Timo Denk, a student supervised by Christian Reisswig, one of the Chargrid authors. It’s mainly about extending chargrid to a “wordgrid” based on BERT embeddings, but it goes into more implementation details about chargrid than the original paper:

BERTgrid paper based on that thesis:

Hope some of that’s useful.



Hi @thebenedict,
Thank you very much for your response, great sharing information.
I have some concerns:

  • Can you show the ground truth mask? In the chargrid paper they mentioned that chargrid outperforms image-only model on header items, those have small segmented area. For the items which cover big area, both models output very similar result.
  • Do you use pretrained model and resnet backbone? or you replicated exactly the same model as described in chargrid paper.

And also I found a paper from a team in Vietnam which used “chargrid representation + Couple Unet + Self-attention + MultiStage” . Interesting to read but I haven’t implemented yet.


Rethinking Table Recognition using Graph Neural Networks

Does anybody have any suggestion on table structure recognition, i.e. identifying rows, columns, and cell positions in the detected tables.


Guys, I was thinking maybe all of us should gather in a common place; I propose creating a discord server, or any other platform for the same purpose, and joining efforts to implement a document analysis algorithm together.

1 Like

Great idea, I somehow managed to build a working model based on the Rossum paper, although my model has only 600k params compared to 880k params as proposed in the paper ( not sure where gone wrong) , but at least I am still able to make prediction on my current dataset, I would love to help if able to. Thanks :blush:.


Complicated Table Structure Recognition

Dataset creation:


It will certainly help, if you could share your experience working on the Rossum paper in a blog. Is there a github link for the implementation ?. thanks.

Could you send me your contact information to ?