Ideas on document processing using deep learning

#1

Any ideas on how to implement a deep learning model like rossum.ai? I have tried object detection models on a few samples, but the network does not seem to learn anything. I am trying to process invoices and extract a few key-value pairs from them. Any ideas or help are appreciated.

1 Like

#2

Hi Neeb, hope you are well!

I was building a classifier which required a lot of text to be recognized. I found it rather difficult just using the classification techniques from lessons 1 and 2.

I am working on something else at the moment but will come back to it at some point. However, in my travels I discovered a library that would be ideal for extracting the characters from the images.

Then you could build your model on the text extracted.


Have a jolly day.
mrfabulous1 :smiley::smiley:

1 Like

#3

Hi mrfabulous1, I appreciate your comments. I did build a template matching model using the output from tesseract, but this method requires a lot of pre-defined rules and templates, which is not ideal for me since we have more than 100 different types of PDF. I am looking for a more generalised method to extract the text, but I am not sure which deep learning method is the right one for this.

0 Likes

(Michael Benedict) #4

Hi Neeb, I’m working on a similar problem for a different type of document, and have been experimenting with the chargrid representation introduced by SAP Research last year. Here’s the paper (it’s been posted in a few other threads):

You may have seen that Rossum published a paper on the basics of their method, a different approach from SAP’s:

There may be heuristics, post-processing, etc. that aren’t included in the papers but are important for making a production system work well. I think this type of domain-specific information extraction from documents is still an active area of research, and I don’t know of an open source tool that will do what you’re looking for end-to-end. I also don’t think there are open labeled datasets for invoice understanding, so training a DL model will involve some labeling drudgery.

That said, I’ve implemented part of the chargrid paper and have had some promising early results, so if you decide to work through it and have questions feel free to get in touch.
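For readers new to the idea, here is a minimal sketch of a chargrid-style representation: every pixel (or grid cell) covered by a character's bounding box gets that character's integer index, and everything else stays 0 for background. This is an illustration only, not the code from the paper or from this post. The `CharBox` record, the `build_vocab` and `build_chargrid` helpers, and the `cell` downsampling parameter are all hypothetical names, and the boxes are assumed to come from whatever OCR step you already have.

```python
from collections import namedtuple

# Hypothetical OCR output record: one character plus its pixel bounding box
# (x0, y0 inclusive; x1, y1 exclusive). Real OCR tools use their own formats.
CharBox = namedtuple("CharBox", "char x0 y0 x1 y1")


def build_vocab(boxes):
    """Map each distinct character to a positive index; 0 is kept for background."""
    chars = sorted({b.char for b in boxes})
    return {c: i + 1 for i, c in enumerate(chars)}


def build_chargrid(boxes, width, height, vocab, cell=1):
    """Return a (height // cell) x (width // cell) grid of character indices."""
    rows, cols = height // cell, width // cell
    grid = [[0] * cols for _ in range(rows)]
    for b in boxes:
        idx = vocab.get(b.char, 0)  # unknown characters fall back to background
        # -(-a // cell) is ceiling division, so partially covered cells are filled.
        for r in range(max(0, b.y0 // cell), min(rows, -(-b.y1 // cell))):
            for c in range(max(0, b.x0 // cell), min(cols, -(-b.x1 // cell))):
                grid[r][c] = idx
    return grid


# Tiny made-up page: two 2x2 characters side by side on a 6x3 canvas.
boxes = [CharBox("A", 0, 0, 2, 2), CharBox("B", 2, 0, 4, 2)]
vocab = build_vocab(boxes)
grid = build_chargrid(boxes, width=6, height=3, vocab=vocab)
for row in grid:
    print(row)
```

The resulting grid can be saved as a single-channel image or one-hot encoded per character before being fed to a segmentation model; which of those works better is something to test on your own data.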

3 Likes

#5

Hi @thebenedict, thanks for posting the links to both papers. I will start looking into them. In fact, I didn’t know that Rossum had published a paper. I am quite new to the CV community, so may I ask how you found it? Thanks for the help. :smiley:

0 Likes

(Michael Benedict) #6

@Neeb I find most papers in this space on arxiv.org, and Andrej Karpathy’s tool http://arxiv-sanity.com/ is useful for efficient searching.

1 Like

#7

Wow, thebenedict!
That’s a lot of Computer Science papers!

Arxiv Sanity Preserver

Built in spare time by @karpathy to accelerate research.
Serving last 84779 papers from cs.[CV|CL|LG|AI|NE]/stat.ML

mrfabulous1 :smiley::smiley:

0 Likes

#8

Hi @thebenedict, I am trying to implement the network from the table understanding in structured documents paper, but I am stuck on the implementation. Can anyone explain the convolution over sequence mentioned in the paper?

I am guessing they feed the network a sequence of wordboxes. To do this we have to pad the input to a fixed size, and with padding we need masking so that the network ignores the pad values, but masking does not seem to be supported by conv2d, which is what the paper uses. I am also guessing that the ‘?’ in the network diagram is the sequence length?
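To make the padding and masking concern concrete, here is a minimal sketch of one common workaround: zero out padded positions before the convolution and zero them again afterwards, so that neither the pad values nor the bias leak into the result. This is an assumption about how one could handle it, not a description of what the paper does. It uses a plain-Python 1D convolution over scalar features to stay self-contained, and `pad_sequences` and `masked_conv1d` are hypothetical helper names. In a deep learning framework the same idea would be an elementwise multiply of the tensor by the mask around the convolution layer.

```python
def pad_sequences(seqs, pad_value=0.0):
    """Pad variable-length sequences to a common length and build 0/1 masks."""
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in seqs]
    return padded, mask


def masked_conv1d(seq, mask, kernel, bias=0.0):
    """'Same'-size 1D cross-correlation that ignores padded positions."""
    half = len(kernel) // 2
    x = [v * m for v, m in zip(seq, mask)]  # zero the inputs at padded positions
    out = []
    for i in range(len(x)):
        acc = bias
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):  # implicit zero padding at the borders
                acc += w * x[j]
        out.append(acc * mask[i])  # zero the outputs at padded positions too
    return out


# Two made-up sequences of per-word features with different lengths.
padded, mask = pad_sequences([[1.0, 2.0, 3.0], [4.0]])
out = [masked_conv1d(s, m, kernel=[1.0, 1.0, 1.0], bias=0.5)
       for s, m in zip(padded, mask)]
print(out)  # [[3.5, 6.5, 5.5], [4.5, 0.0, 0.0]]
```

Real positions next to the padding still see zeros in their receptive field, exactly as they would at the edge of an unpadded sequence, so the mask mostly matters for the loss and for any pooling applied after the convolution.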

Could anyone enlighten me on this?

Thank you.

0 Likes

(Michael Benedict) #9

Hi Neeb,

I can’t help there, unfortunately. I’ve made some progress with chargrid-like document representations but I haven’t worked with the approach Rossum uses. If I come across anything that might be useful I’ll update here. Sorry I can’t offer more, good luck.

–Michael

1 Like

(Abi Komma) #10

I have been working on a similar problem (document understanding and structured text extraction) and here are some resources/datasets I found useful:

Hope these resources help. @thebenedict @Neeb

4 Likes

(Phuc Ng. Su) #11

Hi @thebenedict,
I am implementing the chargrid paper, so it is nice to talk with you. Have you achieved acceptable accuracy (>80%) with the chargrid model?

2 Likes

(hari rajeev) #12

How was your experience implementing the chargrid paper? Can you share what the results are like?

0 Likes

(Michael Benedict) #13

Hi @harikrishnanrajeev @phucnsp,

I’ve implemented the chargrid document representation (or something like it), but I’m not using fastai to train models. My goal is to segment documents that look very different from invoices. That said, I have recall > 85% and precision >90% on my dataset, and here’s what a typical document looks like in case it helps:

I’m hoping to get back to fastai/PyTorch soon – it’ll be cheaper and more interesting than the commercial object detection I’m using now.

A few observations:

  1. The chargrid representation appears to be only a little better than training the same model on raw document images, and in some cases it’s slightly worse. As a sanity check it’s worth trying to train a model on your source images in parallel with chargrid.

  2. I tried RGB chargrids instead of grayscale, assigning each of the top 50 most common characters in my dataset a value between [0,0,0] and [255,255,255]. I thought this larger embedding space would help the model be more specific, but the results were significantly worse than with the grayscale style above. This was a surprise to me. They look pretty though:

    rgb_chargrid_sample
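As an illustration of the RGB idea in observation 2, here is a minimal sketch that assigns each of the most common characters a colour spread evenly between [0,0,0] and [255,255,255]. The exact assignment used in the post is not stated, so the even spacing over the 24-bit colour range is an assumption, and `rgb_char_map` is a hypothetical helper name.

```python
from collections import Counter


def rgb_char_map(text, top_k=50):
    """Map the top_k most common non-space characters to evenly spaced RGB triples.

    Background and any character outside the top_k are left to the caller,
    for example as (0, 0, 0).
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    common = [ch for ch, _ in counts.most_common(top_k)]
    if not common:
        return {}
    step = 0xFFFFFF // len(common)  # spacing across the 24-bit colour range
    mapping = {}
    for i, ch in enumerate(common, start=1):
        code = i * step
        mapping[ch] = ((code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF)
    return mapping


# Made-up sample text; in practice this would be all OCR text in the dataset.
mapping = rgb_char_map("aaabbc", top_k=3)
print(mapping)
```

With this scheme, characters that are adjacent in frequency rank get colours that differ mostly in the high byte, so the model sees no meaningful notion of similarity between neighbouring colours. That might be one reason an RGB encoding does not automatically help, but that is speculation rather than something the post confirms.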

Also worth checking out this undergraduate thesis by Timo Denk, a student supervised by Christian Reisswig, one of the Chargrid authors. It’s mainly about extending chargrid to a “wordgrid” based on BERT embeddings, but it goes into more implementation details about chargrid than the original paper:

https://www.researchgate.net/publication/335715433_Wordgrid_Extending_Chargrid_with_Word-level_Information

BERTgrid paper based on that thesis:

Hope some of that’s useful.

–Michael

4 Likes

(Phuc Ng. Su) #14

Hi @thebenedict,
Thank you very much for your response and for sharing this information.
I have a few questions:

  • Can you show the ground truth mask? The chargrid paper mentions that chargrid outperforms the image-only model on header items, which have small segmented areas. For items that cover a large area, both models produce very similar results.
  • Do you use a pretrained model and a ResNet backbone, or did you replicate exactly the model described in the chargrid paper?

I also found a paper from a team in Vietnam that used “chargrid representation + Couple Unet + Self-attention + MultiStage”. It is interesting to read, but I haven’t implemented it yet.

1 Like