Any idea on how to implement deep learning model like rossum.ai ? I have tried object detections model on a few samples but the network does not seems to learn anything. I am trying to process invoice and extract a few key-valued pairs in it . Any idea or help is appreciated.
HI Neeb hope you are well!
I was building a classifier which required a lot of text to be recognized. I found it rather difficult just using the classification techniques from lessons 1 and 2.
I am working on something else at the moment but will come back to it at some point, however in my travels, I discovered a library that would be ideal for extracting the characters from the images.
Then you could build your model on the text extracted.
Have a jolly day.
Extracting Structured Information from Screenshots
Hi mrfabulous1, appreciated on your comments, I did build a template matching model using output from tesseract, but this method required a lot of rules and template pre-defined, which would not be a ideal case for me, since we have more than 100 different types of PDF, I am looking for a more generalised method to extract the text, but not sure the correct deep learning method to do this.
Hi Neeb, I’m working on a similar problem for a different type of document, and have been experimenting with the chargrid representation introduced by SAP Research last year. Here’s the paper (it’s been posted in a few other threads):
You may have seen that Rossum published a paper on the basics of their method, a different approach from SAP’s:
There may be heuristics, post processing, etc. that isn’t included in the papers but is important for making a production system work well. I think this type of domain-specific information extraction from documents is still an active area of research, and I don’t know of an open source tool that will do what you’re looking for end-to-end. I also don’t think there are open labeled datasets for invoice understanding, so training a DL model will involve some labeling drudgery.
That said, I’ve implemented part of the chargrid paper and have had some promising early results, so if you decide to work through it and have questions feel free to get in touch.
Hi @thebenedict, thanks for posting the link to both of the papers , I will start to look into them, infact I didn’t know that Rossum has published a paper, I am quite new to the CV community, can I know how you found the paper? Thanks for the help.
Thats a lot of Computer Science papers!
Arxiv Sanity Preserver
Hi, @thebenedict, I am trying to implement the network in table understanding in structured documents but I am stucked on the implementation part, anyone can explain to me about the convolution over sequence mentioned in the paper?
I am guessing they feed the network by sequence consist of wordboxes , but to do this we actually have to pad the input with certain size, but to do padding, we need masking for the network to ignore those pad value, and it seems like masking is not supported in conv2d which is presented in the paper. I am also guessing that ‘?’ in the network diagram is the sequence length?
Could anyone enlighten me on this?
I can’t help there, unfortunately. I’ve made some progress with chargrid-like document representations but I haven’t worked with the approach Rossum uses. If I come across anything that might be useful I’ll update here. Sorry I can’t offer more, good luck.
I have been working on a similar problem (document understanding and structured text extraction) and here are some resources/datasets I found useful:
- receipt data - unlabeled - has around ~8K receipts - https://expressexpense.com/view-receipts.php?page=1
- ICDAR earlier this year hosted a competition on document understanding - https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3 (labeled dataset and results from participants with some hints on their approaches are available). This competition was split into three tasks (text localization, OCR and key information extraction). You have to register to download the data and view the results.
I am implementing the chargrid paper, nice to talk with you. Have you achieved acceptable accuracy (>80%) with chargrid model?
How was your experience implementing chargrid paper ?. Can you share how the results are like.
I’ve implemented the chargrid document representation (or something like it), but I’m not using fastai to train models. My goal is to segment documents that look very different from invoices. That said, I have recall > 85% and precision >90% on my dataset, and here’s what a typical document looks like in case it helps:
I’m hoping to get back to fastai/PyTorch soon – it’ll be cheaper and more interesting than the commercial object detection I’m using now.
A few observations:
The chargrid representation appears to be only a little better than training the same model on raw document images, and in some cases it’s slightly worse. As a sanity check it’s worth trying to train a model on your source images in parallel with chargrid.
I tried RGB chargrids instead of grayscale, assigning each of the top 50 most common characters in my dataset a value between [0,0,0] and [255,255,255]. I thought this larger embedding space would help the model be more specific, but the results were significantly worse than with the grayscale style above. This was a surprise to me. They look pretty though:
Also worth checking out this undergraduate thesis by Timo Denk, a student supervised by Christian Reisswig, one of the Chargrid authors. It’s mainly about extending chargrid to a “wordgrid” based on BERT embeddings, but it goes into more implementation details about chargrid than the original paper:
BERTgrid paper based on that thesis:
Hope some of that’s useful.
Thank you very much for your response, great sharing information.
I have some concerns:
- Can you show the ground truth mask? In the chargrid paper they mentioned that chargrid outperforms image-only model on header items, those have small segmented area. For the items which cover big area, both models output very similar result.
- Do you use pretrained model and resnet backbone? or you replicated exactly the same model as described in chargrid paper.
And also I found a paper from a team in Vietnam which used “chargrid representation + Couple Unet + Self-attention + MultiStage” . Interesting to read but I haven’t implemented yet.