Questions about OCR - recognizing computer-generated text from digital posters

ytian22 · December 3, 2017, 6:33pm

I have a project to recognize text from images such as digital posters or ads. These kinds of images are originally created by computer with different fonts and sizes. I wonder whether there is annotated data available online for character segmentation or character recognition that is specific to this case.

Most of the information I found is about handwritten text like the MNIST database, license plate OCR, and printed text such as invoices. But text in digital posters has different fonts, colors, sizes and backgrounds. (Unfortunately it all depends on the merchants/stores that created those images, probably with Photoshop, Adobe Illustrator or other graphic design tools.)

I am currently working on character segmentation with OpenCV, but would like to see if I have any luck finding someone who is familiar with this particular case. Thanks in advance for your time! Any opinion or advice is greatly appreciated!

jeremy · December 3, 2017, 7:05pm

Multi-font OCR is really hard! Are you just trying to find/segment the text, or to actually read it?

helena · December 3, 2017, 7:07pm

one of the approaches i found promising is to use synthetic data, as suggested by the paper Reading Text in the Wild with Convolutional Neural Networks; you could adopt this approach to generate your own images (along with appropriate labels, bounding boxes or whatever) with free fonts
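
A minimal sketch of the synthetic-data idea, assuming Pillow is installed. It is not the pipeline from the paper: `make_sample` is a hypothetical helper, and Pillow's built-in font stands in for the free fonts you would normally load with `ImageFont.truetype`. The text is drawn on a blank mask first so the bounding box label can be read straight from the rendered pixels.

```python
import random

from PIL import Image, ImageDraw, ImageFont


def make_sample(text, size=(200, 64), font=None, seed=None):
    """Render `text` on a random solid background; return (image, label, bbox)."""
    rng = random.Random(seed)
    # Assumption: the built-in font is a placeholder for real free fonts.
    font = font or ImageFont.load_default()

    # Draw the text on an empty mask so the tight bounding box can be
    # taken from the non-zero pixels.
    mask = Image.new("L", size, 0)
    x = rng.randint(5, size[0] // 2)
    y = rng.randint(5, size[1] // 2)
    ImageDraw.Draw(mask).text((x, y), text, fill=255, font=font)
    bbox = mask.getbbox()  # (left, upper, right, lower), or None if empty

    # Random background and text colours; a fuller generator would also
    # vary fonts, sizes, textures and distortions.
    background = tuple(rng.randint(0, 255) for _ in range(3))
    foreground = tuple(rng.randint(0, 255) for _ in range(3))
    image = Image.new("RGB", size, background)
    image.paste(foreground, mask=mask)
    return image, text, bbox
```

Calling `make_sample("SALE", seed=0)` in a loop over a word list and a folder of fonts would give image/label/box triples without any manual annotation.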

init_27 · December 3, 2017, 7:31pm

Although I can’t speak strictly to the accuracy, I’ve tried working with OpenCV and gotten pretty decent results on a character reader across different fonts.
I used really basic algorithms like feature matching and image arithmetic to extract data.

Edit: OpenCV seemed to perform fine for OCR on college papers and for an ANPR model.

ytian22 · December 3, 2017, 7:34pm

Thanks for your response!

The ultimate objective is to get text from the image. One approach I have is using pytesseract directly on pre-processed images, but the result is not accurate and would need a lot of post-processing work.

My ideal solution is to get accurate character segmentation first (using either computer vision tools or deep learning) and then use a CNN to recognize each character. My initial thought is to normalize all segmented characters to the same height and width, adjust for skew, and then label them manually if necessary. But for the segmentation part, if I have to handle it with neural nets, the training data is a problem, as I’m not sure how much time I would spend generating my own annotations.
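
The size normalization step could look roughly like the sketch below, assuming each segmented character is already a 2-D NumPy array with background pixels equal to 0. `normalize_glyph` and the 32-pixel canvas are hypothetical choices for illustration, and skew correction is not covered here.

```python
import numpy as np


def normalize_glyph(crop, out_size=32):
    """Scale a 2-D character crop to fit an out_size x out_size canvas, centred."""
    crop = np.asarray(crop)
    h, w = crop.shape
    scale = out_size / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))

    # Nearest-neighbour resize via index lookup (no extra dependencies).
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = crop[rows][:, cols]

    # Centre on a zero (background) canvas so the aspect ratio is preserved.
    canvas = np.zeros((out_size, out_size), dtype=crop.dtype)
    top = (out_size - new_h) // 2
    left = (out_size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```

Padding instead of stretching keeps narrow characters such as "l" or "1" from being distorted into the same shape as wide ones.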

I also thought about using pytesseract directly and then focusing on post-processing the inaccurate text, but I haven’t done much research on RNNs/LSTMs.

ytian22 · December 3, 2017, 7:37pm

Thank you Helena for helping!

jeremy · December 3, 2017, 7:38pm

I think DL for segmentation should work fine - any standard segmentation approach should be able to find the characters. As @helena says, you’ll probably want to generate data for the actual OCR. There’s a great walkthru here BTW https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/

ytian22 · December 3, 2017, 7:42pm

Thank you for helping! Do you use OpenCV only for recognizing?

For segmentation I tried thresholding, but if one image contains text in different colors, OpenCV behaves weirdly. I found an OpenCV tutorial “OCR of Hand-written Data using kNN” and plan to try it tomorrow.

ytian22 · December 3, 2017, 7:48pm

I see, thank you very much! This is also what you recommended when I first asked about OCR here. I will read it again, as the first time I was brand new to this area and didn’t understand most of the content. Really appreciate your help!

init_27 · December 3, 2017, 7:49pm

The project was to detect licence plate numbers using a low-powered device (Raspberry Pi), and I used OpenCV to detect, localise and segment the characters.

For the colour issue I found that turning the image into a high-contrast B/W image and then applying a threshold was useful.
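
As a rough illustration of the grayscale-then-threshold step, here is a NumPy-only sketch. It uses Otsu's method to pick the threshold, which is a swapped-in choice and not necessarily what was used in the project above; the function names are hypothetical.

```python
import numpy as np


def to_gray(rgb):
    """Convert an H x W x 3 uint8 image to grayscale (ITU-R BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.round(rgb[..., :3] @ weights).astype(np.uint8)


def otsu_threshold(gray):
    """Return the threshold that maximises the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    cum_count = np.cumsum(hist)                  # pixels with value <= t
    cum_sum = np.cumsum(hist * np.arange(256))   # intensity mass <= t
    w0 = cum_count
    w1 = hist.sum() - cum_count
    valid = (w0 > 0) & (w1 > 0)
    mean0 = cum_sum / np.maximum(w0, 1)
    mean1 = (cum_sum[-1] - cum_sum) / np.maximum(w1, 1)
    between = np.where(valid, w0 * w1 * (mean0 - mean1) ** 2, 0.0)
    return int(np.argmax(between))


def binarize(rgb):
    """Return a boolean mask that is True for pixels brighter than the threshold."""
    gray = to_gray(rgb)
    return gray > otsu_threshold(gray)
```

With OpenCV available, `cv2.threshold` with the `cv2.THRESH_OTSU` flag does the same job; the version above only spells out the logic.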

Detection, segmentation and localisation: OpenCV.
To improve the recognition part: training a custom NN would be the ideal way to go.
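
For the segmentation step, one simple option is a vertical projection profile, sketched below with NumPy. This is an assumption about how such a step might be done, not the OpenCV code from the project: it only works for a single binarized text line whose characters do not touch, and `split_columns` is a hypothetical name.

```python
import numpy as np


def split_columns(binary):
    """Return (start, end) column ranges of runs that contain foreground pixels."""
    has_ink = np.asarray(binary).astype(bool).any(axis=0)
    # Pad with False so every run has a detectable rising and falling edge.
    padded = np.concatenate(([False], has_ink, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate between run starts and (exclusive) run ends.
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]
```

Each `(start, end)` pair can be used to slice one character out of the line image as `binary[:, start:end]`. Connected-component labelling is the usual alternative when characters overlap horizontally.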

ytian22 · December 3, 2017, 7:55pm

Got it, thanks! Yeah, in some images “opposite” colors are used for different text, which results in white-text-black-background and black-text-white-background at the same time. I literally had no words when I found this issue.

init_27 · December 4, 2017, 3:52am

In that case (assuming that words of the same colour are grouped together), masking a colour helps to distinguish the text, and thresholding afterwards worked for me.
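
A colour mask of that kind can be sketched in a few lines of NumPy. The helper name and the tolerance of 40 are assumptions for illustration; OpenCV's `cv2.inRange` is the more common tool for this.

```python
import numpy as np


def mask_colour(rgb, target, tolerance=40):
    """True where a pixel lies within `tolerance` (Euclidean RGB distance) of `target`."""
    diff = rgb[..., :3].astype(np.int32) - np.asarray(target, dtype=np.int32)
    return (diff ** 2).sum(axis=-1) <= tolerance ** 2
```

Running it once per text colour gives one binary mask per colour, and each mask already has text as foreground regardless of whether the original was light-on-dark or dark-on-light.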

init_27 · December 4, 2017, 3:48pm

@helena
I’ve tried using the PiCam that comes with the board, it worked well.
If you’re strictly concerned with distance detection, you could use the Ultrasonic distance detector module.

helena · December 4, 2017, 3:52pm

thank you! sorry, i was trying to edit my question and got it deleted - i’m basically just starting this project - and my camera knowledge is mostly based on what i remember from the Udacity self-driving car Slack discussions

adonese · December 6, 2017, 7:12pm

Have you tried Tesseract with LSTM? It is still in beta, though it should give you better results than the standard one. I have not checked their code/implementation, but even the standard version worked well for me.
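
If you go through pytesseract, selecting the LSTM engine might look like the sketch below. It assumes Tesseract 4.x plus the `pytesseract` and Pillow packages are installed; `--oem 1` selects the LSTM engine and `--psm 6` treats the image as a single uniform block of text. The helper names are hypothetical.

```python
def tesseract_config(oem=1, psm=6):
    """Build a Tesseract config string (--oem 1 selects the LSTM engine)."""
    return f"--oem {oem} --psm {psm}"


def read_text(image_path, lang="eng"):
    """Run Tesseract on one image file and return the recognised text."""
    # Imported lazily so the helper above works without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=tesseract_config()
        )
```

For posters with scattered text, other page segmentation modes (for example `--psm 11`, sparse text) may be worth trying.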

ytian22 · December 6, 2017, 10:28pm

This is really helpful! I will test the effect of this version. Thanks very much!

vineets · April 10, 2018, 6:53pm

Hello
Were you able to get good results with any of the methods? I have a use case to detect text in screenshot images.
I would appreciate it if you could let me know.
Thanks
Vineet

ytian22 · June 5, 2018, 5:57am

Hey, sorry for the late reply. I found that pytesseract (free) or the Google Cloud Vision API performs very well if you don’t have enough resources to build an end-to-end OCR pipeline. Some post-processing will be needed though. Hope this helps.
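
As one example of what that post-processing could be, the sketch below snaps each OCR token to the closest word in a known vocabulary using the standard library's `difflib`. It assumes you have such a vocabulary (brand names, product terms and so on); `correct_tokens` and the 0.6 cutoff are illustrative choices, not what was used in the project.

```python
import difflib


def correct_tokens(ocr_text, vocabulary, cutoff=0.6):
    """Replace each token with its closest vocabulary word, if one is close enough."""
    lookup = {word.lower(): word for word in vocabulary}
    corrected = []
    for token in ocr_text.split():
        match = difflib.get_close_matches(
            token.lower(), list(lookup), n=1, cutoff=cutoff
        )
        # Keep the original token when nothing in the vocabulary is similar.
        corrected.append(lookup[match[0]] if match else token)
    return " ".join(corrected)
```

For instance, `correct_tokens("SA1E 0FF today", ["SALE", "OFF"])` returns `"SALE OFF today"`: the two misread tokens are fixed and the unknown word is left alone.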

clipmaker · June 8, 2018, 8:31am

You should look at ocropus. I don’t know if it’s still going on, but it made interesting progress and had a good high-level approach at the time.