Aligning images of text


I have been trying to create models for analysing images of printed text. As a starting point I have been taking rotated images and training a model to rotate them back to the correct orientation for reading. I downloaded some images of normally aligned text and then applied rotations to create a training and validation set.

As a human this is simple: I can instantly see the pattern of rows of letters, with each row separated by whitespace. Yet I cannot get a CNN to converge at all. I have tried different architectures and different-sized CNNs with zero success.

I know there are other models that have been used to do this. However, can a simple CNN be configured to align text? If not, why not? And if it can, how do I configure it?


Could you show us some examples? How do you train your models? And do you need deep learning at all, or could your problem be solved by basic image processing (binarize -> find contour -> fit the contour to a rectangle -> find the vertices -> perspective transform)?

If the accuracy of basic image processing is not good enough, or you must use deep learning for this task: I remember seeing an example of license plate detection with deep learning, where the inputs are images containing a license plate and the outputs are the coordinates of its four vertices.
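A minimal sketch of that kind of vertex-regression network in PyTorch (my own toy architecture, not the one from the license-plate example; it just maps a grayscale image to 8 numbers, i.e. 4 (x, y) vertices):

```python
import torch
import torch.nn as nn

class VertexRegressor(nn.Module):
    """Tiny CNN that regresses the 4 corner coordinates of a plate/page."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),  # fixed 4x4 map, so any input size works
        )
        self.head = nn.Linear(32 * 4 * 4, 8)  # 4 vertices x (x, y)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))
```

Trained with an L2 loss against the true vertex coordinates, the predicted quad can then feed the same perspective transform as in the classical pipeline.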

I think this article uses a similar trick to the one mentioned above.


I downloaded 10K images from Google Images using a variety of search terms that found images of normally aligned text:
[“newspaper articles”, “magazine articles”, “newspapers”, “pdf text”, “text pages”, “text chapters”,
“text descriptions”, “typed documents”, “word documents”, “scientific papers”, “two column papers”,
“text wikipedia articles”, “text heavy websites”]
I then rotated them, using the rotated image as X and the angle as Y.
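That generation step can be sketched with NumPy/SciPy (a sketch, assuming same-sized grayscale arrays; `n_per_image` and `max_angle` are free parameters I made up):

```python
import numpy as np
from scipy.ndimage import rotate

def make_rotation_dataset(images, n_per_image=4, max_angle=45.0, seed=0):
    """Rotate each image by random angles; X = rotated image, y = angle in degrees."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for img in images:
        for _ in range(n_per_image):
            angle = rng.uniform(-max_angle, max_angle)
            # reshape=False keeps every sample the same shape as its input
            X.append(rotate(img, angle, reshape=False, order=1))
            y.append(angle)
    return np.stack(X), np.array(y)
```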

Yes, I have this working using image processing, but I wanted to see if I could do it with deep learning for fun! Also, an algorithm-based approach may miss things: for example, if there are handwritten notes or a corner is blacked out, the simple algorithm fails. I figure deep learning should handle such edge cases better.

Thanks for the link. Will read it in detail and see if it can be applied.


Sounds interesting, please share your experience if you make any progress.


Are you still interested in this problem? I found an interesting paper that tries to solve a similar one: "Recovering Homography from Camera Captured Documents using Convolutional Neural Networks". Maybe this can help you; I would like to try implementing the algorithm too.


Thanks, I will read it. I have been trying neural networks on various applications relating to images of text and have not yet found anything that works.

Re aligning text, I found a really neat and simple solution that doesn't need a neural network: try different rotations and maximise the variance of the row sums. When the text rows are horizontal, rows of dark text alternate with rows of whitespace, so the row sums swing between extremes; at any other angle each row mixes text and whitespace and the variance drops.
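That trick is only a few lines of NumPy/SciPy (a sketch; the candidate angle grid and its resolution are choices I made up, and a coarse-to-fine search would be faster):

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary_img, angles=np.arange(-10, 10.5, 0.5)):
    """Return the correction angle (degrees) maximising the variance of row sums.

    binary_img: 2-D array with text pixels ~1 and background ~0.
    """
    def score(angle):
        rotated = rotate(binary_img, angle, reshape=False, order=1)
        return np.var(rotated.sum(axis=1))
    return max(angles, key=score)
```

Rotating the image by the returned angle aligns the text rows horizontally.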

(marc) #7

I would try curriculum learning: tie the rotation to the training step linearly and make the angle progressively harder to guess. For example, at the start the angle is randomly picked between 0 and 30 degrees with a step of 10, and by the end of training the step is 0.1 instead.
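A sketch of that schedule (the 0–30 degree range and the 10 → 0.1 step endpoints are the ones suggested above; the linear interpolation and function signature are my own guesses):

```python
import numpy as np

def sample_angle(progress, rng, max_angle=30.0, start_step=10.0, end_step=0.1):
    """Sample a target rotation whose granularity tightens as training progresses.

    progress: float in [0, 1], e.g. current_step / total_steps.
    """
    # Linearly shrink the grid step from 10 degrees down to 0.1
    step = start_step + (end_step - start_step) * progress
    n_choices = int(max_angle / step) + 1
    return rng.integers(0, n_choices) * step
```

Early in training the label is one of {0, 10, 20, 30}, which a network can learn as a coarse classification-like task, before the grid tightens toward fine-grained regression.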