Exploiting Language Models for OCR

To tell the truth, I was blown away in Lesson 4 watching how transfer learning can work in NLP. I am working on making OCR work for a project. The best mechanism turns out to be a CNN-LSTM: take the feature-vector output of a CNN, pass it to an LSTM network, apply CTC loss, and you have the output. My question is: can we apply transfer learning to use a CNN-LanguageModel for OCR?
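For context, the pipeline described above can be sketched in a few lines of PyTorch. This is a minimal, illustrative CRNN (all names and sizes are made up, not from any specific model): a CNN backbone collapses the image height so that width becomes the time axis, an LSTM runs over those feature columns, and a linear head produces per-step character logits for CTC loss.

```python
# Hypothetical minimal CRNN: CNN features -> LSTM -> per-timestep character
# logits, trained with CTC loss. Shapes and sizes are purely illustrative.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, n_chars=27, hidden=64):
        super().__init__()
        # CNN backbone: collapse height to 1 so that width plays the role of time
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),              # halve height only
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # height -> 1, width kept
        )
        self.rnn = nn.LSTM(64, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_chars + 1)  # +1 for the CTC blank

    def forward(self, x):                      # x: (B, 1, H, W)
        f = self.cnn(x)                        # (B, 64, 1, W)
        f = f.squeeze(2).permute(0, 2, 1)      # (B, W, 64): width as time steps
        out, _ = self.rnn(f)                   # (B, W, 2*hidden)
        return self.head(out)                  # (B, W, n_chars+1) logits

model = TinyCRNN()
imgs = torch.randn(2, 1, 32, 100)              # batch of 2 fake line images
logits = model(imgs)                           # (2, 100, 28)

# CTC loss expects (T, B, C) log-probs plus targets and length tensors
log_probs = logits.permute(1, 0, 2).log_softmax(-1)
targets = torch.randint(1, 28, (2, 10))        # labels exclude the blank (0)
loss = nn.CTCLoss(blank=0)(
    log_probs, targets,
    input_lengths=torch.full((2,), 100, dtype=torch.long),
    target_lengths=torch.full((2,), 10, dtype=torch.long),
)
```

The question in the post then amounts to: can the `self.rnn` part be replaced by a pretrained language model and merely fine-tuned?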


I have been thinking about this for some time, but I haven't got the data to play with it.

The ideal would be to use an end-to-end model, but then you would have to generate the images so you know what is written on them (Dropbox used this approach); in that case you don't need language modelling that much.

The other option is to use OCR as a text generator, train ULMFiT on that text and possibly other sources of similar texts (like documents that weren't scanned), and then train a classifier or NER on top of that.

Thanks for the reply. Sorry, perhaps I failed to explain my question well. I want the best of the two worlds that you describe. I have read what Dropbox did. First they applied MSER to extract bounding boxes (this works well for Latin scripts, but requires very brittle fine-tuning for other scripts like Arabic, Chinese, or Devanagari). We can use SSD or YOLOv3 for this part.

For the next part, as you said,

This is where I want to check whether pretrained language models can shine. E.g. if I am doing Russian OCR, I can feed an image containing Cyrillic text to a CNN, which produces an activation map; I flatten it and send it to a language model (instead of LSTM or GRU cells, which I would have to train heavily, whereas here it might be a matter of fine-tuning), and it produces the text. This, at least on paper, seems less computationally expensive.
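The "flatten and send to a language model" step above could look something like the sketch below. Everything here is hypothetical: a plain `nn.LSTM` stands in for the pretrained LM's recurrent core, and a simple linear "bridge" projects the flattened CNN activation map into the LM's initial hidden state, so that only the bridge (and a light fine-tune of the LM) needs training.

```python
# Sketch of conditioning a (stand-in) pretrained char-level LM on CNN features.
# All module names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn

n_chars, lm_hidden = 100, 256          # assumed vocab / hidden sizes

# stand-in for a pretrained LM's recurrent core (e.g. an AWD-LSTM-like encoder)
lm_embed = nn.Embedding(n_chars, 128)
lm_rnn = nn.LSTM(128, lm_hidden, batch_first=True)
lm_head = nn.Linear(lm_hidden, n_chars)

# bridge: flatten the CNN activation map and project it to the LM's state size
cnn_feat = torch.randn(2, 64, 4, 25)   # fake (B, C, H, W) activation map
bridge = nn.Linear(64 * 4 * 25, lm_hidden)
h0 = bridge(cnn_feat.flatten(1)).unsqueeze(0)   # (1, B, lm_hidden)
c0 = torch.zeros_like(h0)

# decode: feed previous characters, conditioned on the image via (h0, c0)
prev_chars = torch.randint(0, n_chars, (2, 12))
out, _ = lm_rnn(lm_embed(prev_chars), (h0, c0))
logits = lm_head(out)                  # (B, 12, n_chars) next-char scores
```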

Now what concerns me is how to use differential layer groups (the principle that the first layers need less fine-tuning than the later layers; I always forget the new name Jeremy gave to this process) when applying transfer learning to this. If this works, it can be used in things other than OCR, like visual question answering; it would be like a standard model for tasks where the input is an image and the output is text.
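For reference, the ULMFiT paper calls this "discriminative fine-tuning", and fastai exposes it as discriminative learning rates. In plain PyTorch the same idea is just optimizer parameter groups with different learning rates per layer group; the split points below are illustrative.

```python
# Discriminative learning rates via plain PyTorch parameter groups:
# earlier (pretrained) layers get a smaller lr than the task head.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 20),   # "early" layer: pretrained, fine-tune gently
    nn.ReLU(),
    nn.Linear(20, 5),    # "late" layer: task head, larger learning rate
)

opt = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-5},  # early group: small lr
    {"params": model[2].parameters(), "lr": 1e-3},  # late group: large lr
])

lrs = [g["lr"] for g in opt.param_groups]           # [1e-05, 0.001]
```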

So, starting from a more or less standard CRNN (CNN layers + LSTM top) OCR model, what I did was to train a character-based language model separately and, during the beam search, combine the scores from both (and you can weight those). I like probabilities, so I converted both to log probs and used those; then you can interpret the char LM as a prior (I think I saw a blog post/article on spelling correction that did something similar).
For me, that improved recognition results.
An alternative might be to get word candidates and use a word-level LM.
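The score fusion described above can be sketched with a toy beam search. This is a deliberate simplification (it ignores CTC blank handling and prefix merging): at each step, each beam extension adds the recognizer's log-prob plus a weighted char-LM log-prob. The LM here is hard-coded for illustration; `alpha` is the fusion weight mentioned in the post.

```python
# Toy fusion of recognizer scores and a char-LM score during beam search.
# Simplified: no CTC blanks, hard-coded toy LM, illustrative alpha weight.
import math

def beam_search(step_logprobs, lm_logprob, alpha=0.5, beam=3):
    """step_logprobs: list of {char: log p(char | image, t)} per time step.
    lm_logprob(prefix, char) -> log p(char | prefix) from the char LM."""
    beams = {"": 0.0}                       # prefix -> combined log score
    for dist in step_logprobs:
        scored = {}
        for prefix, score in beams.items():
            for ch, lp in dist.items():
                s = score + lp + alpha * lm_logprob(prefix, ch)
                scored[prefix + ch] = max(scored.get(prefix + ch, -math.inf), s)
        # keep only the best `beam` hypotheses
        beams = dict(sorted(scored.items(), key=lambda kv: -kv[1])[:beam])
    return max(beams.items(), key=lambda kv: kv[1])

# toy LM that strongly prefers "h" after "t" (as in "th")
def toy_lm(prefix, ch):
    if prefix.endswith("t") and ch == "h":
        return math.log(0.9)
    return math.log(0.1)

# recognizer is confident about "t", undecided between "h" and "n";
# the LM prior breaks the tie in favour of "th"
steps = [{"t": math.log(0.9), "f": math.log(0.1)},
         {"h": math.log(0.5), "n": math.log(0.5)}]
best, score = beam_search(steps, toy_lm)    # best == "th"
```

The point of working in log probabilities, as the post says, is exactly that the combination becomes a weighted sum, and `alpha` controls how strongly the LM prior pulls on the recognizer.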

Best regards



Hey, would you be able to share how you’re able to get the AWD-LSTM to accept the CNN’s feature vector as input?


I failed, miserably. The reason is the CTC loss: it requires a particular format in which one must divide the average-pooled output in a particular fashion, much like in GroupNorm. Moreover, AWD-LSTM lacks a forward method with a place for passing in a hidden-state vector (which would have been the output from the CNN):
def forward(self, input:Tensor, from_embeddings:bool=False)->Tuple[Tensor,Tensor]
QRNN has a different forward, though: def forward(self, inp, hid=None). But as far as I have read, QRNN requires CUDA, so I guess I can't run the production model on CPU. That being said, an encoder-decoder architecture on AWD-LSTM would really be helpful. Wasn't the part 2 v2 translation lesson based on AWD-LSTM?
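One CPU-friendly workaround for the missing hidden-state argument is to wrap the recurrent core in a small module whose forward mirrors the QRNN-style signature `forward(inp, hid=None)`. The sketch below is hypothetical: plain `nn.LSTM` stands in for the pretrained encoder, and the encoder state would come from the CNN.

```python
# Hypothetical decoder wrapper whose forward accepts an external hidden
# state, mimicking QRNN's forward(inp, hid=None) but running on CPU.
import torch
import torch.nn as nn

class DecoderWithInit(nn.Module):
    def __init__(self, vocab=50, emb=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, inp, hid=None):
        # hid is the externally supplied (h0, c0), e.g. projected CNN output
        out, hid = self.rnn(self.embed(inp), hid)
        return self.head(out), hid

dec = DecoderWithInit()
enc_state = (torch.randn(1, 2, 64), torch.randn(1, 2, 64))  # fake CNN-derived state
tokens = torch.randint(0, 50, (2, 7))
logits, _ = dec(tokens, enc_state)     # (2, 7, 50) per-step vocab scores
```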