OCR with fastai?

Hi all,

Does anyone have any tips or starting points for performing OCR with fastai v2?

My use case is de-identifying DICOM medical images that have patient data burned into the pixels.

Cloud-based OCR options from AWS, Google, and Azure seem to perform very well; however, they require sending patient data to the cloud, which is not HIPAA-compliant.

Therefore I’ve been tasked with creating a homemade OCR solution that can run on a machine we own, and I thought a deep learning model built with fastai might be a good candidate.

If anyone has pointers let me know :slight_smile:

From looking at a couple of articles,

Fastai Arabic Character Recognition
Devanagari Handwritten Classifier

It looks like I’d be able to make a classifier for identifying individual characters fairly easily. It could then be fine-tuned on my dataset (it’s all computer-rendered text, no handwriting, in the particular set of fonts used in DICOM).
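For the real thing I’d train a fastai `vision_learner` on a folder of per-character crops, but just to sketch the per-character classification idea without any dependencies: a nearest-template matcher over binarized glyph bitmaps. (The 3x3 "glyphs" below are invented for illustration, not from any real font.)

```python
# Toy per-character classifier: pick the template with the smallest
# Hamming distance to the input bitmap. A real system would instead
# fine-tune a fastai vision_learner on rendered font crops.

TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "O": ((1, 1, 1),
          (1, 0, 1),
          (1, 1, 1)),
}

def hamming(a, b):
    """Count mismatched pixels between two same-sized binary bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def classify(glyph):
    """Return the template label whose bitmap differs least from `glyph`."""
    return min(TEMPLATES, key=lambda label: hamming(TEMPLATES[label], glyph))

# A noisy "L" (one flipped pixel) still matches the L template.
noisy_L = ((1, 0, 0),
           (1, 1, 0),
           (1, 1, 1))
print(classify(noisy_L))  # → L
```

The same structure (crop → classify → concatenate labels) is what the fastai version would do, just with a learned model in place of the template lookup.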

I would think the other crucial thing to solve is how to extract the text regions and split them into individual characters (which are unknown), which would then be passed to the classifier for identification.

Any nice way to use fastai to extract the text regions/individual characters?
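As a classical baseline before reaching for deep learning: binarize the image, flood-fill connected components, and take their bounding boxes as candidate characters. A minimal pure-Python sketch (the bitmap below is made up):

```python
# Split a binary image into connected components and return one bounding
# box (top, left, bottom, right, inclusive) per component -- candidate
# character regions to feed into a classifier.

def component_boxes(img):
    h, w = len(img), len(img[0])
    seen = set()
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] and (y, x) not in seen:
                # Flood-fill this component (4-connectivity).
                stack = [(y, x)]
                seen.add((y, x))
                t = b = y
                l = r = x
                while stack:
                    cy, cx = stack.pop()
                    t, b = min(t, cy), max(b, cy)
                    l, r = min(l, cx), max(r, cx)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                boxes.append((t, l, b, r))
    # Left-to-right reading order.
    return sorted(boxes, key=lambda box: box[1])

img = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(component_boxes(img))  # → [(0, 0, 1, 1), (0, 4, 2, 4)]
```

This falls over when characters touch or fonts have disconnected strokes, which is where a learned segmentation model would earn its keep.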

Have you tried using a standard OCR library? If the text is not handwritten, you should get good results with e.g. Tesseract.

I built and use the following Docker image for text extraction / OCR.


Hi Florian,

I have tried it; it works pretty well when the data is consistent (all text the same size and font) but poorly otherwise. For instance, text is sometimes overlaid on irregular backgrounds (the contents of a CT scan or X-ray) and it doesn’t get good results there. Strangely enough, the cloud services handle that well.

I noticed the other thing you linked, Textract, I will check that out.

Still, I want to give fastai a go for the learning experience. I found this similar thread interesting; it looks like I would need to use segmentation to extract the characters. Does anyone know how to do this? Does it require ground-truth data, like in the CamVid example from the lessons?
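Partially answering my own question: yes, CamVid-style segmentation needs a per-pixel label mask for each training image. Since the burned-in text is rendered programmatically, the masks could be generated rather than hand-annotated. A sketch of mask generation from character bounding boxes (the box format and class codes are my own invention, not fastai’s):

```python
# Build a CamVid-style per-pixel label mask from annotated boxes:
# 0 = background, 1 = text. Boxes are (top, left, bottom, right),
# inclusive -- an assumed format for this sketch.

def boxes_to_mask(height, width, boxes):
    mask = [[0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        for y in range(top, bottom + 1):
            for x in range(left, right + 1):
                mask[y][x] = 1
    return mask

mask = boxes_to_mask(3, 5, [(0, 0, 1, 1), (2, 4, 2, 4)])
for row in mask:
    print(row)
# → [1, 1, 0, 0, 0]
#   [1, 1, 0, 0, 0]
#   [0, 0, 0, 0, 1]
```

Saved as images, pairs like this are exactly what fastai’s segmentation data loaders expect in the CamVid example.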