Does anyone have any tips or starting points for performing OCR with fastai v2?
My use case is the de-identification of DICOM medical images which have patients’ data burned into the images.
Cloud-based OCR options from AWS, Google, and Azure seem to perform very well; however, they require sending patient data to the cloud, which is not HIPAA compliant.
Therefore I’ve been tasked with creating a homemade OCR solution that can run entirely on a machine we own, and I thought a deep learning model built with fastai might be a good candidate.
It looks like I’d be able to make a classifier for identifying individual characters fairly easily. It could possibly be fine-tuned for my dataset (it’s all computerized text, no handwriting, in a particular set of fonts used in DICOM).
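For concreteness, here’s roughly what I imagine the classifier side could look like in fastai v2. The folder layout (one subfolder per character, cropped from DICOM overlays), image size, and epoch count are just placeholders I’m assuming for illustration:

```python
from fastai.vision.all import *

# Hypothetical layout: chars/A/*.png, chars/B/*.png, ... with glyphs cropped
# from the burned-in overlays.
path = Path('chars')

dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42,
    item_tfms=Resize(32),                      # overlay glyphs are small; 32px is a guess
    batch_tfms=aug_transforms(do_flip=False),  # never mirror characters
)

learn = cnn_learner(dls, resnet18, metrics=accuracy)
learn.fine_tune(3)
```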
I would think the other crucial thing to solve is how to extract the text regions and split them into individual characters (which are unknown in advance), which would then be passed to the classifier for identification.
Any nice way to use fastai to extract the text regions/individual characters?
I have tried it; it works pretty well when the data is consistent (all text the same size and font) but poorly in other circumstances. For instance, sometimes text is overlaid on irregular backgrounds (the contents of a CT scan or X-ray) and it doesn’t get good results. Strangely enough, the cloud offerings handle that well.
I noticed the other thing you linked, Textract; I will check that out.
Still, I want to give fastai a go for the learning experience. I found this similar thread interesting; it looks like I would need to use segmentation to extract the characters. Does anyone know how to do this? Does it require some ground-truth data, like in the CamVid example from the lessons?
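If segmentation is the way to go, I’m guessing the setup would look something like the CamVid example, except with my own ground-truth masks marking text pixels (1) versus background (0). Something like the following, where the paths, mask-naming convention, and codes are all assumptions on my part:

```python
from fastai.vision.all import *

path = Path('deid_data')
codes = ['background', 'text']   # per-pixel classes

def label_func(fn):
    # hypothetical convention: deid_data/masks/xyz.png pairs with deid_data/images/xyz.png
    return path/'masks'/fn.name

dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path/'images'),
    label_func=label_func,
    codes=codes,
    item_tfms=Resize(512),
)

learn = unet_learner(dls, resnet34)
learn.fine_tune(5)
```

As far as I can tell, that means I’d need labeled masks for the training images, just like in CamVid.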
OCR with fastai sounds like an interesting project. I don’t have much experience with it myself, but have you looked into Smart Engines? They offer OCR software that can be deployed on-premise and is HIPAA compliant. It might be worth checking out as an alternative to building a homemade solution from scratch. However, if you do decide to proceed with fastai, it seems like you’re on the right track with creating a character classifier. As for extracting text regions and splitting them into characters, you might want to look into using techniques like contour detection and bounding boxes to identify regions of text. Best of luck with your project!
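To make the contour idea concrete, here’s a minimal OpenCV sketch. The threshold choice and size filter are guesses, and as mentioned earlier in the thread, this kind of approach tends to work only on clean, uniform backgrounds rather than text overlaid on CT/X-ray content:

```python
import cv2

img = cv2.imread('frame.png', cv2.IMREAD_GRAYSCALE)

# Burned-in text is usually bright on dark; Otsu picks a threshold automatically.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# External contours of the thresholded blobs.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

char_crops = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if 5 < w < 50 and 8 < h < 50:   # crude size filter for character-sized blobs
        char_crops.append(img[y:y+h, x:x+w])

# char_crops can then be resized and fed to the character classifier.
```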
I haven’t worked with fastai for OCR specifically, but I’ve tackled a similar challenge of extracting text from images for identity verification. In my experience, advanced OCR technology like what ID Analyzer offers can be very effective. Their ID Verification API uses computer vision and AI to scan and accurately extract data from various identity documents, and it handles a wide range of document conditions and languages with high accuracy rates, such as 99.8% for English and 98.5% for complex scripts like Chinese. For your DICOM images, the key is probably developing a custom solution that can detect and extract text regions before feeding them into an OCR system. Given the sensitivity of medical images, a solution that ensures data privacy and compliance will be crucial.