Looking to create OCR for an Arabic-script language. Don't know where to start

naveed · May 13, 2020, 5:04pm

So I am looking to create an OCR system for Kashmiri, a language written in Arabic script. How do I get started, and how should I proceed? I tried tesseract-ocr, but it is not giving satisfactory results, so I was thinking of maybe building one from scratch. Any help will be appreciated.

pstroe · May 15, 2020, 7:48am

hi naveed,

i’d suggest you try other ocr software before you build your own system. the challenge would be interesting for sure, but i’d go with established systems first to see what they are capable of. you mentioned tesseract. having experimented with it a bit myself, i found it cumbersome to work with: with the version i used, i had to produce several different input formats before training could start, which i found overly complicated.

i’d suggest you take a look at kraken (http://kraken.re/). it’s open-source and very easy to use. you just need matching pairs of line images and their transcriptions, and you’re good to go. the nice thing is that it is written in python and uses torch, so you can also look at the code and try to improve it, should you decide to build your own system. either way, it would certainly be easier to have kraken as a starting point.
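
before training, it can help to check that every line image really has a transcription next to it. here is a minimal sketch of such a check in plain python. it does not call kraken at all, and the naming scheme (line_001.png next to line_001.gt.txt) and the helper names match_pairs and scan_dir are assumptions for illustration, so check the kraken docs for the format your version expects.

```python
from pathlib import Path

# Assumed conventions for this sketch; adjust to whatever your tooling expects.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
GT_SUFFIX = ".gt.txt"  # hypothetical suffix for the transcription of a line image


def match_pairs(names):
    """Split file names into (image, transcription) pairs and unmatched leftovers."""
    images = {}
    texts = {}
    for name in names:
        lower = name.lower()
        if lower.endswith(GT_SUFFIX):
            # "line_001.gt.txt" -> key "line_001"
            texts[name[: -len(GT_SUFFIX)]] = name
        elif Path(lower).suffix in IMAGE_SUFFIXES:
            # "line_001.png" -> key "line_001"
            images[name[: -len(Path(name).suffix)]] = name
        # anything else (notes, READMEs, ...) is ignored
    paired = sorted((images[k], texts[k]) for k in images.keys() & texts.keys())
    orphans = sorted(
        [images[k] for k in images.keys() - texts.keys()]
        + [texts[k] for k in texts.keys() - images.keys()]
    )
    return paired, orphans


def scan_dir(directory):
    """Apply match_pairs to the regular files in one directory."""
    return match_pairs(p.name for p in Path(directory).iterdir() if p.is_file())
```

calling scan_dir on your data folder gives you the list of usable pairs plus any images or transcriptions that are missing their counterpart, so you can fix those before you start training.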

hope this helps.

best, phillip

naveed · May 15, 2020, 9:59am

having a look around. will update you accordingly.