NLP in other languages (ancient books)-Week 4

The problem I would like to tackle is not in English. In the slides, Jeremy explained starting from wiki data and then moving to the more specific IMDb data.
The problem is a combination of image recognition and NLP.
In any language there are old books in which some words are unclear, and an expert reads the text and fills in the gaps. I have access to some of those files in Persian, both before and after editing.
So what I think could be a great project is creating a model that takes a picture of such a text, converts the picture to text, and then uses NLP to predict and fill in the unclear words.
I am a little unclear about which deep learning techniques are useful in the broader picture. Do I need to use deep learning OCR? Can I do this with the fastai library, or do I need to learn another library?

Can I use Tesseract OCR as the base for a Hugging Face NLP model?

Is the whole wiki dataset Jeremy talks about just English, or does it contain other languages too?
I could find a smaller Persian dataset to use after the wiki data, and for the last part I have the smaller dataset, which I still need to label.

I’m not an expert, but I know that this is an active topic of research and directly using OCR models trained on English will likely not work.

For example, see this recent thesis.

This video might give a nice overview of the topic, and the linked paper might point you to some interesting reading.

As a first attempt, using a model for Urdu (like this) might help as you are targeting Persian.
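To make the OCR-then-language-model idea from the question more concrete, here is a minimal sketch, not a tested solution for old Persian books. It assumes Tesseract with its Persian (`fas`) language data is installed, uses `pytesseract` word confidences to flag unclear words, and uses a Hugging Face `fill-mask` pipeline to guess them. The `[?]` marker, the confidence threshold of 60, the helper names, and the choice of `bert-base-multilingual-cased` are all my own assumptions for illustration.

```python
from typing import Callable, Sequence

UNCLEAR = "[?]"   # hypothetical marker for a word the OCR step was unsure about
MASK = "[MASK]"   # generic placeholder, mapped to the model's own mask token later


def mark_unclear(words: Sequence[str], confidences: Sequence, threshold: float = 60.0) -> str:
    """Join OCR words into a line, replacing low-confidence words with UNCLEAR."""
    kept = []
    for word, conf in zip(words, confidences):
        if not word.strip():
            continue  # Tesseract reports empty strings for layout boxes
        kept.append(word if float(conf) >= threshold else UNCLEAR)
    return " ".join(kept)


def fill_unclear_words(text: str, predict: Callable[[str], str]) -> str:
    """Fill UNCLEAR markers left to right, masking one gap per model call."""
    parts = text.split(UNCLEAR)
    filled = parts[0]
    for i, tail in enumerate(parts[1:], start=1):
        # Later gaps stay as UNCLEAR in the context; only the current one is masked.
        right_context = UNCLEAR.join(parts[i:])
        filled += predict(filled + MASK + right_context) + tail
    return filled


def ocr_page(image_path: str, lang: str = "fas") -> str:
    """Run Tesseract on an image (assumes the 'fas' language data is installed)."""
    import pytesseract            # imported lazily so the helpers above work without it
    from PIL import Image
    data = pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )
    return mark_unclear(data["text"], data["conf"])


def make_predictor(model_name: str = "bert-base-multilingual-cased") -> Callable[[str], str]:
    """Build a predictor from a fill-mask pipeline (model choice is an assumption)."""
    from transformers import pipeline
    fill = pipeline("fill-mask", model=model_name)

    def predict(masked_text: str) -> str:
        text = masked_text.replace(MASK, fill.tokenizer.mask_token)
        return fill(text)[0]["token_str"].strip()  # top-scoring candidate

    return predict
```

Usage would look like `fill_unclear_words(ocr_page("page.png"), make_predictor())`, where `page.png` is a placeholder file name. Note that a fill-mask model predicts a single token, which may be only part of a word, and that a general-purpose model will probably need fine-tuning on the before/after expert edits to be useful on old texts.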


@bahman_apl you can download Wikipedia in different languages by following the instructions here:

You may find this thread useful as well, since others have trained a Persian model previously: Language Model Zoo 🦍
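Separately from the linked instructions (which are not reproduced here), a rough sketch of fetching a non-English Wikipedia dump directly: Wikimedia publishes per-language dumps under `dumps.wikimedia.org`, and Persian uses the language code `fa`. The function names below are my own, and the download is large, so treat this as an illustration rather than the course's method.

```python
import urllib.request
from typing import Optional


def wiki_dump_url(lang: str) -> str:
    """Build the URL of the latest article dump for a Wikipedia language code."""
    name = f"{lang}wiki-latest-pages-articles.xml.bz2"
    return f"https://dumps.wikimedia.org/{lang}wiki/latest/{name}"


def download_wiki_dump(lang: str, dest: Optional[str] = None) -> str:
    """Download the dump to dest (defaults to the dump's file name) and return the path."""
    dest = dest or f"{lang}wiki-latest-pages-articles.xml.bz2"
    urllib.request.urlretrieve(wiki_dump_url(lang), dest)
    return dest
```

For Persian this would be `download_wiki_dump("fa")`. The result is compressed XML containing wiki markup, so it still needs to be extracted to plain text before it can be used to train a language model.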
