Here is an interesting problem to solve:
It is a Facebook post from Dr. Thomas Isaac, Finance Minister of Kerala, a small state in India. The post is in a language called Malayalam. Please use a translation tool to read the full post, but here is the summary.
In Kerala, every family has a card called a Ration Card. Using this card, one can get subsidised food items from shops called Ration Shops. There are different kinds of cards based on the income of the ration card owner and family. (see here - http://english.manoramaonline.com/news/kerala/2017/05/30/colour-coded-ration-cards-kerala.html)
For now, assume that there are 2 kinds of ration cards:
- Card for low income families
- Card for general category
The first group of card holders gets more subsidy than the second. Many people under-reported their income in order to get low income ration cards, which puts a big financial burden on the govt. The govt wants to remove such families from the low income ration card holders' list. They decided to remove all families who have a car or a building registered in their name, and to do so by checking the address of each ration card holder against the addresses given in building registration and vehicle registration.
- 5 million people have low income ration cards.
- More than 9 million addresses in building registration.
- More than 4 million addresses in vehicle registration.
So 5 million addresses have to be compared against more than 13 million addresses. Currently, one person is doing this search manually. Her name is Aju Saigal, and you can see her photo in the post. She is a computer engineer, so I am sure she is using some kind of algorithm to compare the addresses.
But here are the problems:
- Ration card addresses are in Malayalam.
- Building and Vehicle registration addresses are in English.
- Since the names of houses, streets and places are actually in Malayalam, when converted to English, people tend to use different spellings for same place names.
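To make the spelling problem concrete, here is a minimal sketch of the kind of baseline I have in mind, using only the Python standard library. It assumes the Malayalam side has already been transliterated into Latin script by some other tool. The substitution rules in `VARIANT_RULES` are my own guesses at common romanisation variants, not a verified list, and the function names are made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rules: guesses at spelling pairs that people use interchangeably.
# A real list would have to be learned or curated from actual address data.
VARIANT_RULES = [("th", "t"), ("dh", "d"), ("oo", "u"), ("ee", "i")]


def normalize(name: str) -> str:
    """Reduce a romanised place name to a rough canonical form."""
    s = re.sub(r"[^a-z]", "", name.lower())   # keep letters only
    for old, new in VARIANT_RULES:
        s = s.replace(old, new)                # fold known variants together
    return re.sub(r"(.)\1+", r"\1", s)         # collapse doubled letters


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names after normalisation."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    for a, b in [("Thiruvananthapuram", "Tiruvanantapuram"),
                 ("Thrissur", "Trichur"),
                 ("Kollam", "Kannur")]:
        print(a, b, round(similarity(a, b), 2))
```

A hand-written rule list like this will obviously miss many variants, which is part of why I am wondering whether a learned model would do better.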
Typically, an address contains these fields: name, house name, street name, post office name, district name and postal code. Some fields may be missing, and sometimes additional fields are added.
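Comparing every ration card address with every registration address would mean roughly 5 million × 13 million = 6.5 × 10^13 pairs, so some kind of grouping seems necessary. Below is a rough sketch of one option: group records by postal code and only compare field by field inside each group. It assumes the addresses are already parsed into fields and in Latin script. The `Address` class, the `WEIGHTS` values and the `0.85` threshold are all hypothetical choices of mine, not anything taken from the real data.

```python
from collections import defaultdict
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Address:
    name: Optional[str] = None
    house: Optional[str] = None
    street: Optional[str] = None
    post_office: Optional[str] = None
    district: Optional[str] = None
    pincode: Optional[str] = None


# Hypothetical weights: person name and house name count the most.
WEIGHTS = {"name": 0.4, "house": 0.4, "street": 0.1, "post_office": 0.1}


def address_score(a: Address, b: Address) -> float:
    """Weighted average of string similarities over fields present on both sides."""
    total = weight_sum = 0.0
    for field, w in WEIGHTS.items():
        x, y = getattr(a, field), getattr(b, field)
        if x and y:  # skip fields missing on either side
            total += w * SequenceMatcher(None, x.lower(), y.lower()).ratio()
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


def find_candidates(cards: List[Address], registrations: List[Address],
                    threshold: float = 0.85) -> List[Tuple[Address, Address, float]]:
    """Compare each card only with registrations sharing its postal code."""
    by_pincode = defaultdict(list)
    for reg in registrations:
        if reg.pincode:  # records without a postal code are ignored in this sketch
            by_pincode[reg.pincode].append(reg)
    matches = []
    for card in cards:
        for reg in by_pincode.get(card.pincode, []):
            score = address_score(card, reg)
            if score >= threshold:
                matches.append((card, reg, score))
    return matches
```

Grouping by postal code will miss records where the postal code is absent or wrong, so in practice a second grouping key (post office name or district, for example) would probably be needed as a fallback.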
What is the best way to compare addresses in 2 different languages and find similarities?
(I don’t work for the govt, the govt will never open this data, and there is no competition for this problem. I am just asking this question out of curiosity. I think deep learning algorithms can solve this problem. Address2Vec?? LSTM or Conv1D?? Looking for some suggestions. Maybe this community can teach Aju Saigal a better approach to solve this problem.)