The slides from my lightning talk can be viewed here. I really look forward to your ideas for ways of using word embeddings to improve NLP for Urdu and other low-resource languages.
Great talk, @samar! I was wondering if I could share your slides on Twitter?
Great one, @samar. We worked on improving linguistic resources for Urdu and Hindi a few years back and showed how these languages could benefit from each other. This was done in the pre-deep-learning era, but it might still be useful.
From the Abstract:
In this paper we show that we can get reasonable quality translations (we estimated the Translation Error rate at 18%) between the two languages even in the absence of a parallel corpus. Linguistic resources such as treebanks, part of speech tagged data and parallel corpora with English are limited for both these languages. We use the translation system to share linguistic resources between the two languages. We demonstrate improvements on three tasks and show: statistical machine translation from Urdu to English is improved (0.8 in BLEU score) by using a Hindi-English parallel corpus, Hindi part of speech tagging is improved (upto 6% absolute) by using an Urdu part of speech corpus and a Hindi-English word aligner is improved by using a manually word aligned Urdu-English corpus (upto 9% absolute in F-Measure).
Thanks for sharing your input @thunderingtyphoons! The absence of parallel corpora really is one of the biggest obstacles in translation for our languages. I’m glad to see you were able to make do without one - I will definitely give your paper a thorough read!
I am no expert here, but I am curious whether we could leverage an Urdu-to-English dictionary (or a dictionary for another language with richer embeddings) together with English embeddings to increase the vocabulary of Urdu word embeddings. It seems like this work uses a somewhat similar approach: https://arxiv.org/pdf/1602.01925.pdf
Interesting idea, @sravya8. I think this paper uses dictionaries primarily to find parallels between similar words across languages in the vector space, not to increase the vocabulary itself, since you are probably more likely to learn a larger vocabulary from a monolingual corpus than from a bilingual dictionary. However, what you suggest could be used to add rare words (contained in the dictionary, absent from the corpus) to the vector space, by placing them at a position corresponding to that of the translated word in the other language. This is based on a first read of the paper, but I will look at it and the cited works in more detail. Thanks for bringing it to my attention. Word embeddings sure are awesome. Do let me know if you have any other ideas for their use!
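To make the rare-word idea a bit more concrete, here is a minimal sketch of one way it might work. This is not the method from the linked paper, just an illustration under simplifying assumptions: the vectors are made-up 2-d toy values, the transliterated Urdu words are placeholders, and the function name `place_rare_word` is hypothetical. The rare word's English translation is compared against English words that already have a translation in the Urdu space, and the Urdu vectors of the closest ones are averaged, weighted by similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def place_rare_word(rare_en, en_vecs, ur_vecs, en_to_ur, k=2):
    """Estimate an Urdu-space vector for a word that is in the dictionary
    but absent from the Urdu corpus, using its English translation `rare_en`."""
    target = en_vecs[rare_en]
    # Anchors: English words whose Urdu translation already has a vector.
    anchors = [
        (cosine(target, en_vecs[en]), ur)
        for en, ur in en_to_ur.items()
        if en != rare_en and en in en_vecs and ur in ur_vecs
    ]
    # Keep the k anchors most similar to the rare word's translation.
    top = sorted(anchors, reverse=True)[:k]
    # Negative similarities are clipped so they do not contribute.
    weights = [max(sim, 0.0) for sim, _ in top]
    total = sum(weights)
    if total == 0:
        raise ValueError("no usable anchor words for " + rare_en)
    dim = len(next(iter(ur_vecs.values())))
    # Similarity-weighted average of the anchors' Urdu vectors.
    return [
        sum(w * ur_vecs[ur][i] for w, (_, ur) in zip(weights, top)) / total
        for i in range(dim)
    ]

# Toy data (made up for illustration only).
en_vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
           "car": [0.0, 1.0], "lynx": [0.95, 0.05]}
ur_vecs = {"billi": [0.0, 1.0], "kutta": [0.1, 0.9], "gari": [1.0, 0.0]}
en_to_ur = {"cat": "billi", "dog": "kutta", "car": "gari"}

# "lynx" stands in for the English translation of a rare Urdu word.
estimate = place_rare_word("lynx", en_vecs, ur_vecs, en_to_ur)
print(estimate)
```

With this toy data the estimate lands near the Urdu vectors for "billi" and "kutta" rather than "gari", which is the behaviour you would hope for. A real attempt would more likely learn a mapping between the two spaces from many dictionary pairs; the neighbour-averaging here is just the simplest thing that shows the idea.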
Hi guys. I read your discussion about word embeddings. I have some thoughts and questions that I would like to share with you; my goal is to better understand how ULMFiT uses embeddings. I have read a lot of papers for my thesis and I am a little confused. So my questions are:
- What kind of embeddings does ULMFiT use?
- Does anyone know what region embeddings are? Are LMs a kind of region embedding?