Over on the SF Study Group thread, @binga is suggesting making language models for multiple languages.
EDIT: This post has kind of diverged (mostly my fault) into a discussion of the philosophy of NLP and word embeddings. Fortunately, @lesscomfortable made a thread which has a place to sign up to make models for different languages. You should probably go there instead of reading through everything here.
I think this is a great idea and shouldn't just be limited to SF.
I think there is really exciting stuff going on in natural language processing right now. Since 31 Oct 2017, there have been THREE different papers on taking two monolingual word embeddings and making a bilingual dictionary out of them:
- [1711.00043] Unsupervised Machine Translation Using Monolingual Corpora Only
- https://arxiv.org/abs/1712.06961
- [1801.06126] Non-Adversarial Unsupervised Word Translation
Basically, it turns out that the "shape" of the point cloud in an N-dimensional word embedding space looks similar for different languages. This sort of makes sense. The axes of word embeddings seem to encode meaning, and if your N-dimensional space has one axis for "authority", one for "maleness", and one for "tradition", then the words king, roi, రాజు, and 国王 are all going to be high on authority/male/tradition. Another way of looking at it: because we are all human, all languages are going to have "ear" and "eye" and "mouth" be near each other, but none of them will have "colorless", "blue", and "ideas" near each other.
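If you want to poke at this yourself, here is a rough sketch, assuming you have gensim installed and one of the English fastText .vec files on disk (the file name and the 200,000-word limit below are just placeholders):

```python
# Rough check that "nearby meanings" really do get nearby vectors.
# Assumes an English fastText file like cc.en.300.vec has been downloaded.
from gensim.models import KeyedVectors

vecs = KeyedVectors.load_word2vec_format('cc.en.300.vec', limit=200000)

print(vecs.similarity('ear', 'eye'))          # body parts: fairly high cosine similarity
print(vecs.similarity('ear', 'mouth'))        # also fairly high
print(vecs.similarity('colorless', 'ideas'))  # unrelated words: much lower
```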
(Note: most word embedding spaces do not have such clearly defined axes as "authority", "maleness", and "tradition". My mental model for why is data compression: you can reuse the same word embedding axis for different meanings if they aren't related (like "male/female", "past/future", and "feasible/infeasible").)
I suspect that language models will also have some strong similarities: I think all languages have nouns and verbs, for example.
This means that by translating the word embedding from English to e.g. Malayalam and replacing the English embedding in the English-trained FitLaM with the Malayalam embedding, one might be able to just fine-tune on a small Malayalam corpus to get a good Malayalam language model.
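To make the "replace the embedding" part concrete, here is a minimal sketch, assuming the language model is a plain PyTorch module whose input layer is an `nn.Embedding`. The `encoder` attribute name and the `aligned_vecs` array are made-up placeholders, not the actual FitLaM internals:

```python
import torch
import torch.nn as nn

def swap_embedding(model, aligned_vecs):
    """Drop a (vocab_size x emb_dim) numpy array of Malayalam vectors, already
    mapped into the English embedding space, into the model's input embedding.
    `model.encoder` is a hypothetical attribute name; the real FitLaM code may
    organise its layers differently."""
    new_emb = nn.Embedding(aligned_vecs.shape[0], aligned_vecs.shape[1])
    new_emb.weight.data.copy_(torch.from_numpy(aligned_vecs).float())
    model.encoder = new_emb
```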
Facebook has made available word embeddings for 157 different languages:
Word vectors for 157 languages · fastText
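Loading one of those embeddings is straightforward with gensim. The Malayalam file name below follows the naming used on the fastText download page (language code "ml"), but double-check the exact file you grab:

```python
from gensim.models import KeyedVectors

# Malayalam vectors from the fastText "157 languages" release, unpacked to .vec text format.
ml_vecs = KeyedVectors.load_word2vec_format('cc.ml.300.vec', limit=200000)
print(ml_vecs.vector_size)  # should be 300
```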
For a given language, e.g. Malayalam, it might make sense to
- take Facebook's Malayalam word embedding,
- learn a transform to convert Facebook's Malayalam word embedding into FitLaM's embedding using one of the techniques from the three papers I referenced above (a sketch of this step follows the list),
- replace FitLaM's embedding layer with the new Malayalam->vector layer, and
- fine-tune FitLaM with a small Malayalam corpus.
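Here is a rough sketch of the transform step (and where the last two steps pick up). The three papers above learn the map with no dictionary at all; for a simpler illustration this uses a small seed dictionary and the classic orthogonal Procrustes solution, which is the supervised cousin of those methods. The `seed_pairs` list, the Malayalam KeyedVectors, and a `fitlam_vecs` word-to-vector lookup for the FitLaM embedding are all assumed to exist:

```python
import numpy as np

def learn_orthogonal_map(src_vecs, tgt_vecs, seed_pairs):
    """Learn an orthogonal W such that W @ src_vecs[s] is close to tgt_vecs[t]
    for seed translation pairs (s, t), via orthogonal Procrustes."""
    pairs = [(s, t) for s, t in seed_pairs if s in src_vecs and t in tgt_vecs]
    X = np.stack([src_vecs[s] for s, _ in pairs])  # source-language vectors, one per row
    Y = np.stack([tgt_vecs[t] for _, t in pairs])  # target-language vectors, one per row
    U, _, Vt = np.linalg.svd(Y.T @ X)              # closed-form Procrustes solution
    return U @ Vt

# W = learn_orthogonal_map(ml_vecs, fitlam_vecs, seed_pairs)
# Mapping the whole Malayalam vocabulary through W gives the matrix for the
# embedding swap sketched earlier, and the last step is ordinary fine-tuning.
```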