Over on the SF Study Group thread, @binga is suggesting making language models for multiple languages.
EDIT This post has (mostly my fault) diverged into a discussion of the philosophy of NLP and word embeddings. Fortunately, @lesscomfortable made a thread
which has a place to sign up to make models for different languages. You should probably go there instead of reading through everything here.
I think this is a great idea and shouldn’t just be limited to SF.
I think there is really exciting stuff going on in natural language processing right now. Since 31 Oct 2017, there have been THREE different papers on taking two monolingual word embeddings and building a bilingual dictionary out of them:
Basically, it turns out that the “shape” of the point cloud in an N-dimensional word embedding space looks similar for different languages. This sort of makes sense. The axes of word embeddings seem to encode meaning, and if your N-dimensional space has one axis for “authority”, one for “maleness”, and one for “tradition”, then the words king, roi, రాజు, and 国王 are all going to be high on authority/male/tradition. Another way of looking at it: because we are all human, all languages are going to have “ear” and “eye” and “mouth” near each other, but none of them will have “colorless”, “blue”, and “ideas” near each other.
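If the two point clouds really do have the same shape, then a single orthogonal rotation can line one up with the other. One of the standard techniques in those papers is orthogonal Procrustes: given a small seed dictionary of word pairs, find the rotation that best maps one embedding space onto the other. Here's a minimal NumPy sketch on toy data (the embeddings are random stand-ins, not real fastText vectors):

```python
import numpy as np

def procrustes_align(X, Y):
    """Learn an orthogonal map W minimizing ||X @ W - Y||_F.

    X, Y: (n_pairs, dim) arrays holding the embeddings of seed-dictionary
    word pairs (row i of X and row i of Y are translations of each other).
    The closed-form solution is W = U @ Vt, where U, S, Vt = svd(X.T @ Y).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy demo: make Y an exact rotation of X, then check we recover the rotation.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))                 # fake "English" seed vectors
R, _ = np.linalg.qr(rng.standard_normal((8, 8)))  # random orthogonal ground truth
Y = X @ R                                        # fake "Malayalam" seed vectors
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # True: the learned map reproduces Y
```

Real embeddings won't line up this perfectly, of course; the papers above measure how close the match gets and add refinements (e.g. iterative dictionary expansion) on top of this basic step.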
(Note: most word embedding spaces do not have such clearly defined axes as “authority”, “maleness”, and “tradition”. My mental model for why is data compression: you can reuse the same word embedding axis for several meanings as long as they aren’t related (like “male/female”, “past/future”, and “feasible/infeasible”).)
I suspect that language models will also have some strong similarities: I think all languages have nouns and verbs, for example.
This means that by transforming a Malayalam word embedding into the space of the English embedding and swapping it into an English-trained FitLaM, one might be able to fine-tune on just a small Malayalam corpus and get a good Malayalam language model.
Facebook has made available word embeddings for 157 different languages:
For a given language, e.g. Malayalam, it might make sense to
- take Facebook’s Malayalam word embedding,
- learn a transform to convert Facebook’s Malayalam word embedding into FitLaM’s embedding space, using one of the techniques from the three papers I referenced above,
- replace FitLaM’s embedding layer with the new Malayalam-to-vector layer,
- fine-tune FitLaM with a small Malayalam corpus.
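The steps above can be sketched as follows. Everything here is a toy stand-in (random arrays for the embeddings, a bare class for the model): the real FitLaM/fastai model and the fastText vectors have different shapes and APIs, and the learned transform would come from the alignment step rather than being random.

```python
import numpy as np

class ToyLanguageModel:
    """Stand-in for a FitLaM-style model: an embedding lookup plus a body
    of pretrained layers. Hypothetical structure, not the real fastai API."""
    def __init__(self, embedding, body):
        self.embedding = embedding  # (vocab_size, dim) lookup table
        self.body = body            # pretrained LM layers, kept untouched

# Pretrained English model (pretend weights).
english_emb = np.random.randn(100, 8)
model = ToyLanguageModel(english_emb, body="pretrained-english-lm-layers")

# Steps 1-2: take the Malayalam embedding and align it into the space the
# English model was trained on (here a random orthogonal matrix stands in
# for the transform a Procrustes-style method would learn).
malayalam_emb = np.random.randn(120, 8)
W, _ = np.linalg.qr(np.random.randn(8, 8))
aligned_malayalam_emb = malayalam_emb @ W

# Step 3: swap the embedding layer; the pretrained body is reused as-is.
model.embedding = aligned_malayalam_emb

# Step 4 would be fine-tuning on a small Malayalam corpus (not shown).
print(model.embedding.shape)  # (120, 8): new vocabulary, same embedding dim
```

The point of the swap is that the expensive part of the pretrained model (the body) never has to be retrained from scratch; only the cheap embedding lookup changes, and fine-tuning just has to adapt the body to the new language's statistics.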