Language models for multiple languages

Over on the SF Study Group thread, @binga is suggesting making language models for multiple languages.

EDIT: This post has uh kind of diverged (mostly through my fault) into a discussion of the philosophy of NLP and word embeddings. Fortunately, @lesscomfortable made a thread, Language Model Zoo :gorilla:, which has a place to sign up to make models for different languages. You should probably go there instead of reading through everything here.

I think this is a great idea and shouldn’t just be limited to SF.

I think there is really exciting stuff going on in the area of natural languages. Since 31 Oct 2017, there have been THREE different papers on taking two monolingual word embeddings and making a bilingual dictionary out of them:

Basically, it turns out that the “shape” of the point cloud in an N-dimensional word embedding space looks similar for different languages. This sort of makes sense. The axes of word embeddings seem to encode meaning, and if your N-dimensional space has one axis for “authority”, one for “maleness”, and one for “tradition”, then the words king, roi, రాజు, and 国王 are all going to be high on authority/male/tradition. Another way of looking at it: because we are all human, all languages are going to have “ear” and “eye” and “mouth” be near each other, but none of them will have “colorless”, “blue”, and “ideas” near each other.

(Note: most word embedding spaces do not have such clearly defined axes as “authority”, “maleness”, and “tradition”. My mental model of why is data compression. You can use the same word embedding axis for different meanings if they aren’t related (like “male/female”, “past/future”, and “feasible/infeasible”).)

I suspect that language models will also have some strong similarities: I think all languages have nouns and verbs, for example.

This means that it might be that by translating the word embedding from English to e.g. Malayalam and replacing the English embedding in English-trained FitLaM with the Malayalam embedding, one might be able to just fine-tune with a small Malayalam corpus to get a good Malayalam language model.

Facebook has made available word embeddings for 157 different languages:

For a given language, e.g. Malayalam, it might make sense to

  • take Facebook’s Malayalam word embedding,
  • learn a transform to convert Facebook’s Malayalam word embedding into FitLaM’s embedding using one of the techniques from the three papers I referenced above,
  • replace FitLaM’s embedding layer with the new Malayalam->vector layer,
  • fine-tune FitLaM with a small Malayalam corpus.
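A minimal sketch of the “learn a transform” step, under some loud assumptions: it uses the classic orthogonal Procrustes solution, which needs a seed dictionary of word pairs, and it is not claimed to be the method of any of the papers mentioned above. The toy arrays stand in for real embeddings, and `learn_orthogonal_map` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def learn_orthogonal_map(src, tgt):
    """Return the orthogonal matrix W minimising ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) arrays; row i of each is the embedding of a
    word pair assumed to be translations of each other.
    """
    # Orthogonal Procrustes: W = U @ Vt, where src.T @ tgt = U @ diag(S) @ Vt
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

# Toy data standing in for real embeddings (assumption: noise-free rotation).
rng = np.random.default_rng(0)
dim, n_pairs = 5, 50
english_like = rng.normal(size=(n_pairs, dim))            # pretend target space
true_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
malayalam_like = english_like @ true_rotation.T           # rotated copy as source

W = learn_orthogonal_map(malayalam_like, english_like)
mapped = malayalam_like @ W                               # source mapped into target space
max_error = np.abs(mapped - english_like).max()
is_orthogonal = np.allclose(W @ W.T, np.eye(dim))
```

With real embeddings the two point clouds are only approximately rotations of each other, so the error would not be near zero; the mapped vectors would then be used to initialise the new embedding layer before fine-tuning.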

Also: I mentioned on another topic that there is a paper which deals with the problem of polysemy, i.e. of one word meaning two different things. (Is a bank a place to put money or the thing at the side of a river?)

Word embedding axes are difficult to interpret; as I mentioned above, I think the axes are overloaded. Non-negative sparse embeddings make the axes more interpretable; I have a hunch (that is completely not backed up by any data) that if you have more interpretable axes, you will get better results from your language model.


Another approach that might be interesting:

  1. take a well-trained model (Jeremy’s wiki103 for example),
  2. train only the embedding layer on one language,
  3. train a new embedding layer on another language,
  4. compare the embeddings: look for nearest neighbors, compute cosine similarity, do a t-SNE on some subset of embeddings from both languages (this would be cool to portray visually if similar words from both languages are close in some sense).
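The comparison step could look roughly like the sketch below, assuming both embedding matrices already live in the same space. The word lists and vectors are made-up toy values, and `cosine_nearest` is a hypothetical helper, not a library function.

```python
import numpy as np

def cosine_nearest(query_vecs, candidate_vecs):
    """For each query row, return the index of the most cosine-similar candidate row."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    similarities = q @ c.T            # (n_queries, n_candidates) cosine matrix
    return similarities.argmax(axis=1)

# Hand-made toy embeddings (assumption: already in a shared space).
lang_a_words = ["king", "ear", "eye"]
lang_b_words = ["oeil", "roi", "oreille"]
lang_a_vecs = np.array([[0.9, 0.1, 0.0],
                        [0.0, 0.8, 0.3],
                        [0.1, 0.7, 0.6]])
lang_b_vecs = np.array([[0.1, 0.6, 0.7],
                        [1.0, 0.2, 0.1],
                        [0.0, 0.9, 0.2]])

matches = {
    lang_a_words[i]: lang_b_words[j]
    for i, j in enumerate(cosine_nearest(lang_a_vecs, lang_b_vecs))
}
```

For a real vocabulary the full similarity matrix gets large, so in practice one would compare a subset of words or process the queries in batches.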

My guess is that keeping the higher levels frozen might drive interesting similarities. Would be interesting to check :slight_smile:

Would also be cool to see, with these non-linear models, what the relationships between embeddings are even within a single language. For example: are synonyms clustered together, and is there anything interesting in going from synonyms to antonyms? Are there any similarities in direction / distance when we go from good -> bad, strong -> weak, etc.?
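One way to check the direction question is to compare offset vectors with cosine similarity. This is only a sketch: the 2-d vectors in `toy` are invented so that the effect is visible, and real embeddings would give much noisier numbers.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-d vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented toy embeddings (assumption: axis 0 roughly encodes "positive vs negative").
toy = {
    "good":   np.array([1.0, 0.2]),
    "bad":    np.array([-1.0, 0.2]),
    "strong": np.array([0.8, 0.9]),
    "weak":   np.array([-0.7, 0.9]),
    "cat":    np.array([0.1, 0.5]),
    "dog":    np.array([0.1, -0.4]),
}

def offset(start, end):
    """Vector pointing from one word's embedding to another's."""
    return toy[end] - toy[start]

# Antonym pairs: do the two offsets point the same way?
antonym_alignment = cosine(offset("good", "bad"), offset("strong", "weak"))
# Control pair that is not an antonym relation.
control_alignment = cosine(offset("good", "bad"), offset("cat", "dog"))
```

If antonym offsets in a trained model lined up consistently more than control pairs do, that would suggest a shared “opposite” direction.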

NB: Ducky added the numbers to the list.

I’m a bit confused here. In #3, how are you proposing to train another embedding layer? Are you suggesting freezing all of the #1 language model except for the embedding layer? If so, I think that will be a bit of trouble because the language model is basically learning grammar (I think), and the grammar of different languages is going to be very different. But yes, it might be an interesting way to force different languages into the same embedding space.

I’m not sure if you mean for #2 to be non-English or not.

In #2, you say “train only the embedding layer on one language”. If you mean “separate from the language model in #1”, then that work has already been done – Facebook has 157 language embeddings.

I think providing a ‘model zoo’ of LMs would be absolutely awesome.

For the vast majority of languages, I don’t think any transfer learning is needed to create the best possible language model, since it’s so easy to create a large corpus of text (just scrape a bunch of news sites, government sites, etc). So I’d suggest just grabbing a corpus and starting to train! :slight_smile:

(often you can even download a pre-created corpus, such as this 16 mil word malayalam corpus: )


Oh and if some of you do start creating LMs in your languages, I’m happy to host them and provide a site to make them discoverable.


That was the idea :slight_smile: Thank you for pointing this out - I didn’t think about the grammar part.

I’d love to contribute to this effort! What an awesome community!

I am interested in contributing to this effort. I am based in SF. What are the potential next steps? Maybe I can contribute to Bengali language.

Make a Bengali language model. :wink:

I think the most straightforward steps are:

  1. Find a corpus for Bengali
  2. Create a Bengali word embedding (or grab one from Facebook).
  3. Use to generate a language model, using the embedding you got from step 2.
  4. Save the model weights.
  5. Send the model weights to @jeremy.

OR try the thing I was suggesting in my first post on this topic, where you learn the mapping between the Facebook Bengali word embedding to the wikipedia103 embedding used in I think that would take more GPU cycles to process, but far less learning on your part.

for more.


Absolutely! To get the ball rolling, I did some reading in this area yesterday and found a blog post that covers the general approaches quite comprehensively.

To summarise the post: machine translation, one of the obvious downstream tasks for LMs / word embeddings, has traditionally been seen as a supervised problem. However, because there are not enough parallel corpora, a number of research papers talk about cross-lingual embeddings and adversarial training. I’m not familiar with the adversarial training techniques, but the cross-lingual embedding approaches talk about computing a metric that essentially measures the similarity of the distribution of the points in the N-dimensional space.

Facebook has a library for this:

And maybe once we converge, the embeddings that are close to one another in the common space mean something to each other?? I’m just thinking out loud!


This is a really awesome initiative, count me in to contribute as well!

You don’t need a word embedding. I haven’t seen any improvement by using a pretrained word embedding.


None of this complex stuff people are talking about in this thread is necessary and isn’t likely to help! :smiley: All you need to do is grab a corpus and train a language model. That’s it! Seriously :slight_smile:


We have spent so much time praising embeddings since the movielens days that it’s hard to unlearn overnight.

The other confusing piece is an embedding layer inside the LM.

Just remember - an embedding is just a regular linear layer for one hot encoded inputs. It’s not special in any way!
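A tiny NumPy sketch of that equivalence (the sizes and token ids are arbitrary, chosen for illustration): multiplying one-hot rows by a weight matrix gives the same result as simply indexing rows of that matrix, which is all an embedding lookup does.

```python
import numpy as np

vocab_size, emb_dim = 6, 4
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, emb_dim))   # the linear layer's weight matrix

token_ids = np.array([2, 5, 2])                    # arbitrary example tokens
one_hot = np.eye(vocab_size)[token_ids]            # (3, vocab_size) one-hot rows

via_matmul = one_hot @ weights                     # linear layer on one-hot input
via_lookup = weights[token_ids]                    # embedding-style row lookup

same = np.allclose(via_matmul, via_lookup)
```

The lookup is just a faster way to compute the same thing, since it skips multiplying by all the zeros.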


I get it Jeremy…I’m just commenting on all the love embeddings are getting on the forums today.

@jeremy, if word embeddings aren’t important, why do we go to the trouble of loading up the wikipedia103 “encoding” (which, near as I can tell, is a word embedding) in (I’m trying to understand, not trying to attack.)

Is the encoding _not_ actually a word embedding?

Embeddings are just vectors that are output from an embedding layer (a dense/linear layer of neurons). The quality of the embeddings produced by that layer is what we are really after.

Now, the quality of the embedding layer is heavily dependent on the other layers (the rest of the model) and the objective fn of the model.

In the case of word2vec, the whole model was just 1 linear layer. And the objective fn was also slightly weaker than that of a full language model.

In our AWD-LSTM LM, we have multiple layers of LSTMs, which will beat a single-layer model easily.

Other than that we have nothing against the poor embeddings. We love our vectors!

To add to the confusion, lots of websites actually call embedding vectors “models”, and some even call them “deep learning”…