torkku
(Aki Rehn)
March 3, 2017, 3:43pm
1
I’m still doing the CNN part of the course, but I’m pretty sure these might be useful for transfer learning in NLP when I get that far:
# Pre-trained word vectors
We are publishing pre-trained word vectors for 294 languages, trained on [*Wikipedia*](https://www.wikipedia.org) using fastText.
These 300-dimensional vectors were obtained using the skip-gram model described in [*Bojanowski et al. (2016)*](https://arxiv.org/abs/1607.04606) with default parameters.
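For a sense of what "skip-gram with default parameters" means in practice, here is a minimal training sketch using the fastText Python bindings; the bindings and the input file name `wiki.txt` are assumptions on my part, since the published vectors were produced with the fastText tool itself:

```python
import fasttext  # assumed: official fastText Python bindings (pip install fasttext)

# Train 300-dimensional skip-gram vectors with otherwise default parameters,
# mirroring the setup described in Bojanowski et al. (2016).
# 'wiki.txt' is a hypothetical plain-text Wikipedia dump, one article per line.
model = fasttext.train_unsupervised('wiki.txt', model='skipgram', dim=300)
model.save_model('wiki.bin')  # binary format; the text (.vec) format is a separate export
```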
## Format
The word vectors come in fastText's default binary and text formats.
In the text format, each line contains a word followed by its embedding; the values are space-separated.
Words are ordered by frequency, in descending order.
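To make the text format concrete, here is a minimal loading sketch in Python; it assumes the usual `.vec` layout (a header line with vocabulary size and dimension, then one word plus its values per line), and the file name is only illustrative:

```python
import numpy as np

def load_vectors(path, limit=None):
    """Read a fastText .vec (text) file into a dict mapping word -> vector."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        n_words, dim = map(int, f.readline().split())  # header: "<count> <dim>"
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break  # words are frequency-sorted, so the first N are the most common
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# e.g. vecs = load_vectors('wiki.en.vec', limit=50000)
```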
## License
The pre-trained word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).
## References
If you use these word embeddings, please cite the following paper:
P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
davecg
(David Gutman)
March 3, 2017, 4:58pm
2
These are a few years old now, but you can also get pre-trained Word2Vec vectors trained on 100 billion words of Google News text (vocabulary size 3 million, 300-dimensional vectors):
https://code.google.com/archive/p/word2vec/
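Those come as one big binary file; here is a sketch of loading it with gensim's `KeyedVectors`, assuming you've downloaded and unpacked the archive from the link above (`limit` is just to keep memory down for a quick test):

```python
from gensim.models import KeyedVectors

# Load the binary word2vec file; limit caps how much of the vocabulary is read into RAM.
vecs = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)

# The Google News vectors are case-sensitive; a quick nearest-neighbour sanity check:
print(vecs.most_similar('Paris', topn=5))
```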
Hey,
they updated the fastText models for 294 languages and added two tutorials:
Best,
Benedikt