torkku
(Aki Rehn)
March 3, 2017, 3:43pm
1
I’m still doing the CNN part of the course, but I’m pretty sure these might be useful for transfer learning in NLP when I get that far:
# Pre-trained word vectors
We are publishing pre-trained word vectors for 294 languages, trained on [*Wikipedia*](https://www.wikipedia.org) using fastText.
These 300-dimensional vectors were obtained using the skip-gram model described in [*Bojanowski et al. (2016)*](https://arxiv.org/abs/1607.04606) with default parameters.
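For a sense of what "skip-gram with default parameters" means in practice, here is a minimal training sketch using the fastText Python bindings; the bindings and the input file name `wiki.txt` are assumptions on my part, since the published vectors were produced with the fastText tool itself:

```python
import fasttext  # assumed: official fastText Python bindings (pip install fasttext)

# Train 300-dimensional skip-gram vectors with otherwise default parameters,
# mirroring the setup described in Bojanowski et al. (2016).
# 'wiki.txt' is a hypothetical plain-text Wikipedia dump, one article per line.
model = fasttext.train_unsupervised('wiki.txt', model='skipgram', dim=300)
model.save_model('wiki.bin')  # binary format; the text (.vec) format is a separate export
```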
## Format
The word vectors come in fastText's default binary and text formats.
In the text format, each line contains a word followed by its embedding; the values are space-separated.
Words are ordered by frequency, in descending order.
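To make the text format concrete, here is a minimal loading sketch in Python; it assumes the usual `.vec` layout (a header line with vocabulary size and dimension, then one word plus its values per line), and the file name is only illustrative:

```python
import numpy as np

def load_vectors(path, limit=None):
    """Read a fastText .vec (text) file into a dict mapping word -> vector."""
    vectors = {}
    with open(path, encoding='utf-8') as f:
        n_words, dim = map(int, f.readline().split())  # header: "<count> <dim>"
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break  # words are frequency-sorted, so the first N are the most common
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# e.g. vecs = load_vectors('wiki.en.vec', limit=50000)
```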
## License
The pre-trained word vectors are distributed under the [*Creative Commons Attribution-Share-Alike License 3.0*](https://creativecommons.org/licenses/by-sa/3.0/).
## References
If you use these word embeddings, please cite the following paper:
P. Bojanowski\*, E. Grave\*, A. Joulin, T. Mikolov, [*Enriching Word Vectors with Subword Information*](https://arxiv.org/abs/1607.04606)
davecg
(David Gutman)
March 3, 2017, 4:58pm
2
These are a few years old now, but you can also get pre-trained Word2Vec vectors trained on 100 billion words of Google News text (vocabulary size 3 million, 300-dimensional vectors):
https://code.google.com/archive/p/word2vec/
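Those come as one big binary file; here is a sketch of loading it with gensim's `KeyedVectors`, assuming you've downloaded and unpacked the archive from the link above (`limit` is just to keep memory down for a quick test):

```python
from gensim.models import KeyedVectors

# Load the binary word2vec file; limit caps how much of the vocabulary is read into RAM.
vecs = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)

# The Google News vectors are case-sensitive; a quick nearest-neighbour sanity check:
print(vecs.most_similar('Paris', topn=5))
```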
Hey,
they updated the fastText models for 294 languages and added two tutorials:
Best,
Benedikt