Good heavens no! The wt103 LM is far far more than a word embedding. I discuss this in the lesson, so try watching it again if you have a chance, and let us know what you find.
Sorry, I meant the encoding layer, `0.encoder.weight`, not the whole language model. We do a bunch of shuffling around of the weights so that the embeddings are right for the IMDB word integers. If the embedding is not that important, why not just initialize it to random and train it up?
I think that’s what you are suggesting for e.g. Malayalam: don’t bother using a pre-trained encoding, just train it up. I’m trying to square that with us bothering to use the pre-trained encoding for English in `imdb.py`. What is different?
I think you still benefit if you start from pre-trained weights, just like what we do for transfer learning in computer vision: you first fine-tune the last layer and then train the whole model.
An embedding layer is trained with a shallow network; roughly speaking, it is also a special pre-trained model that dropped the output layer, with only one layer left. A pre-trained language model has more non-linearity and can thus represent richer information.
So why does @jeremy say not to bother with the Facebook pre-trained embeddings? I am genuinely confused here. It doesn’t seem like it would be a lot of work to bolt one on a model, yet he seems to say not to bother.
Ah thanks for explaining!
Because we’ve pre-trained a whole language model, the weights of each layer work closely together. We can’t just replace a layer with random weights and expect it to keep working.
The first layer is a linear layer that uses a one-hot encoded input - we call this an ‘embedding’. Since it’s just a regular layer, and it was part of a pre-trained model, we can’t initialize it to random weights.
However, our first layer weight matrix is incomplete - it doesn’t include entries for some of the words in our new (IMDB) corpus. Therefore we have to initialize these missing entries somehow.
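That remapping can be sketched in a few lines of numpy. The vocabularies and variable names below are hypothetical stand-ins for the real `itos`/`stoi` mappings and `0.encoder.weight`; the key move is copying rows for words the LM already knows and falling back to the row mean for the rest:

```python
import numpy as np

# Hypothetical pre-trained setup: 5 words known to the LM, embedding dim 4.
enc_wgts = np.random.randn(5, 4)              # stands in for 0.encoder.weight
lm_itos = ['the', 'movie', 'was', 'good', 'bad']
lm_stoi = {w: i for i, w in enumerate(lm_itos)}

# New (IMDB-like) vocabulary, partially overlapping the LM vocab.
new_itos = ['the', 'film', 'was', 'great']

row_mean = enc_wgts.mean(axis=0)              # fallback for unseen words
new_wgts = np.zeros((len(new_itos), enc_wgts.shape[1]))
for i, w in enumerate(new_itos):
    r = lm_stoi.get(w, -1)
    # Copy the pre-trained row if the word is known, else use the row mean.
    new_wgts[i] = enc_wgts[r] if r >= 0 else row_mean
```

The rest of the model's weights are left untouched; only the rows of the first layer get shuffled to match the new word integers.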
This is largely orthogonal to your other question, which is: why don’t we use word vectors from something like word2vec in creating our initial pre-trained language model? And the answer is: our language model already will almost certainly have enough data that these won’t help at all. (They won’t hurt, mind you - it’s just an unnecessary complexity). Remember, stuff like word2vec were created using a linear model, and they weren’t created for our purpose, which is being the first layer of a language model.
So between the twin issues of already having enough data to not need word2vec, and that word2vec is a simple linear model not that related to our task, we wouldn’t expect them to help.
Does that explain things a bit better?
Sorry late to the party.
I speak Malayalam, and one of my intentions in joining this course was to make AI tools/datasets available for Malayalam.
We don’t have enough datasets available for Malayalam. Some government institutions and universities have them, but they don’t open-source them. We have yet to learn the ‘open data’ culture.
Previously I used this tool from Facebook research to align monolingual word embeddings: https://github.com/facebookresearch/MUSE. Using it, you can align word embeddings from one language to another. I think it works too: I checked the cosine similarity between the aligned word vectors and got good results between Hindi and Malayalam. I have yet to try an end-to-end deep learning model using it.
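The supervised step of MUSE-style alignment is essentially orthogonal Procrustes: given paired anchor vectors in two languages, find the rotation that maps one space onto the other. A self-contained numpy sketch with synthetic data (not MUSE’s actual code; here the “target” space is a rotation of the source by construction, so the rotation is recovered exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "source" embeddings (say, Malayalam) and a "target" space (say, Hindi)
# that is, by construction, a rotation of the source: Y = R @ X.
d, n = 5, 20
X = rng.standard_normal((d, n))                    # one anchor word per column
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random ground-truth rotation
Y = R @ X

# Orthogonal Procrustes: find the orthogonal W minimising ||W X - Y||_F,
# via the SVD of Y X^T. This is the closed-form supervised alignment step.
u, _, vt = np.linalg.svd(Y @ X.T)
W = u @ vt
```

With real embeddings the fit is only approximate, of course, but nearest-neighbour lookups through `W` are what give you the bilingual dictionary.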
In our case, as Jeremy pointed out, we don’t need word embeddings anymore. (But we may have to create a similar tool to ‘align language models’, if that is something that can be done.)
@binga I assume you used the Wikipedia dataset to train your model. Did you check the way Facebook trained their latest fastText word embeddings? They used the Common Crawl dataset, separated into per-language datasets by running a language detector on it. This way they increased the number of unique tokens available for languages like Malayalam. See here: https://arxiv.org/pdf/1802.06893.pdf
Can we build similar per-language datasets from Common Crawl? I don’t have a multi-GPU setup to process data as big as Common Crawl and separate the multi-language content from it. If we can separate it, that would also be a big contribution to languages like Malayalam.
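For pulling out Malayalam pages specifically, even a crude Unicode-script heuristic goes a long way as a first pass (the paper uses fastText’s language identifier; this toy version just counts characters in the Malayalam block, U+0D00–U+0D7F):

```python
def malayalam_fraction(text):
    """Fraction of alphabetic characters that fall in the Malayalam block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    # The Malayalam script occupies U+0D00 .. U+0D7F.
    ml = sum(1 for c in letters if '\u0d00' <= c <= '\u0d7f')
    return ml / len(letters)

def looks_malayalam(text, threshold=0.5):
    """Crude page-level filter: majority of letters are Malayalam."""
    return malayalam_fraction(text) >= threshold
```

A real pipeline would use a trained classifier (script alone can’t separate, say, Hindi from Marathi, which share Devanagari), but for a script unique to one language this filter is nearly free to run over a crawl.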
Okay, let me see if I can restate. I will follow up in a minute.
- In order to train the rest of the language model well, we need a large corpus from that language. In fact, it needs to be so big that it will be plenty big enough to train the embedding layer.
- Something like word2vec is trained differently from how we train our word embedding layer, so if you slapped word2vec weights into our word embedding layer, those weights would probably get significantly changed anyway.
The first assertion seems to imply that the amount of data you need to train N layers from scratch is the same as the amount you need if you train N-1 layers from scratch. Maybe this is true, but it is counter-intuitive to me. My mental model of why we freeze everything but the last layer of e.g. dogs&cats to train e.g. horses&zebras is partly that we then need a whole lot less data than if we were going to train the whole thing on horses&zebras.
Perhaps you are really saying, “There are gobs and gobs of words available and compute time is cheap, so you shouldn’t worry about being slightly wasteful.” While it is true that there are gobs and gobs of words available for many languages, that’s not necessarily true for all languages. (I bet Fula doesn’t have a large corpus. )
The second assertion implies that word embeddings created in different ways will be fundamentally different. Maybe this is true, but this also doesn’t match my intuition.
My intuition is that there exists a high-dimensional Ur-embedding in which the axes correspond to dimensions of meaning, like male/female, high/low status, or high/low formality. Thus “king” would be high on male/high-status/high-formality, while “queen” would be the same but low on male.
Why are the axes of the typical embeddings so uninterpretable? Imperfection, polysemy, and data compression, I think.
- Data compression: If you look at the work Murphy has done on non-negative sparse embeddings, a simple transformation makes the axes much more interpretable, at the cost of having more axes/dimensions. Alternatively, if you “overload” the axes, you can use the same axis for two different qualities that don’t cause ambiguity. For example, since you never use “quickly” to describe nouns, you could use the same axis for “quickly/slowly” as for “male/female”. And then you can rotate the axes to make them fit your space better (or because the models don’t care if they are rotated).
- Polysemy: Is a `bank` a place you put your money, or the thing at the side of a river? Techniques for creating word embeddings typically end up kind of averaging the two meanings to place the one point, which is just wrong.
- Imperfection: I’m sure we just aren’t all that good at it yet.
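The non-negative sparse embedding idea above can be sketched with plain Lee–Seung multiplicative updates; the matrix and sizes here are toy values, not Murphy’s actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy non-negative "embedding" matrix: 8 words x 6 dense dimensions.
V = np.abs(rng.standard_normal((8, 6)))

# Factorise V ~= W @ H with W, H >= 0 (Lee & Seung multiplicative updates).
# Non-negativity plus sparsity is what makes NNSE-style axes interpretable:
# each word loads on only a few axes, and loadings can't cancel each other.
k, eps = 4, 1e-9
W = np.abs(rng.standard_normal((8, k)))
H = np.abs(rng.standard_normal((k, 6)))
err0 = np.linalg.norm(V - W @ H)
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err = np.linalg.norm(V - W @ H)
```

Each multiplicative update is guaranteed not to increase the reconstruction error, and the factors stay non-negative throughout because the updates only ever multiply by non-negative ratios.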
So if I am right, and word embeddings capture linguistic meaning, then it shouldn’t matter too terribly much which technique you use, as long as the loss function enforces that meaningful sentences happen. (And in fact, that’s what we see… BOW, skip-grams, whatever, it just doesn’t seem to matter all that much.)
“More research is needed.”
It matters a very great deal. The very idea that there’s such a thing as a “word embedding” in a general sense seems extremely questionable at this stage. I’m having a lot of trouble really understanding the concerns and suggestions you’ve stated, since they seem so tied up in the idea that the first layer’s weights are somehow particularly special and interesting. Whilst this is what people used to assume (at least implicitly), it’s not what CoVe, ELMo, and my recent work with Sebastian show.
That’s close - but it’s not that there’s lot of words available, but lots of language (i.e. documents) available.
I made the point about (nearly) dead languages in class - my comments about LM training only apply to currently-used real-world languages that are used by a reasonable number of people that use computers. i.e. languages used for at least a couple of thousand pages on the internet or similar.
Could you please help us understand what your definition of “linguistic meaning” is? Because the way I see it, word embeddings certainly don’t capture a lot of “meaning”; they tend to place similar tokens at similar distances in high-dimensional spaces, nothing beyond that.
I certainly do not deny that it’s useful: Spotify has music embeddings, Flipkart has product embeddings, and there are a bunch of word embeddings used in some narrow domains (I use them in my products too).
I don’t know that embedding layers are “particularly special”; I agree that they are just one layer. I accept and agree that it is not essential to have a pre-trained embedding layer like Facebook provides. It’s just that they are right there, ripe for the taking, so why wouldn’t you use one? It seems like using one is sort of like setting the mean when `r` is not in `enc_wgts`: you could just set `enc_wgts[r]` to zero if `r` is not in `enc_wgts`, but why wouldn’t you use the mean if you have it?
I understand that “all the embeddings do” is put similar tokens at similar distances in high-dimensional spaces. You would think that wouldn’t be particularly interesting. However, something is going on: you can do a whole lot more with word embeddings than you would think from that. Example 1: you can do a passable job (even with our polysemantic word embeddings!) of discovering analogies (given `king`, use the embedding to come up with `queen`). Example 2: you can construct passable bilingual dictionaries from monolingual corpora using word embeddings.
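The analogy trick in Example 1 is just vector arithmetic plus a nearest-neighbour search. A toy sketch with hand-built vectors (the “meaning axes” here are invented for illustration; real embeddings only learn such directions approximately):

```python
import numpy as np

# Toy embeddings with hand-built "meaning" axes: [royalty, male, female].
vecs = {
    'king':   np.array([1.0, 1.0, 0.0]),
    'queen':  np.array([1.0, 0.0, 1.0]),
    'man':    np.array([0.0, 1.0, 0.0]),
    'woman':  np.array([0.0, 0.0, 1.0]),
    'palace': np.array([1.0, 0.0, 0.0]),  # royal but ungendered distractor
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = vecs[a] - vecs[b] + vecs[c]
    best, best_sim = None, -2.0
    for w, v in vecs.items():
        if w in (a, b, c):
            continue
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best
```

Here `king - man + woman` lands exactly on `queen`; with learned embeddings the target point merely lands *nearer* to the right word than to anything else, which is what makes the emergent structure surprising.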
So even though, yeah, word embeddings are (usually) made just by figuring out similarities, there are hella interesting emergent properties that come out. Me, I think that is what we colloquially term “meaning”.
Well, the proof is in the pudding is in the experimental results, so give it a go and see what you find! The only reason not to do it is that it’s a little extra work to go find and download and use word embeddings, and if there’s any improvement then that work is almost certainly worth it. My guess is that for modern languages in regular use that it won’t help at all, assuming you train a reasonable LM in a reasonable way, but I’d very much like to see whether my guess is right or not…
(It’s possible the LM training may be a tiny bit faster - I’m not sure - although I don’t know that that counts as a win, since each language only needs this done once, and then we can share the LM in our upcoming LM model zoo!)
…which I think are now obsoleted by the end2end fine-tuning that we learnt in the last lesson. Although we don’t have experimental results on seq2seq or sequence labeling tasks so we can’t be sure yet.
Yeah, I was coming to the realization that I’m going to have to do that. In my copious spare time.
Oh, don’t tell me that until after I finish the papers!
You have a whole refugee family there that you can now teach deep learning to and they can act as your grad students. I was assuming that was your cunning plan all along…
LOL! Well, helping them find a job is one of my team’s responsibilities…
Lot of respect, Kaitlin.