Why do we divide the embedding by 3?

nima · November 29, 2016, 12:42am

In the create_emb function, we divide emb by 3. I don’t recall the reasoning behind this – why is that?

def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = idx2word[i]
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word):
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    emb/=3
    return emb

jeremy · November 29, 2016, 12:53am

That, and the 0.6 scale, are horrible hacks. I found that the stddev of the GloVe vectors was about 0.6, so I used that for my randomly generated vectors. Then I divided them all by 3 to get similar weights to what Glorot initialization would provide, IIRC. I can’t quite remember the details, however; I just threw it together one night and forgot to document what I was doing. You shouldn’t assume I did a good job of this, so feel free to play around with different scales.

(Although today I’ll show you a better way to handle this!)
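
For anyone who wants to check the scales themselves, here is a rough sketch of the arithmetic described above. It is not the course code: fake_vecs is a random stand-in for the GloVe matrix (assumed stddev 0.6), glorot_normal_std is a hypothetical helper using the usual Glorot formula sqrt(2 / (fan_in + fan_out)), and the fan sizes are placeholders, since the thread doesn't say which layer sizes were used for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the GloVe matrix: random values with stddev 0.6 (assumption).
n_words, n_fact = 10000, 50
fake_vecs = rng.normal(scale=0.6, size=(n_words, n_fact))

# Dividing by 3 rescales every value, so the stddev drops by the same factor.
scaled = fake_vecs / 3
std_before = fake_vecs.std()
std_after = scaled.std()


def glorot_normal_std(fan_in, fan_out):
    """Stddev targeted by Glorot (Xavier) normal initialization."""
    return np.sqrt(2.0 / (fan_in + fan_out))


print(f"stddev before dividing:     {std_before:.3f}")
print(f"stddev after dividing by 3: {std_after:.3f}")
# Placeholder fan sizes; swap in the sizes of the layer you care about.
print(f"Glorot stddev (50, 50):     {glorot_normal_std(50, 50):.3f}")
```

Changing the divisor or the fan sizes is an easy way to play around with different scales and see how far the result sits from the Glorot target for a given layer.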

nima · November 29, 2016, 1:09am

Cool, thanks.

Turns out that not dividing at all doesn’t noticeably change the performance of the models (as far as I’ve tested).