Why do we divide the embedding by 3?

That, and the 0.6 scale, are horrible hacks. I found that the stddev of the GloVe vectors was about 0.6, so I used that for my randomly generated vectors. Then I divided the whole embedding by 3 to get weights similar to what Glorot initialization would provide, IIRC. I can't remember the details, though: I threw it together one night and never documented what I was doing. Don't assume I did a good job of this, and feel free to play around with different scales.

(Although today I’ll show you a better way to handle this!)
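For anyone who wants to experiment with the scales, here is a minimal sketch of the hack described above. It is not the original notebook code: the sizes, the `n_known` split, and the stand-in `glove` array are all made up for illustration, and `glorot_std` is a hypothetical helper that just applies the standard Glorot formula so you can compare scales for whatever layer shape you care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, emb_dim = 5000, 50
n_known = 4000  # assumed number of words that have a pretrained vector

# Stand-in for real GloVe vectors (stddev of about 0.6, per the post).
glove = rng.normal(0.0, 0.6, size=(n_known, emb_dim))

emb = np.empty((vocab_size, emb_dim), dtype=np.float32)
emb[:n_known] = glove
# Words without a pretrained vector get random vectors at the same scale.
emb[n_known:] = rng.normal(0.0, 0.6, size=(vocab_size - n_known, emb_dim))

# The "divide by 3" step: shrink the whole matrix.
emb /= 3


def glorot_std(fan_in, fan_out):
    """Stddev of Glorot (Xavier) initialization: sqrt(2 / (fan_in + fan_out))."""
    return float(np.sqrt(2.0 / (fan_in + fan_out)))


print("embedding stddev after scaling:", emb.std())
```

Swap the 0.6 and the 3 for other values and check `emb.std()` against `glorot_std(...)` for your layer. The post doesn't say which fan-in and fan-out the comparison was based on, so treat any particular target as a guess.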
