So I was confused by one particular line in the code:
```python
class Seq2SeqRNN(nn.Module):
    def __init__(self, vecs_enc, itos_enc, em_sz_enc, vecs_dec, itos_dec,
                 em_sz_dec, nh, out_sl, nl=2):
        # ... lots of code ...
        self.out.weight.data = self.emb_dec.weight.data
        # ... code continues ...
```
I was confused about how the linear layer and the embedding layer can share weights when they seem to have different shapes: the linear layer is nn.Linear(300, len(en_itos)), whereas the embedding is nn.Embedding(len(en_itos), 300).
So I went and inspected both their sizes.
It turns out that nn.Embedding(17573, 300).weight.data and nn.Linear(300, 17573).weight.data both have the same size: [torch.cuda.FloatTensor of size 17573x300].
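Here is a minimal sketch of that check (17573 is just len(en_itos) from the notebook; a recent PyTorch prints torch.Size rather than the old FloatTensor summary):

```python
import torch.nn as nn

emb = nn.Embedding(17573, 300)   # 17,573 words, each mapped to a 300-dim vector
lin = nn.Linear(300, 17573)      # maps a 300-dim hidden state to 17,573 logits

print(emb.weight.data.shape)     # torch.Size([17573, 300])
print(lin.weight.data.shape)     # torch.Size([17573, 300])  -- same shape
```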
That’s when I realized that the linear layer is really just a matrix multiplication, Wx + b (duh).
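A quick way to convince yourself of that (a small sketch, assuming a single fake 300-dim input x): PyTorch stores W with shape (out_features, in_features) and computes x @ W.T + b, which is the same thing as Wx + b written row-wise.

```python
import torch
import torch.nn as nn

lin = nn.Linear(300, 17573)
x = torch.randn(1, 300)          # a fake 300-dim decoder output

# W has shape (out_features, in_features) = (17573, 300)
y = lin(x)
assert torch.allclose(y, x @ lin.weight.t() + lin.bias)
print(y.shape)                   # torch.Size([1, 17573]) -- one logit per word
```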
So W will have the shape (17573, 300). But why does the embedding layer have the same shape? Because its weight isn’t used in a matrix multiplication at all: nn.Embedding is a lookup table. You just query a word (one of the 17,573 in this case) and get back its 300-dim vector, i.e. one row of that same (17573, 300) matrix. That’s why Jeremy could tie the decoder embedding and the output linear layer weights.
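Here’s a tiny sketch of that tying outside the full Seq2SeqRNN class (the names emb_dec and out mirror the notebook; the word index 42 is just an arbitrary example):

```python
import torch
import torch.nn as nn

emb_dec = nn.Embedding(17573, 300)      # decoder embedding: lookup table of word vectors
out = nn.Linear(300, 17573)             # output layer: hidden state -> vocab logits
out.weight.data = emb_dec.weight.data   # both now share one (17573, 300) tensor

idx = torch.tensor([42])                # look up an arbitrary word id
vec = emb_dec(idx)                      # shape (1, 300): row 42 of the shared matrix
assert torch.equal(vec[0], out.weight.data[42])
```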
It all sounds simple in hindsight, but I was stumped by this for a while, so I hope it helps anybody else who was confused about the weight sharing in this code.