Lesson 12 NLP Deep Dive

How is the final layer (self.h_o) in the model below predicting the next word? Don't we need an activation function like softmax to get the probabilities?

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden: token embedding
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output: one score per vocab word

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

Hello,

You are right that softmax needs to be applied to the model's output to turn it into probabilities. However, if you take a look at the training code,

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

The loss function is F.cross_entropy, which combines log softmax (used in lieu of a regular softmax for better numerical stability) and negative log-likelihood loss, so your model does not need a final softmax layer during training.
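As a quick sanity check, here is a minimal sketch (with made-up logits and targets, not the lesson's data) showing that F.cross_entropy gives the same result as applying F.log_softmax followed by F.nll_loss:

import torch
import torch.nn.functional as F

# Hypothetical raw outputs for a batch of 2 examples over a 5-word vocab.
logits = torch.randn(2, 5)
targets = torch.tensor([1, 3])

# cross_entropy = log_softmax + negative log likelihood, done in one call.
loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(loss_a, loss_b))  # True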

For inference, the code in the lesson is,

n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

There's no softmax, but that is because we're interested merely in the hard prediction (i.e. the category), not the probabilities. Since softmax does not change the ordering (the highest value before softmax is still the highest after softmax, the second highest before is still the second highest after, and so on), we may directly take the argmax of the model's output.
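To illustrate that ordering is preserved, here is a tiny sketch with made-up logits showing that the argmax is the same before and after softmax:

import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) the model might output for one input.
logits = torch.tensor([2.0, -1.0, 0.5, 3.2])
probs = F.softmax(logits, dim=-1)

# softmax is monotonic, so the index of the largest value is unchanged.
print(torch.argmax(logits), torch.argmax(probs))  # both tensor(3)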

Have a great weekend!


Understood. Thanks for your explanation, @BobMcDear.
