Lesson 12 NLP Deep Dive

How is the final layer (self.h_o) in the model below predicting the next word? Don't we need an activation function like softmax to get the probabilities?

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)  # input to hidden: token embedding
        self.h_h = nn.Linear(n_hidden, n_hidden)     # hidden to hidden
        self.h_o = nn.Linear(n_hidden, vocab_sz)     # hidden to output: one score per vocab word

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

Hello,

You are right that softmax needs to be applied to the model's output to turn it into probabilities. However, if you take a look at the training code,

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

The loss function is F.cross_entropy, which combines log softmax (used in lieu of a regular softmax for better numerical stability) and negative log-likelihood loss, so your model does not need a final softmax layer during training.
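As a quick sanity check, here is a minimal sketch (with made-up logits and targets, not the lesson's data) showing that F.cross_entropy gives the same result as applying F.log_softmax followed by F.nll_loss:

import torch
import torch.nn.functional as F

# Hypothetical raw outputs for a batch of 2 examples over a 5-word vocab.
logits = torch.randn(2, 5)
targets = torch.tensor([1, 3])

# cross_entropy = log_softmax + negative log likelihood, done in one call.
loss_a = F.cross_entropy(logits, targets)
loss_b = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(loss_a, loss_b))  # True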

For inference, the code in the lesson is,

n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

There's no softmax, but that is because we're interested merely in the hard prediction (i.e. the category), not the probabilities. Since softmax does not change the ordering (the highest value before softmax is still the highest after softmax, the second highest before is still the second highest after, and so on), we may directly take the argmax of the model's output.
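To illustrate that ordering is preserved, here is a tiny sketch with made-up logits showing that the argmax is the same before and after softmax:

import torch
import torch.nn.functional as F

# Hypothetical raw scores (logits) the model might output for one input.
logits = torch.tensor([2.0, -1.0, 0.5, 3.2])
probs = F.softmax(logits, dim=-1)

# softmax is monotonic, so the index of the largest value is unchanged.
print(torch.argmax(logits), torch.argmax(probs))  # both tensor(3)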

Have a great weekend!


Understood. Thanks for your explanation, @BobMcDear.
