Basic RNN implementation

Hi All

I have a question about Model0 in the 6-rnn-english-numbers notebook (https://github.com/fastai/course-nlp/blob/master/6-rnn-english-numbers.ipynb). I couldn't find a category for course-nlp, so I used the Part 2 (2019) category.

In the notebook, the Model0 class is defined like this:

class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow; nv=40; nh=64
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x): # x.shape (64, 3)
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h = h + self.i_h(x[:,1])
            h = self.bn(F.relu(self.h_h(h)))
        if x.shape[1]>2:
            h = h + self.i_h(x[:,2])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)

Based on the diagram on slide 5 of https://github.com/fastai/course-nlp/blob/master/RNNs.pptx, shouldn't the model class instead be defined as:

class Model00(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv,nh)  # green arrow; nv=40; nh=64
        self.h_h = nn.Linear(nh,nh)     # brown arrow
        self.h_o = nn.Linear(nh,nv)     # blue arrow
        self.bn = nn.BatchNorm1d(nh)
        
    def forward(self, x): # x.shape (64, 3)
        h = self.bn(F.relu(self.i_h(x[:,0])))
        if x.shape[1]>1:
            h = self.h_h(h) + self.i_h(x[:,1]) # brown arrow + green arrow for word 2
            h = self.bn(F.relu(h))
        if x.shape[1]>2:
            h = self.h_h(h) + self.i_h(x[:,2]) # brown arrow + green arrow for word 3
            h = self.bn(F.relu(h))
        return self.h_o(h)

What am I missing here or not understanding correctly?

I found this strange too when I saw it, and I agree with your adaptation of the code. I thought that in a standard RNN you get the new hidden state via h = self.h_h(h) + self.i_h(x[:,i]) (followed by the activation function). In that case, the weight matrix in self.h_h would learn how to transform the hidden state from one time step to the next given only the previous hidden state.
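
Written out for an arbitrary sequence length, that update rule would look something like this (just a sketch to illustrate the idea, not code from the notebook; LoopModel00 is a made-up name):

import torch.nn as nn
import torch.nn.functional as F

class LoopModel00(nn.Module):
    # hypothetical generalization of Model00 to any sequence length
    def __init__(self, nv=40, nh=64):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)  # green arrow
        self.h_h = nn.Linear(nh, nh)     # brown arrow
        self.h_o = nn.Linear(nh, nv)     # blue arrow
        self.bn  = nn.BatchNorm1d(nh)

    def forward(self, x):                # x.shape (bs, seq_len)
        h = self.bn(F.relu(self.i_h(x[:, 0])))
        for i in range(1, x.shape[1]):
            # new hidden state from the previous hidden state (through h_h)
            # plus the embedding of the current token
            h = self.h_h(h) + self.i_h(x[:, i])
            h = self.bn(F.relu(h))
        return self.h_o(h)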

In the other case, where you first add the previous hidden state to the embedding produced by self.i_h and only then multiply the result by self.h_h, you learn a function that modifies the hidden state based on both the previous hidden state and the embedding of the current time step.
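
Since self.h_h is an affine layer, the two orderings differ only in whether the current embedding also gets multiplied by the hidden-to-hidden weights. A quick check with made-up tensors (not from the notebook):

import torch
import torch.nn as nn

nh = 64
h_h = nn.Linear(nh, nh)
h, e = torch.randn(1, nh), torch.randn(1, nh)

pre0  = h_h(h + e)   # Model0 ordering:  W(h + e) + b  -> embedding also goes through W
pre00 = h_h(h) + e   # Model00 ordering: W h + b + e   -> embedding added directly
print(torch.allclose(pre0, h_h(h) + h_h(e) - h_h.bias))  # True: W(h+e)+b == (Wh+b) + We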

Maybe it doesn't matter much in practice which way is used. Did you observe any difference in model performance?

The model performance is similar in both cases.
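
For anyone who wants to reproduce the comparison, here is a rough sketch of how you could train both models with plain PyTorch and compare accuracy. The names train_dl and valid_dl stand in for whatever DataLoaders you built in the notebook (which actually trains with a fastai Learner), and the hyperparameters are arbitrary:

import torch
import torch.nn.functional as F

def train(model, train_dl, epochs=3, lr=1e-3):
    # plain training loop, just for a self-contained comparison
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for xb, yb in train_dl:
            loss = F.cross_entropy(model(xb), yb)  # assumes yb is one target token per sequence
            loss.backward()
            opt.step()
            opt.zero_grad()
    return model

def accuracy(model, valid_dl):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for xb, yb in valid_dl:
            correct += (model(xb).argmax(dim=-1) == yb).sum().item()
            total   += yb.numel()
    return correct / total

# acc0  = accuracy(train(Model0(),  train_dl), valid_dl)
# acc00 = accuracy(train(Model00(), train_dl), valid_dl)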