I was going through the LMModel classes and had a few questions about how the hidden states are initialized.

```
## Simplified language model (LMModel4 from the chapter)
## Assumes the chapter's setup (`from fastai.text.all import *`, which provides
## Module, nn, F and torch) and a global seq_length (= 3 here).
class LMModel4(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = 0

    def forward(self, x):
        outs = []
        ## Repeat for each of the seq_length positions
        for i in range(seq_length):
            self.h = self.h + self.i_h(x[:, i])
            self.h = self.h_h(self.h)      # (bs, n_hidden) -> (bs, n_hidden)
            self.h = F.relu(self.h)        # activation function
            outs.append(self.h_o(self.h))  # append the output at each step
        # Note: h is detached only after the outputs have been computed, so for the
        # current batch gradients still flow through all seq_length steps.
        self.h = self.h.detach()  # keep only the value of h, dropping its gradient
                                  # history (it then acts like a fixed input for the next batch)
        return torch.stack(outs, dim=1)

    def reset(self):
        self.h = 0
```

and

```
## Multilayered language model (LMModel5 from the chapter)
## Assumes bs (the batch size) is a global, as in the chapter.
class LMModel5(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h = torch.zeros(n_layers, bs, n_hidden)  # one hidden state per layer

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)
        self.h = h.detach()   # same idea as LMModel4: keep the value, drop the history
        return self.h_o(res)

    def reset(self):
        self.h.zero_()  # reset in place, keeping the (n_layers, bs, n_hidden) shape
```

In LMModel4, I understand that self.h is reset to 0 at the start of every epoch (via a callback), but within the seq_length loop its value is carried over and used by the next step.
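
To make that concrete, here is a minimal sketch of how I understand the state being carried between forward calls (the vocabulary, batch and hidden sizes here are made up just for illustration):

```
seq_length = 3                               # assumed global, as in the chapter
model = LMModel4(vocab_sz=30, n_hidden=64)
xb = torch.randint(0, 30, (16, seq_length))  # a fake batch of 16 sequences

out1 = model(xb)               # first batch: h starts from 0
print(model.h.requires_grad)   # False -> h was detached at the end of forward
out2 = model(xb)               # second batch starts from the carried-over (detached) h
model.reset()                  # the callback calls this to set h back to 0
```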

In LMModel5, I see that self.h is defined with shape (n_layers, bs, n_hidden). Does this mean the hidden state for each layer and each batch element is stored here?

Why is this needed? Can't we just do what we did in LMModel4 and simply assign h = 0 (a scalar)? Is it because of the way nn.RNN works?
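
For context, here is a minimal sketch (again with made-up sizes) of the shapes nn.RNN works with, based on its documentation:

```
import torch
import torch.nn as nn

bs, seq_length, n_hidden, n_layers = 16, 3, 64, 2   # made-up sizes

rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
x = torch.randn(bs, seq_length, n_hidden)    # already-embedded input
h0 = torch.zeros(n_layers, bs, n_hidden)     # one initial hidden state per layer

res, hn = rnn(x, h0)      # passing None instead of h0 defaults to zeros
print(res.shape)          # torch.Size([16, 3, 64]) -> outputs of the last layer
print(hn.shape)           # torch.Size([2, 16, 64]) -> final hidden state of each layer
```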