Hidden states in RNNs/LSTMs

Hi everyone.

In the 10_nlp notebook we batch the texts for our language model. Before that, we concatenate all the texts together (see the “Putting Our Texts into Batches for a Language Model” section). When should we zero our hidden states? In the 12_nlp_dive notebook it happens after each epoch. But would it be better to do it after each sentence within each mini-batch? Here is an example with bs=2 and seq_len=5.
First batch:
[bos, 1, 2, 3, 4]
[a, b, c, eos, bos] ← do h=0 after eos
Second batch:
[5, eos, bos, 6, 7] ← do h=0 after eos
[d, e, f, g, eos] ← do h=0 after eos

Why don’t we zero the hidden states after each eos token? These are independent sentences (the movie reviews), so what’s the point of carrying the hidden states all the way to the end? I can’t find any explanation of this at all.

Here I’m only talking about language models (perhaps masked language models as well).

Also, it makes sense to zero the hidden states after each independent review, because we will never see multiple reviews in one sequence during evaluation anyway (if we generated eos we would stop).
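
To make the idea concrete, here is a rough sketch (not code from the book; the class name and attribute names are made up) of a stateful RNN language model in the style of the 12_nlp_dive models, with one extra step that zeroes the hidden state of any sequence right after it emits eos:

```python
import torch
import torch.nn as nn

class LMModelEosReset(nn.Module):
    "Hypothetical sketch: a stateful RNN LM that also zeroes h right after eos."
    def __init__(self, vocab_sz, n_hidden, eos_idx):
        super().__init__()
        self.i_h = nn.Embedding(vocab_sz, n_hidden)   # input -> hidden
        self.rnn = nn.RNNCell(n_hidden, n_hidden)     # one RNN step
        self.h_o = nn.Linear(n_hidden, vocab_sz)      # hidden -> output
        self.eos_idx = eos_idx
        self.h = None                                 # state carried across batches

    def reset(self):
        # what the book's scheme does once per epoch (via a callback)
        self.h = None

    def forward(self, x):                             # x: (bs, seq_len) token ids
        bs, seq_len = x.shape
        if self.h is None:
            self.h = x.new_zeros(bs, self.rnn.hidden_size, dtype=torch.float)
        outs = []
        for t in range(seq_len):
            self.h = self.rnn(self.i_h(x[:, t]), self.h)
            outs.append(self.h_o(self.h))
            # the extra step: zero the state of any row that just produced eos,
            # so the next review in that row starts from scratch
            just_eos = (x[:, t] == self.eos_idx).unsqueeze(1).float()
            self.h = self.h * (1 - just_eos)
        self.h = self.h.detach()                      # truncated BPTT, as in the book
        return torch.stack(outs, dim=1)               # (bs, seq_len, vocab_sz)

model = LMModelEosReset(vocab_sz=100, n_hidden=64, eos_idx=2)
out = model(torch.randint(0, 100, (2, 5)))            # bs=2, seq_len=5, as in the example above
```

The per-epoch reset() call still works exactly as before; the masking inside forward just stops state from one review leaking into the next.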

Any help would be greatly appreciated.


Hi Rob
The hidden states are meant to remember previous objects in the sentence.
The film, which my friend, who recommended it after reading the book which I originally recommended, said I would enjoy, was actually awful: I hated it.

So learning that kind of long-range language structure would not be possible within a single mini-batch, but it might be possible across an epoch. The hidden states end up capturing the style of film-review language rather than the content of individual film reviews.

Regards Conwyn


Thank you for your reply.

If hidden states remember previous objects, how does that help a review that comes after eos? What information can the hidden states transfer if all the reviews are independent anyway? There is no information to carry over from review X to review X+1. Or maybe the model learns in such a way that after each eos token the hidden states end up close to zero anyway?

Are there any links comparing these two approaches, or even a mathematical argument that carrying the state over actually works? It is not obvious to me.
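
One rough way to check the “h goes to zero after eos” idea empirically (just a sketch, assuming a plain stateful model with i_h / rnn attributes like the one I sketched above, trained with per-epoch resets only):

```python
import torch

def hidden_norms_after_eos(model, x, eos_idx):
    "Compare the hidden-state norm right after consuming eos with the norm elsewhere."
    bs, seq_len = x.shape
    h = x.new_zeros(bs, model.rnn.hidden_size, dtype=torch.float)
    after_eos, elsewhere = [], []
    with torch.no_grad():
        for t in range(seq_len):
            h = model.rnn(model.i_h(x[:, t]), h)      # no eos masking here, on purpose
            for b in range(bs):
                (after_eos if x[b, t] == eos_idx else elsewhere).append(h[b].norm().item())
    mean = lambda xs: sum(xs) / len(xs) if xs else float('nan')
    return mean(after_eos), mean(elsewhere)
```

If a trained model really did learn to “self-reset”, the first number should come out much smaller than the second.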

Once again, thank you for your reply.

Hi Rob
I think there are two things going on here. A language model predicts the next word from the previous words, and that language model is then reused as the basis of a classifier. So although a review of Bambi is different from a review of Star Wars, the language is similar. For example, w1 w2 w3 w4 might mean “bad movie” while w1 w2 w3 w5 means “good movie”. So we learn the language model, hidden states and all, and then it becomes the classifier.
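
Roughly the flow from the 10_nlp notebook, from memory (so treat the exact arguments, paths and metrics as approximate): fine-tune the language model first, then reuse its encoder, recurrent hidden states included, as the body of the classifier.

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)

# 1) Fine-tune a language model on the reviews
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1)
learn_lm.save_encoder('finetuned_enc')

# 2) Reuse the fine-tuned encoder as the body of a sentiment classifier
dls_clas = TextDataLoaders.from_folder(path, valid='test', text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned_enc')
learn_clas.fine_tune(1)
```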
Regards Conwyn
