In the 10_nlp notebook we batch the texts for our language model. Before that, we concatenate all the texts together (see the “Putting Our Texts into Batches for a Language Model” section). When should we zero our hidden states? In the 12_nlp_dive notebook it happens after each epoch, but would it be better to do it after each sentence within each mini-batch? Here is an example with bs=2 and seq_len=5:
```
[bos, 1, 2, 3, 4]
[a, b, c, eos, bos]   ← do h=0 after eos

[5, eos, bos, 6, 7]   ← do h=0 after eos
[d, e, f, g, eos]     ← do h=0 after eos
```
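For reference, here’s a minimal sketch of how I understand the batching (the token stream and names are made up, just to reproduce the layout above):

```python
# Hypothetical token stream (bos/eos/letters stand in for real vocab ids),
# arranged so that splitting it reproduces the layout above.
stream = ["bos", 1, 2, 3, 4, 5, "eos", "bos", 6, 7,
          "a", "b", "c", "eos", "bos", "d", "e", "f", "g", "eos"]

bs, seq_len = 2, 5
n = len(stream) // bs                                  # tokens per row
rows = [stream[i*n:(i+1)*n] for i in range(bs)]        # one long row per batch index

# Successive mini-batches take consecutive seq_len-wide slices of every row,
# so the hidden state at the end of batch k lines up with the start of batch k+1.
batches = [[row[j:j+seq_len] for row in rows] for j in range(0, n, seq_len)]
for k, b in enumerate(batches):
    print(f"batch {k}:", *b, sep="\n  ")
```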
Why don’t we zero the hidden states after each eos token? These are independent sentences (I’m talking about the movie reviews) anyway, so what’s the point of carrying the hidden state all the way to the end? I can’t find any explanation of this.
Here I’m only talking about LM (perhaps MLM as well).
Also, zeroing the hidden state after each independent review seems to make sense because we won’t encounter multiple reviews during evaluation anyway (once we generate eos we stop).
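For concreteness, here’s a rough sketch of what I mean (toy PyTorch, not fastai’s actual LMModel; the module name and eos_idx are made up):

```python
import torch
import torch.nn as nn

class LMWithPerEOSReset(nn.Module):
    """Toy LSTM language model that zeroes the hidden state of a sequence
    right after it sees eos, instead of only resetting once per epoch."""
    def __init__(self, vocab_sz, n_hidden, eos_idx):
        super().__init__()
        self.eos_idx = eos_idx
        self.emb = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, vocab_sz)
        self.h = None                                    # (h, c) carried across batches

    def reset(self):                                     # what 12_nlp_dive does once per epoch
        self.h = None

    def forward(self, x):                                # x: (bs, seq_len) of token ids
        bs, seq_len = x.shape
        if self.h is None:
            h = x.new_zeros(1, bs, self.rnn.hidden_size, dtype=torch.float)
            self.h = (h, h.clone())
        outs = []
        for t in range(seq_len):                         # step token by token so we can reset mid-batch
            o, self.h = self.rnn(self.emb(x[:, t:t+1]), self.h)
            outs.append(self.out(o))
            keep = (x[:, t] != self.eos_idx).float().view(1, bs, 1)
            self.h = (self.h[0] * keep, self.h[1] * keep)  # zero the state where the token was eos
        self.h = (self.h[0].detach(), self.h[1].detach())  # truncated BPTT across batches
        return torch.cat(outs, dim=1)
```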
Any help would be greatly appreciated.