I have a question about the modus operandi of ULMFiT. Imagine we have the sentence "I want to learn". First of all we embed the words and obtain a matrix of shape (4, 400). There are 3 stacked LSTM layers. Do I understand correctly that on the first step the word "I" is processed by the first LSTM, producing hidden state h_1_1 (first word, first LSTM layer), then the remaining words are processed by the first LSTM, so that we end up with 4 tensors of shape (1, 1150), which we then feed consecutively into the next LSTM layer, and so on? Or is only the last hidden state of each LSTM fed to the next one?
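To make my question concrete, here is a minimal PyTorch sketch of how I currently picture it, using the AWD-LSTM dimensions mentioned above (400-d embeddings, 1150-d hidden states); the variable names are mine, not fastai's:

```python
import torch
import torch.nn as nn

seq_len, emb_dim, hid_dim = 4, 400, 1150   # "I want to learn" -> 4 tokens

emb = torch.randn(seq_len, 1, emb_dim)     # (seq_len, batch=1, 400)

lstm1 = nn.LSTM(emb_dim, hid_dim)
lstm2 = nn.LSTM(hid_dim, hid_dim)

# out1 holds the hidden states for ALL 4 time steps, shape (4, 1, 1150);
# h1 is only the final hidden state, shape (1, 1, 1150)
out1, (h1, c1) = lstm1(emb)

# Here the FULL sequence out1 is fed into the second layer,
# not just the last hidden state h1 -- is that the right picture?
out2, (h2, c2) = lstm2(out1)

print(out1.shape, out2.shape)
```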

And at the end of the third LSTM, do we take only the last hidden state? That is obvious for the LM, but not quite apparent for the classifier: do the max-pooled and mean-pooled vectors indeed perform better than just the `n` hidden states?
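For reference, my understanding of the classifier head is the "concat pooling" from the ULMFiT paper, hc = [h_T, maxpool(H), meanpool(H)]; a rough sketch (shapes assumed, not taken from the fastai source):

```python
import torch

seq_len, hid_dim = 4, 1150
H = torch.randn(seq_len, hid_dim)    # all hidden states of the last LSTM

last = H[-1]                         # h_T, shape (1150,)
maxp = H.max(dim=0).values           # element-wise max over time steps
meanp = H.mean(dim=0)                # element-wise mean over time steps

hc = torch.cat([last, maxp, meanp])  # shape (3 * 1150,) = (3450,)
print(hc.shape)
```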

Thank you in advance!