I have been playing around with ULMFiT a lot lately and still cannot wrap my head around how the language model's ability to make sound predictions about the next word helps with classifying texts. I guess my real problem is that I do not understand what is happening at the low level of the network. So correct me if I am wrong, but the procedure goes something like this:
- The language model gets pre-trained and then fine-tuned. This part seems clear to me: based on the current and preceding words, the model forms a probability distribution over the next word. (I tried to capture this in the first sketch below.)
- Then the softmax layer that produces this probability distribution is stripped from the model.
- You add a classifier head consisting of a ReLU layer (what is this layer actually doing?) and another softmax layer that outputs the probability of class membership for a given text document. Here a lot of things are unclear to me: How is the text document taken in and processed? Word by word, I assume? And how do you end up with a single prediction at the end? Is it averaged over all words? (The second sketch below shows my best guess.)
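
To make sure I am not fooling myself, here is the language-model part as I picture it, in a minimal PyTorch sketch. All the sizes and the single-layer LSTM are simplifications on my part; as far as I know the real ULMFiT encoder is a three-layer AWD-LSTM, but I hope the idea is the same:

```python
import torch
import torch.nn as nn

# Toy sizes, made up for illustration
vocab_size, emb_size, hidden_size = 10_000, 400, 1150

class ToyLM(nn.Module):
    """A toy language model: embedding + LSTM encoder + linear decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.encoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)  # the LM head

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        hidden, _ = self.encoder(self.embed(tokens))  # (batch, seq_len, hidden)
        logits = self.decoder(hidden)                 # (batch, seq_len, vocab)
        # Softmax over the vocabulary gives, at every position, a
        # probability distribution over the *next* word.
        return logits.softmax(dim=-1)
```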
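And here is my guess at the classification part, continuing the sketch above. I have used a plain mean over all time steps for the pooling, but that is exactly the part I am unsure about: whether it is the average, the last hidden state, or something else entirely:

```python
n_classes = 2  # e.g. positive / negative sentiment

class ToyClassifier(nn.Module):
    """My guess at the head swap: keep the encoder, drop the LM decoder,
    and bolt on a small Linear -> ReLU -> Linear classifier head."""
    def __init__(self, pretrained_lm: ToyLM):
        super().__init__()
        self.embed = pretrained_lm.embed      # reuse the pretrained weights
        self.encoder = pretrained_lm.encoder  # the LM decoder is simply dropped
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 50),
            nn.ReLU(),                        # the ReLU layer I asked about
            nn.Linear(50, n_classes),
        )

    def forward(self, tokens):  # the whole document as one token sequence
        hidden, _ = self.encoder(self.embed(tokens))  # processed word by word
        doc_vec = hidden.mean(dim=1)      # average over all words (my guess!)
        return self.head(doc_vec).softmax(dim=-1)     # P(class | document)

doc = torch.randint(0, vocab_size, (1, 120))  # one fake 120-token document
probs = ToyClassifier(ToyLM())(doc)           # shape (1, n_classes)
```

If someone could tell me where this mental model goes wrong, especially the pooling step, that would already help a lot.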
Hmm, as you can see, I am very confused. I hope you can help me understand ULMFiT better! Thanks in advance!