ULMFiT: How does the language model actually help with text categorization?


Hello there,
I have been playing around with Ulmfit a lot lately and still cannot wrap my head around how the language model’s ability to make sound predictions about the next word affects the classification of texts. I guess my real problem is that I do not understand what is happening at the low level of the network. So correct me if I am wrong but the procedure is like this right (?):

  1. The language model gets pre-trained and then fine-tuned. This part seems clear to me: based on the current and preceding words, the model forms a probability distribution over the next word.
  2. Then the model is stripped of the softmax layer that produced the probability distribution over the vocabulary.
  3. You add a classifier head consisting of a ReLU layer (what is this layer actually doing?) and another softmax layer that outputs the probability of class membership for a given text document. Here there are a lot of things I do not understand: how is the text document taken in and processed? Word for word, I assume? And how do you end up with a single prediction at the end? Is it averaged over all words?
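To make step 3 concrete, here is a rough NumPy sketch of how I currently picture the classifier head. The layer sizes are made up for illustration; the pooling scheme ("concat pooling": concatenating the last hidden state with a max-pool and a mean-pool over all time steps) is what the ULMFiT paper describes, so the answer to my averaging question may be "partly, but not only":

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a document of 5 words, each already encoded by the
# fine-tuned language-model encoder into a 4-dim hidden state. The real
# ULMFiT encoder is a 3-layer AWD-LSTM with much larger dimensions; the
# tiny sizes here are only for illustration.
seq_len, hidden, n_classes = 5, 4, 2
H = rng.normal(size=(seq_len, hidden))  # one hidden state per word

# Concat pooling: the head does not simply average. It concatenates the
# last hidden state with a max-pool and a mean-pool over all time steps,
# so information from every word can contribute to the document vector.
pooled = np.concatenate([H[-1], H.max(axis=0), H.mean(axis=0)])  # (3*hidden,)

# First linear block followed by ReLU. The ReLU just zeroes out negative
# activations, giving the head a non-linearity between its linear layers.
W1 = rng.normal(size=(3 * hidden, 8))
b1 = np.zeros(8)
a = np.maximum(pooled @ W1 + b1, 0.0)

# Second linear layer plus softmax over the document classes.
W2 = rng.normal(size=(8, n_classes))
b2 = np.zeros(n_classes)
logits = a @ W2 + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)  # class-membership probabilities for the document
```

So, as far as I can tell, the document is read word for word by the encoder, and pooling (not just averaging) collapses the per-word states into one vector before the classifier layers. Is that the right picture?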

Hmm, you can see I am very confused. I hope you can help me understand ULMFiT better! Thanks in advance!