Preliminary guess without looking at the source code (so what I'm saying next could be wrong).
In classification the texts can be pretty long, so you need a MultiBatchEncoder, otherwise they don't fit in memory (you'll probably have to check the source code to understand exactly what MultiBatchEncoder does, but in short it runs the sequence through the encoder chunk by chunk, connecting the chunks through the hidden state). In a language model things aren't that long: the data is concatenated and cut into nice pieces depending on the batch size (bs) and bptt, and it just works.
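To make the chunk-by-chunk idea concrete, here's a minimal sketch. This is not fastai's actual MultiBatchEncoder (which wraps an RNN encoder and truncates gradients); `toy_encoder` is a hypothetical stand-in whose "hidden state" is just a running sum, so you can see the state flowing between chunks:

```python
def toy_encoder(chunk, hidden):
    """Process one chunk of tokens, carrying `hidden` across chunks."""
    outputs = []
    for tok in chunk:
        hidden = hidden + tok          # update the carried state
        outputs.append(hidden)         # one output per input token
    return outputs, hidden

def multi_batch_encode(tokens, bptt):
    """Split a long sequence into bptt-sized chunks, encode them in order."""
    hidden = 0                         # initial state
    all_outputs = []
    for i in range(0, len(tokens), bptt):
        chunk = tokens[i:i + bptt]     # only bptt tokens processed at a time
        out, hidden = toy_encoder(chunk, hidden)  # hidden links the chunks
        all_outputs.extend(out)
    return all_outputs

print(multi_batch_encode([1, 2, 3, 4, 5], bptt=2))  # [1, 3, 6, 10, 15]
```

Because the state is carried over, encoding in bptt-sized chunks gives the same outputs as encoding the whole sequence at once, which is the whole point: memory use depends on bptt, not on the full text length.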