Multi-task fine-tuning for ULM-FiT

Has anyone experimented with multi-task fine-tuning for ULM-FiT? The idea is to use two separate heads (an LM head and a classifier head) and minimize the loss for both tasks simultaneously. Apparently this improves results over training just a classifier head, but most results I've seen use transformers, so I'm curious whether anyone has tried it with AWD-LSTM (rough sketch of what I mean below).
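
To make the question concrete, here is a minimal PyTorch sketch of the kind of joint objective I have in mind: a shared encoder with an LM head and a classifier head, and a weighted sum of the two losses. This is just an illustration, not the paper's actual setup; the module names, the plain `nn.LSTM` stand-in for AWD-LSTM, and the `lm_weight` hyperparameter are all my own placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    """Shared encoder feeding a language-model head and a classifier head."""
    def __init__(self, vocab_size, emb_dim=400, hidden_dim=1152, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Plain LSTM as a stand-in for the AWD-LSTM encoder (no weight drop etc.)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=3, batch_first=True)
        self.lm_head = nn.Linear(hidden_dim, vocab_size)   # next-token prediction
        self.clf_head = nn.Linear(hidden_dim, n_classes)   # document classification

    def forward(self, tokens):
        hidden, _ = self.encoder(self.embedding(tokens))   # (batch, seq, hidden)
        lm_logits = self.lm_head(hidden)                   # per-token vocab logits
        clf_logits = self.clf_head(hidden[:, -1])          # last hidden state -> class
        return lm_logits, clf_logits

def joint_loss(lm_logits, clf_logits, tokens, labels, lm_weight=0.5):
    """Classifier loss plus a weighted next-token LM loss."""
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),  # predict token t+1 from t
        tokens[:, 1:].reshape(-1),
    )
    clf_loss = F.cross_entropy(clf_logits, labels)
    return clf_loss + lm_weight * lm_loss

# Toy usage: one backward pass through both objectives at once
model = MultiTaskModel(vocab_size=1000)
tokens = torch.randint(0, 1000, (8, 20))   # batch of 8 sequences, length 20
labels = torch.randint(0, 2, (8,))
lm_logits, clf_logits = model(tokens)
loss = joint_loss(lm_logits, clf_logits, tokens, labels)
loss.backward()
```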

The paper I found is here: https://arxiv.org/pdf/1905.05583.pdf