LM pretraining on a large dataset, then fine-tuning for LM on a smaller dataset?

Let’s say I want to do unconditional text generation with an LM trained on a small specialized dataset (e.g. texts collected from physiology papers). Since that dataset is too small on its own, I’d like to first pretrain the LM on a large dataset like 1BLM and then fine-tune it on the specialized corpus. Has this kind of fine-tuning, where the downstream task is itself language modeling on a smaller dataset, been studied before? Could you list some papers? As far as I’m aware, the recent fine-tuning papers such as ULMFiT and BERT discuss applications to tasks other than LM / text generation.