BERT losses applied to other architectures?

The main contributions of the BERT paper are its new pretraining losses: the masked language model (MLM) and next-sentence prediction (NSP). Has anyone tried applying the same losses to a different architecture, e.g. a BiLSTM?
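To make the question concrete, here is a minimal sketch of what the masked-LM objective would look like on top of a BiLSTM encoder instead of a Transformer. This is purely illustrative, not from the BERT paper: the model, sizes, mask rate, and token ids are all assumptions, and it omits BERT's 80/10/10 mask-replacement scheme and the NSP loss.

```python
import torch
import torch.nn as nn

# Illustrative constants -- all assumed, not from the BERT paper
VOCAB = 100      # toy vocabulary size
MASK_ID = 1      # assumed id of the [MASK] token
HIDDEN = 32      # toy hidden size

class BiLSTMMaskedLM(nn.Module):
    """BiLSTM encoder with a token-prediction head (hypothetical sketch)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True,
                            bidirectional=True)
        # Concatenated forward/backward states -> vocabulary logits
        self.head = nn.Linear(2 * HIDDEN, VOCAB)

    def forward(self, ids):
        out, _ = self.lstm(self.emb(ids))
        return self.head(out)  # (batch, seq, vocab) logits

torch.manual_seed(0)
model = BiLSTMMaskedLM()
tokens = torch.randint(2, VOCAB, (4, 10))   # fake batch of token ids
masked = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15      # BERT masks ~15% of tokens
mask[0, 0] = True                           # ensure at least one masked slot
masked[mask] = MASK_ID

logits = model(masked)
# As in the MLM objective: cross-entropy only over masked positions,
# predicting the original tokens from bidirectional context
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```

One caveat worth noting: a BiLSTM sees both directions natively, so unlike ELMo-style left-to-right LMs it can use this objective without "seeing itself", which is exactly the problem the [MASK] trick solves.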