BERT losses applied to other architectures?

The main contributions of the BERT paper are its new pretraining losses: the masked language model (MLM) and next-sentence prediction (NSP). Has anyone tried applying the same losses to a different architecture, e.g. a BiLSTM?
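To make the question concrete, here is a minimal sketch of what the masked-LM objective would look like on top of a BiLSTM encoder instead of a Transformer. This is purely illustrative, not from the BERT paper: the model, sizes, mask rate, and token ids are all assumptions, and it omits BERT's 80/10/10 mask-replacement scheme and the NSP loss.

```python
import torch
import torch.nn as nn

# Illustrative constants -- all assumed, not from the BERT paper
VOCAB = 100      # toy vocabulary size
MASK_ID = 1      # assumed id of the [MASK] token
HIDDEN = 32      # toy hidden size

class BiLSTMMaskedLM(nn.Module):
    """BiLSTM encoder with a token-prediction head (hypothetical sketch)."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.lstm = nn.LSTM(HIDDEN, HIDDEN, batch_first=True,
                            bidirectional=True)
        # Concatenated forward/backward states -> vocabulary logits
        self.head = nn.Linear(2 * HIDDEN, VOCAB)

    def forward(self, ids):
        out, _ = self.lstm(self.emb(ids))
        return self.head(out)  # (batch, seq, vocab) logits

torch.manual_seed(0)
model = BiLSTMMaskedLM()
tokens = torch.randint(2, VOCAB, (4, 10))   # fake batch of token ids
masked = tokens.clone()
mask = torch.rand(tokens.shape) < 0.15      # BERT masks ~15% of tokens
mask[0, 0] = True                           # ensure at least one masked slot
masked[mask] = MASK_ID

logits = model(masked)
# As in the MLM objective: cross-entropy only over masked positions,
# predicting the original tokens from bidirectional context
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```

One caveat worth noting: a BiLSTM sees both directions natively, so unlike ELMo-style left-to-right LMs it can use this objective without "seeing itself", which is exactly the problem the [MASK] trick solves.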