Self-supervised learning on speech data


Does anyone have experience training models on self-supervised tasks with speech data? The objective is to learn representations of speech that are helpful for downstream tasks like speaker identification and speech emotion classification.

We can already see successful and useful examples of this approach in NLP (e.g. ULMFiT, which trains a model to predict the next word in a sequence).
@jeremy listed some self-supervised tasks in computer vision here:

Does anyone have some ideas or intuitions for such pretext tasks on speech data?

