I'd like to hear some ideas for using NLP on unlabelled data to gain insights into a corpus. I have a dataset with a lot of poorly transcribed words, and my thought is to train a language model on that text and then analyze the word embeddings. I think this will show me which words typically go together, and hopefully it will also help me clean the data. I haven't tested this yet, but that's my thinking. What else can I do with a large amount of unlabelled text that might be interesting to look at?
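To make the "words that typically go together" idea concrete without training anything, here is a minimal sketch using pointwise mutual information (PMI) over sentence-level co-occurrence counts. The tiny corpus is made up for illustration (note the deliberate misspelling "pateint" next to "patient" — spelling variants tend to share co-occurrence neighbours, which is one handle for cleaning):

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration; note "pateint" vs "patient".
corpus = [
    "the pateint was admitted to the hospital",
    "the patient was discharged from the hospital",
    "the pateint reported chest pain",
    "the patient reported chest pain",
]

n = len(corpus)
doc_freq = Counter()   # in how many sentences each word appears
pair_freq = Counter()  # in how many sentences each unordered pair co-occurs
for sentence in corpus:
    words = set(sentence.split())
    doc_freq.update(words)
    pair_freq.update(frozenset(p) for p in combinations(words, 2))

def pmi(w1: str, w2: str) -> float:
    """PMI is high when two words co-occur more often than their
    individual frequencies would predict, and ~0 for chance pairings."""
    p1 = doc_freq[w1] / n
    p2 = doc_freq[w2] / n
    p12 = pair_freq[frozenset((w1, w2))] / n
    return math.log2(p12 / (p1 * p2))
```

On this toy data, `pmi("chest", "pain")` comes out at 1.0 (they always appear together), while `pmi("the", "hospital")` is 0.0, since "the" appears everywhere and carries no association. Embeddings trained on a real corpus capture a smoother version of the same signal.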
This seems like an interesting idea: https://arxiv.org/pdf/1904.09675
Could I take my sentences, run them through something like this, and find sentences that mean the same thing?
I found this but haven't read through it carefully yet. It seems to be about text similarity, but you could apply it to different parts of your text and then find the passages that are "grouped" in some way?
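The linked paper (BERTScore) matches sentences using contextual BERT embeddings. As a rough stand-in to illustrate the grouping idea, here is a sketch that scores all sentence pairs with plain bag-of-words cosine similarity and surfaces the most similar pair; the example sentences are made up, and a real pipeline would swap the cosine step for an embedding-based score:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented sentences for illustration; the first two are near-paraphrases.
sentences = [
    "the patient was discharged today",
    "the patient was released today",
    "lab results came back normal",
]

vectors = [Counter(s.split()) for s in sentences]
# Score every pair and keep the highest-scoring one as a paraphrase candidate.
pairs = sorted(
    ((cosine(vectors[i], vectors[j]), i, j)
     for i, j in combinations(range(len(sentences)), 2)),
    reverse=True,
)
best_score, i, j = pairs[0]
```

Here the two discharge/release sentences win with a score of 0.8, despite differing in one word. Bag-of-words misses synonymy ("discharged" vs "released" contribute nothing to the score), which is exactly the gap that contextual-embedding methods like BERTScore close.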