I have a dataset of verbatim comments from an annual survey that has been running for the past 15 years, and I want to apply some kind of NLP to it to identify things like:
1. What is the relative sentiment of the text? How positive or negative?
2. Does the text include a person’s name?
3. What is the nature of the comment? Is it a suggestion, a complaint, nonsense/spam, a threat, etc…? This is probably more of an unsupervised problem that can be solved with embeddings.
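To make point 3 a bit more concrete, here is a rough sketch of the kind of thing I have in mind. It is only an illustration under some loud assumptions: the category seed phrases are my own guesses, `vectorize` is a plain bag-of-words stand-in for a real embedding model, and the approach is nearest-seed matching by cosine similarity rather than true clustering.

```python
import math
import re
from collections import Counter

# Hypothetical seed phrases per category (my own guesses, not derived from the survey).
SEEDS = {
    "suggestion": "you should consider adding could improve would be better if",
    "complaint": "terrible slow broken unhappy disappointed never works",
    "threat": "i will hurt you regret this or else",
}


def vectorize(text):
    """Stand-in for an embedding model: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_category(comment, min_score=0.1):
    """Return the closest seed category, or "unknown" if nothing is similar enough."""
    vec = vectorize(comment)
    scores = {label: cosine(vec, vectorize(seed)) for label, seed in SEEDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "unknown"
```

With real sentence embeddings in place of `vectorize`, the same nearest-seed idea would not depend on exact word overlap, and comments that match no seed (the "unknown" bucket here) could be inspected separately as candidate nonsense/spam. The `min_score` threshold is arbitrary and would need tuning.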
The problem is that none of the data is labeled!
For point 1, we could always get interns to go through and label the data, but I was wondering about alternatives. For point 2, I was thinking of just generating a bunch of random comments with names inserted in them to use as training data.
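For the name-insertion idea, this is a minimal sketch of what I mean. The names, templates, and name-free comments are placeholders I made up for illustration, and the token-level BIO tags (`B-PER`, `I-PER`, `O`) are just one common labeling scheme, not something the survey data dictates.

```python
import random

# Hypothetical names and templates, invented purely for illustration.
NAMES = ["Maria Lopez", "John Smith", "Priya Patel", "Chen Wei"]
TEMPLATES = [
    "{name} was very helpful when I called.",
    "I never heard back from {name} about my request.",
    "Please thank {name} for the quick response.",
]
# Comments without any name, so a model also sees negative examples.
NEGATIVES = [
    "The parking situation needs to improve.",
    "Overall I am satisfied with the service.",
]


def make_example(rng, negative_rate=0.3):
    """Return (tokens, tags), with BIO tags marking the inserted name."""
    if rng.random() < negative_rate:
        tokens = rng.choice(NEGATIVES).split()
        return tokens, ["O"] * len(tokens)

    name_tokens = rng.choice(NAMES).split()
    tokens, tags = [], []
    # Templates keep "{name}" as a standalone whitespace-separated token.
    for word in rng.choice(TEMPLATES).split():
        if word == "{name}":
            tokens.extend(name_tokens)
            tags.extend(["B-PER"] + ["I-PER"] * (len(name_tokens) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


def make_dataset(n, seed=0):
    """Generate n synthetic labeled comments, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]
```

Calling `make_dataset(1000)` would give token/tag pairs that could be fed to a sequence tagger. My worry is that a handful of templates is far less varied than real comments, so I would treat anything trained on this as a starting point and check it against a small hand-labeled sample.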