NLP and sentiment analysis without labels?

I have a dataset of verbatim comments from an annual survey that has been run for the past 15 years, and I want to apply some kind of NLP to it to identify things like:

  • What is the relative sentiment of the text? How positive or negative?

  • Does the text include a person’s name?

  • What is the nature of the comment? Is it a suggestion, a complaint, nonsense/spam, a threat, etc…? This is probably more of an unsupervised problem that can be solved with embeddings.

The problem is that none of the data is labeled!

For point 1, we could always get interns to go through and label the data, but I was wondering about alternatives. For point 2, I was thinking of just creating a bunch of random comments with names inserted in them.

Thoughts? Suggestions?


For the first problem, you can try training a sentiment classification model on a public data set where labels are available, and using it to generate predictions for your data. You won’t have an accuracy metric this way, though, and the model might perform poorly if the domains are very different. So it might be best to just spend a few days labeling randomly selected comments.
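To make the idea concrete, here is a minimal sketch of training on labeled out-of-domain text and then predicting on unlabeled comments. Everything in it is an assumption for illustration: the tiny PUBLIC_LABELED list is a hypothetical stand-in for a real public dataset, the survey comments are made up, and the hand-rolled Naive Bayes is just the simplest thing that runs without extra packages. In practice you would likely swap in a real corpus and a stronger model.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a public labeled corpus (e.g. product reviews).
PUBLIC_LABELED = [
    ("great service and friendly staff", "pos"),
    ("i love the new schedule", "pos"),
    ("helpful and quick response", "pos"),
    ("terrible experience and rude staff", "neg"),
    ("i hate the long wait", "neg"),
    ("slow and unhelpful response", "neg"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label for a multinomial Naive Bayes model."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the most likely label, using add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(PUBLIC_LABELED)

# Made-up examples standing in for the unlabeled survey comments.
survey_comments = [
    "The staff were friendly and helpful",
    "Rude staff and a long wait",
]
predictions = [predict(model, comment) for comment in survey_comments]
print(list(zip(survey_comments, predictions)))
```

Note that words the model never saw during training are simply ignored, which is exactly where a domain mismatch would hurt. Hand-labeling even a small random sample of your own comments would let you measure how well the transferred model actually does.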

For the second problem (detecting names), I would consider using an NLP package such as spaCy to extract entities from the text. It’s not using deep learning, but it might be a simple solution for your needs. This can look as simple as:

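import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def has_name(text):
    """Return True if spaCy tags any entity in the text as a person."""
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            return True  # doc contains a person's name
    return False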

Excellent recommendation.

Never heard of this library but for the problem of pulling verbatims containing individual names, this looks perfect!


@shawn - What about the 3rd point, the nature of the comment? Is it an opinion, the tone of the customer, a suggestion, a complaint, nonsense/spam, or a threat? How can we deal with this, and which library is best for handling it without labels in the customer service area?