What do you guys feel is the best way to label documents for sentiment analysis? And which set of labels should I use: positive, negative, and neutral, or just positive and negative?
I have access to a huge number of social media posts (non-English) through my work, so I've labeled about 10k of them by hand to experiment with fine-tuning a ULMFiT model.
The problem I’m facing is how to split and label the documents.
I've broken each doc up into sentences and labeled each sentence by itself.
The result is about 90% neutral, 8% negative, and 2% positive sentences.
Labeling documents in their entirety, though, results in roughly 75% negative.
Fine-tuning the model on sentences results in about 99.9% of inputs being classified as neutral,
whereas when using entire documents almost every input doc is classified as negative.
What would you recommend?
I had a similar experience when dealing with user-generated text data.
Whether you base your analysis on sentences, paragraphs, or documents really depends on what level you ultimately care about.
In my case I went for the document (or post) level because that reflects the overall sentiment a user expresses in a post. However, posts that include both positive and negative sentences can quite easily confuse the model (and are sometimes not even easy for a human annotator to decide).
If you go for sentence-based prediction, it can often be easier for the model because sentences less often have mixed sentiment. But it also has some downsides, e.g. if two or more sentences belong together you miss that connection. Also, you don't get the overall sentiment of the post as a whole, though you could aggregate your sentence predictions, e.g. by counting how many positive or negative sentences it contains (certainly imperfect but maybe good enough).
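To illustrate the counting idea, here is a minimal sketch of aggregating sentence-level labels into a document-level label by majority vote over the non-neutral sentences (the function name and the fallback-to-neutral rule are my own choices, not an established recipe):

```python
from collections import Counter

def aggregate_doc_sentiment(sentence_labels, neutral_label="neutral"):
    """Derive a document label by majority vote over its sentences,
    ignoring neutral sentences; fall back to neutral if none remain."""
    counts = Counter(l for l in sentence_labels if l != neutral_label)
    if not counts:
        return neutral_label
    # most_common(1) returns the single highest-count (label, count) pair
    return counts.most_common(1)[0][0]

print(aggregate_doc_sentiment(["neutral", "positive", "negative", "positive"]))
# -> positive
print(aggregate_doc_sentiment(["neutral", "neutral"]))
# -> neutral
```

A weighted variant (e.g. using each sentence's predicted probability instead of a hard count) would be a natural next step, but the plain count already gives you a document-level signal to compare against your document-level labels.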
Regarding the number of classes, it depends on what output you want to generate. The easiest setup for the model is definitely to keep the problem binary (pos and neg) and disregard all neutral sentences. That's basically what was done to generate the IMDB dataset, which is why we get such high accuracies there. In practice you need to deal with neutral content though, so that's naturally a harder problem to solve. I even went for 5 classes, like the Yelp dataset, to get a more fine-grained output.
The distribution of classes in your dataset looks highly imbalanced, which is usually a problem for ML models. Unfortunately, not much data augmentation is possible in NLP. What I would do is try to balance the dataset through selective labeling: don't spend more time labeling neutral sentences, since you clearly have enough already. Then you could try to resample your data in some way, e.g. reduce the number of neutrals and increase the number of positives and negatives by simply copying them a few times in your dataset (I think Jeremy mentioned that in one of his lectures).
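The copy-the-minority-classes idea can be sketched like this (a naive random oversampler; the function name and the "match the majority count" target are my own assumptions, and in practice you'd only oversample the training split, never the validation set):

```python
import random
from collections import Counter

def oversample(examples, labels, seed=0):
    """Naively balance a dataset by duplicating randomly chosen
    minority-class examples until every class matches the
    majority-class count. Returns a shuffled list of (x, y) pairs."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out = list(zip(examples, labels))
    for label, n in counts.items():
        pool = [(x, y) for x, y in zip(examples, labels) if y == label]
        # duplicate random members of this class until it reaches `target`
        out += [rng.choice(pool) for _ in range(target - n)]
    rng.shuffle(out)
    return out

data = oversample(["s1", "s2", "s3", "s4"], ["neu", "neu", "neu", "pos"])
# each class now appears 3 times
```

Downsampling the neutrals (randomly dropping them to the minority count) is the mirror image of the same trick and keeps the dataset smaller.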
Thank you Stefan-ai for your very thorough explanation.
The points you are making sound very plausible and thought through.
As you say, labelling at the document level intuitively sounds more appropriate, since it captures the document as a whole rather than each sentence in isolation.
I actually started out that way but ran into the problem of many documents containing equal parts positive and negative sentences.
What do you think about including both the sentences and the entire documents in the dataset?
I mean, feeding the model both pure word-to-word relationships (sentences) and the relationships between sentences (documents).
Could this skew the results in any way?
Does a positively labeled sentence included in a negatively labeled document cancel each other out?
Imagine removing the neutral class and training on only a binary positive-or-negative rating. How would this affect the model?
For example, the document “this is a car” is neither positive nor negative. I guess the model would be forced to classify it as one or the other since it doesn’t know about the “neutral” class.
Does that make sense?
Would we work with some kind of confidence score here to determine whether the prediction is to be trusted?
A prediction with a confidence below, say, 85% could then be categorised as neutral.
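That thresholding idea can be sketched in a few lines (the 0.85 cut-off is arbitrary and would need tuning on held-out data, and the function name is hypothetical):

```python
def label_with_threshold(probs, threshold=0.85):
    """Map class probabilities from a binary pos/neg model to a label;
    any prediction below the confidence threshold becomes 'neutral'."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else "neutral"

print(label_with_threshold({"positive": 0.95, "negative": 0.05}))  # -> positive
print(label_with_threshold({"positive": 0.60, "negative": 0.40}))  # -> neutral
```

One caveat: a binary model's confidence on genuinely neutral text is not guaranteed to be low, so this would need to be validated against held-out neutral examples before trusting it.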
I guess there are ways to have a sentence-based model that also incorporates which sentences belong to the same document, but I don't think it's as simple as including both sentences and docs in the training dataset. One thing that comes to mind is the Stanford Sentiment Treebank: they build up the sentiment of a sentence from the sentiment of individual words and combinations of words. Maybe something analogous could be done to build up the sentiment of docs from the sentiment of individual sentences. But I haven't worked with that. If you come up with something, please let me know.
If you remove the neutral class from your model, it won't be able to handle neutral texts during inference. It's exactly as you say: the model will be forced to classify a text as either positive or negative. So if you need your model to classify texts as neutral, you'll also need to include this class in your training set. I think a simple threshold would not consistently identify neutrals, but you can try it out and see what happens.