Labeling of NLP data: entire docs, paragraphs, or sentences?

Thank you, Stefan-ai, for your very thorough explanation.

The points you make sound very plausible and well thought through.

As you say, labeling at the document level intuitively sounds more appropriate, since it captures the document as a whole rather than individual sentences in isolation.
I actually started out that way, but ran into the problem that many documents contain roughly equal parts positive and negative sentences.

What do you think about including both the individual sentences and the entire documents in the dataset?
I mean feeding the model both pure word-to-word relationships (sentences) and the relationships between sentences (documents).
Could this skew the results in any way?
Would a positively labeled sentence and the negatively labeled document that contains it cancel each other out?
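
To make the question concrete, here is a toy sketch of the kind of mixed-granularity dataset I have in mind. The texts, labels, and structure are purely made up for illustration, not from my actual data:

```python
# Toy sketch of a training set that mixes sentence-level and
# document-level examples. Each example carries its own label,
# regardless of granularity.
mixed_dataset = [
    # sentence-level examples
    {"text": "The battery life is fantastic.", "label": "positive"},
    {"text": "The screen scratches far too easily.", "label": "negative"},
    # document-level example that contains both sentences above,
    # but carries a single overall label
    {"text": "The battery life is fantastic. "
             "The screen scratches far too easily. "
             "Overall I would not buy it again.",
     "label": "negative"},
]

# My worry: the first sentence appears verbatim inside a document with
# the opposite label, so the model sees conflicting signals for very
# similar inputs.
```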

Regarding classes.
Imagine removing the neutral class and training only with a binary positive or negative rating. How would this affect the model?

For example, the document “this is a car” is neither positive nor negative. I guess the model would be forced to classify it as one or the other since it doesn’t know about the “neutral” class.
Does that make sense?

Do we work with some kind of confidence score here to determine whether the prediction can be trusted?
A prediction below an 85% confidence threshold might then be categorized as neutral.
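
Something like this rough sketch is what I'm imagining. The function name, the probability values, and the exact cutoff are hypothetical; the 85% threshold is just the number I mentioned above:

```python
def label_with_neutral(prob_positive: float, threshold: float = 0.85) -> str:
    """Map a binary model's positive-class probability to three labels.

    If neither class reaches the confidence threshold, fall back to
    "neutral" instead of trusting the forced binary choice.
    """
    if prob_positive >= threshold:
        return "positive"
    if (1.0 - prob_positive) >= threshold:
        return "negative"
    return "neutral"


# For "this is a car" a binary model has to pick a side, but its
# confidence will typically hover near 0.5, so the threshold would
# catch it as neutral.
print(label_with_neutral(0.57))  # -> neutral
print(label_with_neutral(0.96))  # -> positive
print(label_with_neutral(0.04))  # -> negative
```

Would post-processing the model's output like this be a reasonable substitute for training with an explicit neutral class, or does it just hide the problem?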