Removing Stopwords hurts Deep Learning model performance

utsav · April 26, 2020, 10:11am

I wanted to know why excluding stopwords hurts a deep learning model. If we remove them from both the train and test data, it should not hurt, right?

poppingtonic · April 26, 2020, 10:36am

Stopword removal is mainly useful for tasks such as searching, filtering, or correcting documents based on a pattern input by the user. In those tasks, stopwords act like noise. For natural language processing in deep learning, we’re trying to learn how the language works from the entire dataset as a sequence, and that includes keeping track of how each word connects to its surrounding words in unstructured text. Training the model only on what the stopword remover leaves behind deletes very important contextual information about how the language itself works. It makes the problem harder, and in a way that your model cannot compensate for, because the missing information is never seen during training.
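To make the lost context concrete, here is a minimal sketch. The stopword list is a hypothetical, hand-picked one for illustration (real lists are much longer, and common ones such as NLTK's English list also contain "not"), and `remove_stopwords` is just a helper name I made up:

```python
# Hypothetical, hand-picked stopword list for illustration only.
# It is an assumption, not the list of any particular library.
STOPWORDS = {"the", "was", "not", "a", "is", "it"}


def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop tokens in STOPWORDS."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


positive = "The movie was good"
negative = "The movie was not good"

# Both sentences collapse to the same token sequence,
# so a sequence model can no longer tell them apart.
print(remove_stopwords(positive))  # ['movie', 'good']
print(remove_stopwords(negative))  # ['movie', 'good']
```

Once the negation is gone from both train and test data, no amount of training can bring it back, which is why removing stopwords consistently on both sides does not help.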

utsav · April 26, 2020, 10:44am

Thanks for your response, Brian. Are there any NLP tasks in which we definitely should not remove stopwords?

poppingtonic · April 26, 2020, 12:45pm

Any neural network-based task will generally perform worse if you remove stopwords. TF-IDF, count vectors, and other similarity measures, on the other hand, tend to perform better with stopwords removed. Here’s more: https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52
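A rough sketch of why count-based similarity benefits from stopword removal, using plain Python instead of a library. The two documents, the tiny stopword list, and the helper names `count_vector` and `cosine` are all assumptions made up for this example:

```python
import math
from collections import Counter

# Hypothetical tiny stopword list, for illustration only.
STOPWORDS = {"the", "is", "on"}


def count_vector(text, drop_stopwords=False):
    """Bag-of-words counts, optionally with stopwords dropped."""
    tokens = text.lower().split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


doc_a = "the cat is on the mat"
doc_b = "the stock is on the rise"

# With stopwords kept, the unrelated documents look similar (about 0.75)
# purely because they share "the", "is" and "on".
sim_with = cosine(count_vector(doc_a), count_vector(doc_b))

# With stopwords dropped, no content words overlap, so the similarity is 0.0.
sim_without = cosine(count_vector(doc_a, True), count_vector(doc_b, True))

print(sim_with, sim_without)
```

Here the shared stopwords dominate the raw counts, so removing them gives a similarity score that better reflects the actual content. TF-IDF weighting already shrinks this effect by down-weighting words that appear in most documents.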