How to know what to clean in NLP?

How do you know what to clean when cleaning text for NLP?

In the toxic competition, I’ve noticed a bunch of kernels using different “cleaning” methods that remove various unwanted characters/expressions from the text before training.

My question is: how do you figure out what should be removed and what should stay? Every corpus is different, so I’m curious what strategies folks employ to preprocess the text in a way that makes it most beneficial for learning.
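For context, the kind of cleaning I mean looks something like the sketch below. This is just an illustrative example of the pattern I keep seeing (lowercasing, dropping URLs, digits, and punctuation), not any particular kernel's code, and it's exactly the kind of function where I don't know how to justify each step:

```python
import re
import string

def clean_text(text):
    """A typical cleaning pipeline seen in kernels (illustrative, not definitive)."""
    text = text.lower()                         # normalize case
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"\d+", " ", text)            # drop numbers
    # drop punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

print(clean_text("Check THIS: https://example.com, 123 times!"))
# → check this times
```

Each of these steps destroys information (case, numbers, punctuation) that might actually carry signal for some tasks, which is exactly why I'm asking how people decide.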


Also, what information (if any) can we extract from HTML tags?
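To make the HTML question concrete: rather than just stripping tags, it seems like some of them carry usable signal (e.g. link targets in `<a href>`). A minimal sketch with the standard library's `html.parser`, assuming we want both the plain text and the link URLs:

```python
from html.parser import HTMLParser

class TagInfoExtractor(HTMLParser):
    """Collects plain text plus attributes that may carry signal (e.g. link targets)."""

    def __init__(self):
        super().__init__()
        self.text_parts = []  # visible text between tags
        self.links = []       # href values from <a> tags

    def handle_starttag(self, tag, attrs):
        # keep link targets instead of discarding them with the markup
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

parser = TagInfoExtractor()
parser.feed('See <a href="https://example.com">this page</a> for more.')
print("".join(parser.text_parts))  # → See this page for more.
print(parser.links)                # → ['https://example.com']
```

Whether things like link URLs, bold/italic emphasis, or tag counts are worth keeping as features is part of what I'm asking.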