NLP preprocessing

Hey guys. I'm struggling to find good resources on text preprocessing/cleaning before modelling. If you know any, please share. I was thinking about:

  • fixing errors in text
  • extracting smileys
  • cleaning garbage
  • advice on what to do with special structures: IP addresses, dates, postal addresses
  • + anything else I have no idea about

I guess it depends a bit on what you want to do.

I've only had a stab at some simple things (like this RNN model on presidential speeches), but generally I would throw a lot of regex at it and then let something like spaCy do the rest (for word-level data).
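A minimal sketch of the regex step, assuming you want to swap special structures for placeholder tokens before tokenizing. The patterns, the placeholder names (`<ip>`, `<date>`, `<smiley>`) and the `normalize` helper are made up for illustration and are deliberately crude; real data will need more patterns.

```python
import re

# Hypothetical patterns; each maps a special structure to a placeholder token.
PATTERNS = [
    # IPv4-like: four groups of 1-3 digits (does not validate the 0-255 range)
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), " <ip> "),
    # ISO-style dates only, e.g. 2018-03-01
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), " <date> "),
    # A few western smileys such as :) ;-) :D, only when standing alone
    (re.compile(r"(?<!\S)[:;=]-?[)(DPp](?!\S)"), " <smiley> "),
]

def normalize(text):
    """Replace special structures with placeholders and collapse whitespace."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Banned 192.168.0.1 on 2018-03-01 :)"))
```

The cleaned string can then go to whatever tokenizer you use. Whether to keep a placeholder or drop the structure entirely depends on whether it carries signal for your task.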

More info would probably help here: what are you trying to build, what do you have as inputs, etc.

Would be cool to hear from someone doing this in a professional setting :wink: But I'm guessing there is no easy rule of thumb to follow here and everything will be situation- and data-specific.

I can only say that it is something on Kaggle; I can't be more specific. I hoped that something like fixing typos would not be task-specific.

Fixing typos is quite a unique requirement and I am not sure it is commonly encountered. I am also not familiar with any tool that would be able to do this for you but my NLP knowledge is limited.

You could cook something up where you check the words against a dictionary and maybe map them to correct words if the count of unique misspellings is low. Not a great solution. I know Ruby libraries often suggest how something should be spelled when you get an error, so maybe this is a direction worth exploring.
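A rough sketch of the dictionary idea using only the standard library, assuming you already have a word list to check against. `build_typo_map`, `max_unknown` and the toy word list are hypothetical names and data for illustration; `difflib.get_close_matches` does the fuzzy lookup.

```python
from difflib import get_close_matches

def build_typo_map(tokens, dictionary, max_unknown=1000, cutoff=0.75):
    """Map out-of-dictionary tokens to their closest dictionary word.

    Returns an empty map when there are too many unique unknown words,
    since fuzzy matching gets slow and unreliable at that point.
    """
    known = set(dictionary)
    unknown = {t for t in tokens if t not in known}
    if len(unknown) > max_unknown:
        return {}
    mapping = {}
    for word in unknown:
        match = get_close_matches(word, dictionary, n=1, cutoff=cutoff)
        if match:
            mapping[word] = match[0]
    return mapping

# Toy data, made up for this example
dictionary = ["the", "quick", "brown", "fox", "jumps"]
tokens = ["the", "quick", "brwon", "fox", "jumsp"]
typo_map = build_typo_map(tokens, dictionary)
fixed = [typo_map.get(t, t) for t in tokens]
print(fixed)
```

This compares every unknown word against the whole dictionary, so it only makes sense while the number of unique misspellings stays small. It will also happily "correct" rare but valid words such as names.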

My best bet, and this is where I would start, would be to mark words that occur fewer than n times as unknown. Torchtext supports this and so does fastai.
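The idea in plain Python, as a sketch rather than the torchtext or fastai API. `replace_rare`, `min_count` and the `<unk>` token are assumed names for illustration.

```python
from collections import Counter

def replace_rare(docs, min_count=2, unk="<unk>"):
    """Replace tokens seen fewer than min_count times across docs with unk."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [tok if counts[tok] >= min_count else unk for tok in doc]
        for doc in docs
    ]

# Toy tokenized corpus; "goood" appears only once, so it becomes <unk>
docs = [["good", "movie"], ["good", "movie"], ["goood", "movie"]]
cleaned = replace_rare(docs, min_count=2)
print(cleaned)
```

Most one-off typos end up in the unknown bucket this way, at the cost of also losing genuinely rare words.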


It looks like fastText might be useful for dealing with typos: it simply has vectors for words with typos. So when you build your embedding matrix, you miss fewer pre-trained vectors. https://www.kaggle.com/mschumacher/using-fasttext-models-for-robust-embeddings
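One way to check how much this helps is to measure how many of your vocabulary words have a pre-trained vector before and after switching embeddings. A small sketch, assuming the vectors are already loaded into a plain dict of word to list of floats; `embedding_coverage` and the toy vectors below are made up for illustration and are not real fastText values.

```python
def embedding_coverage(vocab, pretrained):
    """Return (fraction of vocab with a pre-trained vector, missing words)."""
    missing = [w for w in vocab if w not in pretrained]
    covered = 1 - len(missing) / len(vocab) if vocab else 0.0
    return covered, missing

def build_embedding_matrix(vocab, pretrained, dim):
    """One row per vocab word; words without a vector get a zero row."""
    return [list(pretrained.get(w, [0.0] * dim)) for w in vocab]

# Toy 2-dimensional vectors, invented for this example
pretrained = {"good": [0.1, 0.2], "goood": [0.1, 0.3]}
vocab = ["good", "goood", "moive"]

coverage, missing = embedding_coverage(vocab, pretrained)
matrix = build_embedding_matrix(vocab, pretrained, dim=2)
print(coverage, missing)
```

Running the same check with two different sets of pre-trained vectors shows directly how many misspelled words each one still misses.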


I don’t know how practical it is for your use case, but scraping Google for ‘Did you mean …’ alternatives is a clever idea. https://www.kaggle.com/steubk/fixing-typos


You may want to join this study group. The first online meeting will cover NLP preprocessing. Let’s dig deeper. :slight_smile:
