NLP preprocessing

sermakarevich · February 15, 2018, 5:31pm

Hey guys. I'm struggling to find any good resources about text preprocessing/cleaning before modelling. If you know of any, please share. I was thinking about:

fixing errors in text
extracting smileys
cleaning garbage
advice on what to do with special structures: IPs, dates, addresses
+ anything else I have no idea about

radek · February 15, 2018, 5:51pm

I guess it depends a bit on what you want to do.

I've only had a stab at some simple things (like this RNN model on presidential speeches), but generally I would throw a lot of regex at it and then let something like spaCy do the trick (for word-level data).
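To make the regex part concrete, here is a rough sketch using only the standard library. The patterns and the placeholder tokens (`<url>`, `<ip>`, `<date>`, `<smiley>`) are assumptions made up for illustration, not a standard; they would need tuning to the actual data.

```python
import re

# All patterns and placeholder tokens below are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
SMILEY_RE = re.compile(r"[:;=][-^']?[)(DPp]")
SPACE_RE = re.compile(r"\s+")

def clean(text):
    """Replace special structures with placeholder tokens and tidy whitespace."""
    # URLs go first so later patterns do not fire inside them.
    text = URL_RE.sub(" <url> ", text)
    text = IP_RE.sub(" <ip> ", text)
    text = DATE_RE.sub(" <date> ", text)
    text = SMILEY_RE.sub(" <smiley> ", text)
    # Collapse the extra spaces introduced by the substitutions.
    return SPACE_RE.sub(" ", text).strip()

cleaned = clean("Banned 192.168.0.1 on 2018-02-15 :)")
```

The smiley pattern is deliberately crude and can also fire inside ordinary text (for example "Re:Post"), so it is worth checking what each pattern actually matches on a sample before trusting it. The cleaned string can then go to whatever tokenizer you use.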

More info would probably be helpful here: what are you trying to build, what do you have as inputs, etc.

Would be cool to hear from someone doing this in a professional setting. But I'm guessing there might not be an easy rule of thumb to follow here, and everything will be situation and data specific.

sermakarevich · February 15, 2018, 5:55pm

I can only say that it is something on Kaggle; I can't be more specific. I hoped that something like fixing typos would not be task specific.

radek · February 15, 2018, 9:34pm

Fixing typos is quite a unique requirement and I am not sure it is commonly encountered. I am also not familiar with any tool that would be able to do this for you, but my NLP knowledge is limited.

You could cook something up where you check the words against a dictionary and maybe map them to correct words if the count of unique misspellings is low. Not a great solution. I know Ruby libraries often suggest how something should be spelled if you get an error, so maybe this is a direction worth exploring.
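A minimal sketch of that dictionary idea, assuming Python's standard `difflib` for the "closest word" lookup. The function name `build_corrections`, the tiny word list, and the 0.75 similarity cutoff are all made up for illustration; a real dictionary would be much larger and the cutoff would need tuning.

```python
import difflib
from collections import Counter

def build_corrections(tokens, dictionary, cutoff=0.75):
    """Map each out-of-dictionary token to its closest dictionary word, if any."""
    corrections = {}
    for word in Counter(tokens):  # iterate over unique tokens only
        if word in dictionary:
            continue
        # get_close_matches returns the best matches above the similarity cutoff.
        match = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
        if match:
            corrections[word] = match[0]
    return corrections

# Toy data: a hypothetical dictionary and a sentence with three typos.
dictionary = ["preprocessing", "model", "embedding", "the"]
tokens = "the prepocessing of the modle and the embeding xyzzy".split()

corrections = build_corrections(tokens, dictionary)
fixed = [corrections.get(tok, tok) for tok in tokens]
```

Words with no close match (like "xyzzy" here) are left alone. Since the lookup compares every unknown word against the whole dictionary, this is only practical when the number of unique misspellings is small, and it is worth eyeballing the resulting mapping before applying it.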

My best bet - and this is where I would start - would be to mark words that occur fewer than n times as unknown. Torchtext supports this and so does fastai.
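In plain Python the idea looks roughly like this. This is not the torchtext or fastai API, just a sketch of the same thresholding; the name `replace_rare` and the `<unk>` token are assumptions.

```python
from collections import Counter

def replace_rare(docs, min_freq=2, unk="<unk>"):
    """Replace tokens seen fewer than min_freq times across all docs with unk."""
    counts = Counter(tok for doc in docs for tok in doc)
    return [[tok if counts[tok] >= min_freq else unk for tok in doc]
            for doc in docs]

# Toy example: "the" and its typo "teh" each occur once, so both become <unk>.
docs = [["fix", "the", "typo"], ["fix", "teh", "typo"]]
result = replace_rare(docs, min_freq=2)
```

On a real corpus the counts should come from the training set only, and rare typos then end up sharing one unknown token instead of each getting its own vocabulary entry.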

sermakarevich · February 16, 2018, 9:37am

It looks like fastText might be useful for dealing with typos: it simply has vectors for words with typos. So when you build your embedding matrix, you miss fewer pre-trained vectors. https://www.kaggle.com/mschumacher/using-fasttext-models-for-robust-embeddings
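A quick way to see how many vectors you miss is to collect the misses while building the matrix. This is a sketch under assumptions: `vectors` is a toy dict standing in for whatever pre-trained vectors you have loaded (loading is not shown), and `build_embedding_matrix` is a made-up helper, not part of fastText.

```python
import random

def build_embedding_matrix(vocab, vectors, dim, seed=0):
    """Return (matrix, missing): one row per vocab word, plus the words not found."""
    rng = random.Random(seed)
    matrix, missing = [], []
    for word in vocab:
        vec = vectors.get(word)
        if vec is None:
            # No pre-trained vector: record the miss and use small random values.
            missing.append(word)
            vec = [rng.uniform(-0.05, 0.05) for _ in range(dim)]
        matrix.append(list(vec))
    return matrix, missing

# Toy stand-in for pre-trained vectors that happen to include a common typo.
vectors = {"hello": [0.1, 0.2], "helo": [0.1, 0.19]}
vocab = ["hello", "helo", "hlelo"]

matrix, missing = build_embedding_matrix(vocab, vectors, dim=2)
```

Comparing `len(missing)` across different sets of pre-trained vectors gives a rough measure of which one covers your typos best.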

digitalspecialists · February 16, 2018, 3:49pm

I don’t know how practical it is for your use case, but scraping Google for ‘Did you mean …’ alternatives is a clever idea. https://www.kaggle.com/steubk/fixing-typos

Moody · February 17, 2018, 12:48am

You may want to join this study group. The first online meeting will cover NLP preprocessing. Let’s learn deeper.