Hey guys. I'm struggling to find any good resources on text preprocessing/cleaning before modelling. If you know of any, please share. I was thinking about:
fixing errors in text
extracting smileys
cleaning garbage
advice on what to do with special structures: IP addresses, dates, addresses
I've only had a stab at some simple things (like this RNN model on presidential speeches), but generally I would throw a lot of regex at it and then let something like spaCy do the trick (for word-level data).
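To make the "throw regex at it" step a bit more concrete, here is a rough sketch of what I mean. The patterns and the placeholder tokens (`<IP>`, `<DATE>`, `<SMILEY>`) are my own assumptions for illustration, not anything standard, and I am reading "smiles" in the question as emoticons. Real data will almost certainly need stricter or broader patterns.

```python
import re

# Illustrative patterns only (assumptions, not a complete solution).
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")
SMILEY_RE = re.compile(r"(?<!\w)[:;=][-']?[)(DPp](?!\w)")


def normalize(text):
    """Return (cleaned_text, smileys).

    Special structures are replaced by placeholder tokens so the model
    sees one symbol per kind of structure instead of many rare strings.
    """
    smileys = SMILEY_RE.findall(text)           # keep them for later use
    text = SMILEY_RE.sub(" <SMILEY> ", text)
    text = IP_RE.sub("<IP>", text)
    text = DATE_RE.sub("<DATE>", text)
    text = re.sub(r"\s+", " ", text).strip()    # collapse leftover whitespace
    return text, smileys


cleaned, smileys = normalize("Logged in from 192.168.0.1 on 2019-03-14 :)")
print(cleaned)
print(smileys)
```

After something like this you would hand the cleaned text to the tokenizer. Whether to keep, replace, or drop each structure depends on the task, so treat the replacements as one option rather than a recommendation.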
More info would probably help here: what are you trying to build, what do you have as inputs, etc.
Would be cool to hear from someone doing this in a professional setting. But I'm guessing there isn't an easy rule of thumb to follow here, and everything will be situation- and data-specific.
Fixing typos is quite a unique requirement and I am not sure it is commonly encountered. I am also not familiar with any tool that would be able to do this for you but my NLP knowledge is limited.
You could cook something up where you check the words against a dictionary and maybe map them to the correct words if the number of unique misspellings is low. Not a great solution. I know Ruby libraries often suggest how something should be spelled when you get an error, so maybe this is a direction worth exploring.
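A minimal sketch of that dictionary idea, using only the standard library's `difflib` for the "did you mean" part. The function names, the toy dictionary, and the 0.8 cutoff are all assumptions I made up for the example; you would plug in a real word list and tune the cutoff on your own data.

```python
import difflib


def build_corrections(tokens, dictionary, cutoff=0.8):
    """Map each out-of-dictionary token to its closest dictionary word, if any."""
    known = sorted(dictionary)
    corrections = {}
    for word in set(tokens):
        if word in dictionary:
            continue
        # get_close_matches returns the best matches above the cutoff, best first
        match = difflib.get_close_matches(word, known, n=1, cutoff=cutoff)
        if match:
            corrections[word] = match[0]
    return corrections


def apply_corrections(tokens, corrections):
    """Replace tokens that have a known correction, leave the rest alone."""
    return [corrections.get(tok, tok) for tok in tokens]


# Toy data, purely illustrative.
dictionary = {"clean", "the", "text", "before", "training"}
tokens = ["clean", "the", "text", "befor", "trainning", "xqzv"]

corrections = build_corrections(tokens, dictionary)
print(corrections)
print(apply_corrections(tokens, corrections))
```

Tokens with no close match (like `xqzv` here) are left untouched, so it is worth eyeballing the `corrections` mapping before applying it, especially for names and domain terms that are not in the dictionary.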
My best bet, and this is where I would start, would be to mark words that occur fewer than n times as unknown. Torchtext supports this and so does fastai.
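If you want to see what that does without pulling in either library, the idea fits in a few lines of plain Python. This is just a sketch of the technique, not how torchtext or fastai implement it; the `replace_rare` name, the `<unk>` token, and the default threshold are assumptions for the example.

```python
from collections import Counter


def replace_rare(docs, min_count=2, unk="<unk>"):
    """Replace tokens seen fewer than min_count times (corpus-wide) with unk.

    docs is a list of token lists.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [tok if counts[tok] >= min_count else unk for tok in doc]
        for doc in docs
    ]


# Toy corpus: "sat" and "slept" each occur once, so they become <unk>.
docs = [["the", "cat", "sat"], ["the", "cat", "slept"]]
print(replace_rare(docs, min_count=2))
```

A nice side effect is that most one-off typos end up as unknown tokens too, which is why I would try this before any spelling correction.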