Text Cleaning for ULMFiT

I have used ULMFiT for classification of tickets raised by users of an IT service. It is a multi-label classification problem. The users are not native English speakers, so there is a significant number of grammatical errors and spelling mistakes. Additionally, there are emojis and special characters. My question is whether I should clean up the text by just removing the emojis and special characters, or whether I should also try to correct spelling mistakes and perform lemmatization, in the hope of achieving higher classification accuracy? Thank you.
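For reference, the kind of minimal clean-up I had in mind is something like the sketch below (the emoji ranges and the punctuation kept are only illustrative, not exhaustive):

```python
import re

# Illustrative emoji stripper: these Unicode ranges cover common emoji blocks,
# but they are not a complete list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U00002700-\U000027BF"  # dingbats
    "]+",
    flags=re.UNICODE,
)

def clean_ticket(text: str) -> str:
    text = EMOJI_PATTERN.sub(" ", text)           # remove emojis
    text = re.sub(r"[^\w\s.,!?'-]", " ", text)    # remove other special characters
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace

print(clean_ticket("Printer not working 😡!! plz fix @#$"))
```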

Are you fine-tuning the language model using all of your ticket data? Assuming that you are, and it isn’t giving you enough of a boost, how about fine-tuning first on a social media corpus (there are many Twitter datasets, for example)? These should contain more informal phrasing, spelling mistakes, and emojis than WikiText-103 does. Then you can fine-tune again on your own corpus of tickets. This might be a helpful strategy.
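Roughly, the first stage could look like the sketch below, assuming fastai v1; the file name `tweets.csv`, the column name, the saved model names, and the hyperparameters are just placeholders:

```python
from fastai.text import *
import pickle

path = Path('data')

# Stage 1: take the WikiText-103 pretrained AWD-LSTM and fine-tune it on an
# informal social-media corpus (here, a Twitter dataset with a 'text' column).
data_tweets = TextLMDataBunch.from_csv(path, 'tweets.csv', text_cols='text')
learn = language_model_learner(data_tweets, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3)

# Save the weights and vocabulary so they can be loaded again for stage 2
# (fine-tuning on the ticket corpus).
(path/'models').mkdir(parents=True, exist_ok=True)
learn.save('tweet_lm')  # -> path/models/tweet_lm.pth
pickle.dump(learn.data.vocab.itos, open(path/'models'/'tweet_lm_itos.pkl', 'wb'))
```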

I don’t know about trying to correct spelling mistakes or deleting emojis. I would tend to think that if the spelling mistakes are regular enough for you to fix them with a rule, the neural network should learn to deal with them anyway. But if they are all just getting reduced to “xxunk”, then maybe not. You should always experiment and see what works best.
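If you want to check that, a rough way (again fastai v1; `data_lm` is assumed to be your ticket language-model DataBunch, and the internals may differ slightly between versions) is to count how many numericalized training tokens hit the unknown index:

```python
import numpy as np

# Fraction of training tokens that fall outside the vocabulary and are
# therefore mapped to the unknown token "xxunk".
unk_idx = data_lm.vocab.stoi['xxunk']
all_ids = np.concatenate(data_lm.train_ds.x.items)   # numericalized token ids
unk_rate = (all_ids == unk_idx).mean()
print(f'{unk_rate:.1%} of training tokens map to xxunk')
```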

Thanks for the suggestion. I shall see if the fine-tuning strategy involving a social media corpus yields better results.

I am not aware of a way to load a previously trained language model.

The pre-trained model is “tweet_lm”, and I intend to use this for fine-tuning on the ticket data.

You can use “pretrained_fnames” to load the weights and itos file (vocabulary) of the pretrained “tweet_lm” language model into your new model; an example is here.
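A minimal sketch of what that could look like with fastai v1 (assuming the weights were saved as `tweet_lm.pth` and the vocabulary as `tweet_lm_itos.pkl` under `path/models/`, and that `tickets.csv` holds your ticket text; the names are illustrative):

```python
from fastai.text import *

path = Path('data')

# Build the language-model DataBunch from the ticket data.
data_tickets = TextLMDataBunch.from_csv(path, 'tickets.csv', text_cols='text')

# Load the previously fine-tuned "tweet_lm" weights and vocabulary instead of
# the default WikiText-103 ones; pretrained_fnames takes the (weights, itos)
# file names without extensions, looked up in path/models/.
learn = language_model_learner(
    data_tickets, AWD_LSTM,
    pretrained=False,
    pretrained_fnames=['tweet_lm', 'tweet_lm_itos'],
    drop_mult=0.3,
)

# Continue fine-tuning on the tickets.
learn.fit_one_cycle(1, 1e-2)
```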

Thanks a lot