Removing delimiters in text processing

dipam7 · August 20, 2020, 7:40pm

I am using the PubMed-RCT dataset for text classification. It contains a bunch of abstracts and a label. The abstracts are in a text file. Here’s a sample abstract.

As you can see, it has a lot of delimiters like ‘\n’, ‘\t’, and so on. Now, for text classification using ULMFiT or any other method, I know that we want to keep the data as raw as possible. My question is: are these delimiters useful for the neural net, or can they be safely removed? (I will try this myself, of course, but I would like to know whether anyone else has and what the results were.)

muellerzr · August 20, 2020, 7:46pm

It’s common practice to remove escape characters like these, since from what I’ve read they don’t provide much contextual value (from both a DL and a fastai perspective).

morgan · August 20, 2020, 8:00pm

Agree with @muellerzr, and (if you don’t know already) I think the default fastai text processing rules will do the job for you; see “preprocessing rules” here: http://dev.fast.ai/text.core
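
For anyone who wants to try the removal by hand and compare results, here is a minimal sketch in plain Python. It is not fastai's own preprocessing, just a standard-library approximation of "replace tabs and newlines with spaces, then collapse repeated whitespace". The function name `strip_delimiters` and the sample string are made up for illustration and are not taken from the dataset.

```python
import re

def strip_delimiters(text: str) -> str:
    """Replace tabs/newlines with spaces and collapse runs of whitespace.

    Hypothetical helper for illustration; not part of fastai.
    """
    # \s matches spaces, tabs, newlines and carriage returns, so any run
    # of them becomes a single space.
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample text, only meant to resemble an abstract with delimiters.
sample = "OBJECTIVE\tTo assess the effect.\nMETHODS\tA total of 125 patients.\n"
cleaned = strip_delimiters(sample)
print(cleaned)
```

Training once on the raw text and once on the cleaned text would be a simple way to check whether the delimiters matter for a given model.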