I am using the PubMed-RCT dataset for text classification. It contains abstracts, each with a label, stored in a text file. Here’s a sample abstract.
As you can see, it has a lot of delimiters like ‘\n’, ‘\t’, and so on. For text classification with ULMFiT or any other method, I know that we want to keep the data as raw as possible. My question is: are these delimiters useful for the neural net, or can they be safely removed? (I will try this myself, of course, but I would like to know whether anyone else has and what the results were.)
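For the comparison, here is a minimal sketch of how a "cleaned" variant could be produced alongside the raw text. It assumes the delimiters are real whitespace characters (not literal backslash sequences); `normalize_whitespace` is a hypothetical helper name, and the sample string is made up for illustration, not taken from the dataset.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (tabs, newlines, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up example string, not an actual PubMed-RCT record.
raw = "OBJECTIVE\tTo test the drug .\nMETHODS\tPatients were randomized .\n"
clean = normalize_whitespace(raw)

print(repr(raw))    # raw variant, delimiters kept
print(repr(clean))  # cleaned variant, delimiters replaced by single spaces
```

Training the same classifier once on the raw strings and once on the cleaned ones, with everything else held fixed, should show whether the delimiters matter.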