I’ve got a data set where each example has raw text + metadata. The metadata is always in English, and the raw text is usually in English, but not always. I don’t want to build separate classifiers for languages other than English; ignoring the text data for the non-English cases is all right. However, the non-English texts might contain useful English keywords strewn in between, so I don’t want to run a language detector and delete raw texts that are not English.
As a result, some examples consist mostly of `xxunk`s. How do I remove the `xxunk`s from the `TextDataBunch`es? Perhaps I should only remove them from a sample if the proportion of `xxunk`s to all other tokens exceeds some threshold.
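To make the threshold idea concrete, here is a rough sketch in plain Python. It deliberately does not touch the fastai API: it assumes each sample is already available as a list of token strings, and the names `unk_ratio`, `clean_sample` and `max_unk_ratio` are made up for illustration. It also measures `xxunk`s as a fraction of all tokens in the sample, which is one possible reading of "proportion".

```python
from typing import List, Sequence

UNK = "xxunk"  # the unknown-token marker that shows up in the samples


def unk_ratio(tokens: Sequence[str], unk: str = UNK) -> float:
    """Return the fraction of tokens equal to `unk` (0.0 for an empty sample)."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t == unk) / len(tokens)


def clean_sample(
    tokens: Sequence[str], max_unk_ratio: float = 0.5, unk: str = UNK
) -> List[str]:
    """Drop every `unk` token, but only if the sample exceeds the threshold.

    Samples at or below the threshold are returned unchanged, so mostly
    English texts keep their occasional unknown tokens.
    """
    if unk_ratio(tokens, unk) > max_unk_ratio:
        return [t for t in tokens if t != unk]
    return list(tokens)


mostly_unknown = ["xxunk", "xxunk", "python", "xxunk"]
mostly_english = ["the", "xxunk", "model", "works"]

print(clean_sample(mostly_unknown))  # only the known keyword is kept
print(clean_sample(mostly_english))  # left as it was
```

The default of 0.5 is arbitrary; the right value would have to be tuned on the actual data. How to apply this to the tokenized items inside a `TextDataBunch` depends on the fastai version and is not shown here.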