Cleaning multi-lang data for text classification

I’ve got a data set where each example has raw text plus metadata. The metadata is always in English, and the raw text usually is too — but not always. I don’t want to build separate classifiers for languages other than English; ignoring the text data for the non-English cases is fine. However, the non-English texts often have useful English keywords scattered through them, so I don’t want to run a language detector and simply delete the raw texts it flags as non-English.

As a result, some examples consist mostly of xxunks. How do I remove the xxunks from a TextDataBunch? Perhaps I should only strip them from an example if the proportion of xxunk tokens in that example exceeds some threshold.
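To make the threshold idea concrete, here is a minimal sketch of what I have in mind. It's plain Python over token lists rather than fastai-specific code, and the function names and the 0.5 threshold are just placeholders:

```python
XXUNK = "xxunk"  # fastai's unknown-word token

def xxunk_ratio(tokens):
    """Fraction of tokens in an example that are xxunk."""
    return sum(t == XXUNK for t in tokens) / len(tokens) if tokens else 0.0

def strip_xxunks(tokens, threshold=0.5):
    """Drop xxunk tokens only when they dominate the example.

    If the xxunk ratio exceeds `threshold` (i.e. the text was mostly
    non-English), keep just the recognized tokens — typically the
    English keywords I care about. Otherwise leave the example as-is.
    """
    if xxunk_ratio(tokens) > threshold:
        return [t for t in tokens if t != XXUNK]
    return tokens
```

The idea would be to run this per example on the tokenized texts before (or instead of rebuilding) the TextDataBunch, so mostly-English examples keep their xxunks as genuine unknown-word markers.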