Remove xxup from special cases in TextList.from_df

yoshy · April 24, 2020, 12:51am

I am trying to stop the xxup token from appearing in my tokenized text. Based on other posts, I thought the following code should work.

tokenizer = Tokenizer(SpacyTokenizer, special_cases=['UNK', 'PAD', 'BOS', 'FLD', 'TK_MAJ', 'TK_REP', 'TK_WREP'])
processors = [TokenizeProcessor(tokenizer=tokenizer),
              NumericalizeProcessor()]
data_lm = TextList.from_df(vv[['DX']], processor=processors).split_by_rand_pct(0.1).label_for_lm().databunch(bs=bs, bptt=bptt)
data_lm.show_batch()

However, the output from show_batch still includes xxup, for example:

123 xxup x4346 xxup foo

Capitals are irrelevant to my text data and model. How do I stop xxup?