NLP tokenizer returns odd values

Hello,

I have recently finished the first part of the fastai course and wanted to build an NLP model using the ULMFiT approach explained in lessons 3 and 4. I have a (pretty large) set of reviews, each with a text and a rating, and I want to create a TextLMDataBunch from the text to train the language model learner on. I run the following code:

 bs=64
 data_lm = (TextLMDataBunch.from_csv(path, 'valid.csv', text_cols='text')
            .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))

after which I call data_lm.show_batch(). However, this returns mostly numbers and xxunk tokens instead of the text I expect. Does anyone know why this is the case?

Fixed this: TextLMDataBunch.from_csv already returns a fully built DataBunch, so chaining the data block calls (split_by_rand_pct, label_for_lm, databunch) on top of it is what corrupts the output. Either start from TextList and keep the data block pipeline, or call TextLMDataBunch.from_csv on its own without the extra calls.
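For anyone hitting the same thing, a sketch of both fixes (assuming fastai v1, a `path` pointing at your data folder, and a valid.csv with a 'text' column; adjust names to your setup):

```python
from fastai.text import TextList, TextLMDataBunch

bs = 64

# Option 1: factory method only.
# from_csv already returns a DataBunch, so no data block calls are needed.
data_lm = TextLMDataBunch.from_csv(path, 'valid.csv', text_cols='text', bs=bs)

# Option 2: data block API.
# Start from TextList instead and build the DataBunch yourself.
data_lm = (TextList.from_csv(path, 'valid.csv', cols='text')
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=bs))

data_lm.show_batch()  # should now display readable tokenized text
```

Use one option or the other, not both; the original error came from mixing them.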