Problem with creating TextList: Too many xx-tokens

BelAir · November 7, 2019, 5:39pm

I have a dataframe of shape (150000, 1) with a single column called “comment_text”, which contains text fragments.
I’m trying to create a TextList with the following code:
data_lm = (TextList.from_csv(path, 'train.csv')
           .split_by_rand_pct(0.1)
           .label_for_lm()
           .databunch(bs=48))
When I look at the resulting databunch, however, every single row is a number (idx) followed by something like 'xxbos 41 xxbos 42 xxbos 43 xxbos 44 xxbos 45 xxbos 47 xxbos 48 xxbos 49 xxbos 50 xxbos 51 xxbos 52 xxbos 53', i.e. only xx-tokens and numbers.
When I check len(data_lm.vocab.itos), I get only a very small number (104), which in no way reflects the number of words in the texts in my dataframe. When I look inside with data_lm.vocab.itos[:], I see only numbers (e.g. '99') and xx-tokens (e.g. 'xxeos').
What is happening here? Any idea why it doesn’t work?

When I only run
TextList.from_csv(path, 'train.csv')
show_batch() shows readable text, though len(data_lm.vocab.itos) is still only 104.

As a sanity-check I also tried converting only a small portion of the dataframe (10 rows), but this still gave me the same result.

muellerzr · November 7, 2019, 5:42pm

You can pass text_cols (or something very similar) in your TextList call, set to 1 (or to the name of the column), to grab that column.

BelAir · November 7, 2019, 5:45pm

Hi, thank you for the quick response. In fact, I tried passing a 'text_cols' parameter to TextList.from_csv() earlier, but this throws the following error:
__init__() got an unexpected keyword argument 'text_cols'
Is this a bug in the library?

muellerzr · November 7, 2019, 5:57pm

Can you post the whole thing you tried?

BelAir · November 7, 2019, 6:55pm

I tried
(TextList.from_csv(path, 'train.csv', text_cols='comment_text')
 .split_by_rand_pct(0.1)
 .label_for_lm()
 .databunch(bs=48))
which threw the error
__init__() got an unexpected keyword argument 'text_cols'

I am now working around the problem with
data_lm = TextLMDataBunch.from_csv(path, 'train.csv')
which works straight out of the box. Still interested in what went wrong above though.
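
One possible explanation, offered as a guess rather than a confirmed diagnosis: if train.csv was written with the dataframe index as its first column (the pandas default for to_csv), then whichever loader reads column 0 gets the row numbers, not the comments. As far as I recall the fastai v1 signatures, TextList.from_csv selects its column through a `cols` argument that defaults to 0, while TextLMDataBunch.from_csv uses `text_cols` with a default of 1. That would account for both the vocab of numbers and the 'text_cols' error, but it is worth checking against the installed version. The sketch below uses only the standard library, and the three-row CSV and the `select_column` helper are made up for illustration:

```python
import csv
import io

# Hypothetical stand-in for train.csv, ASSUMING it was saved with the
# dataframe index as an unnamed first column.
raw = io.StringIO()
writer = csv.writer(raw)
writer.writerow(["", "comment_text"])
for idx, text in enumerate(["first comment", "second comment", "third comment"]):
    writer.writerow([idx, text])

raw.seek(0)
rows = list(csv.reader(raw))
header, body = rows[0], rows[1:]


def select_column(body, header, col):
    """Return one column's values, addressed by position or by header name."""
    pos = header.index(col) if isinstance(col, str) else col
    return [row[pos] for row in body]


# Selecting position 0 yields the index values, i.e. nothing but numbers.
first_col = select_column(body, header, 0)
# Selecting by name (or position 1) yields the actual text.
text_col = select_column(body, header, "comment_text")

print(first_col)
print(text_col)
```

If that is what is going on, passing the column explicitly, e.g. TextList.from_csv(path, 'train.csv', cols='comment_text'), should pick up the text (assuming the `cols` name is right for the installed fastai version).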