I have a dataframe of shape (150000, 1) with a single column called “comment_text”, which contains text fragments.
I’m trying to create a TextList with the following code:
data_lm = (TextList.from_csv(path, 'train.csv')
.split_by_rand_pct(0.1)
.label_for_lm()
.databunch(bs=48))
However, when I look at the resulting databunch, every single row is a number (the index) followed by something like this: ‘xxbos 41 xxbos 42 xxbos 43 xxbos 44 xxbos 45 xxbos 47 xxbos 48 xxbos 49 xxbos 50 xxbos 51 xxbos 52 xxbos 53’, i.e. only xx tokens and numbers.
When I check len(data_lm.vocab.itos), I get a very small number (104), which in no way reflects the number of words in the texts in my dataframe. When I look inside with data_lm.vocab.itos[:], I see only numbers such as ‘99’ and xx tokens such as ‘xxeos’.
What is happening here? Any idea why it doesn’t work?
When I only run
TextList.from_csv(path, 'train.csv')
then show_batch() shows readable text, though len(data_lm.vocab.itos) is still only 104.
As a sanity check I also tried converting only a small portion of the dataframe (10 rows), but this gave me the same result.
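
One more check that might narrow this down is to look at what actually sits in the first column of the CSV. The sketch below is only an illustration using the standard library: SAMPLE_CSV is a made-up stand-in for train.csv, and it assumes the file was written with the row index as an unnamed first column (which is what pandas' to_csv does unless index=False is passed). The helper column_preview is a hypothetical name, not part of fastai.

```python
import csv
import io

# Hypothetical stand-in for train.csv, ASSUMING it was written with the
# row index as an unnamed first column (the pandas to_csv default).
SAMPLE_CSV = (
    ",comment_text\n"
    "0,first comment here\n"
    "1,another comment\n"
    "2,a third one\n"
)

def column_preview(csv_text, col=0, n=3):
    """Return (header name, first n values) of the column at position col."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return header[col], [row[col] for row in body[:n]]

# In this sample, column 0 holds only row numbers; the text is in column 1.
print(column_preview(SAMPLE_CSV, col=0))
print(column_preview(SAMPLE_CSV, col=1))
```

If the real file looks like this sample, then whatever reads column 0 would only ever see row numbers, which would match the ‘xxbos 41 xxbos 42 ...’ output and the tiny vocab. As far as I can tell, TextList.from_csv takes a cols argument that defaults to 0, so passing cols='comment_text' may be worth trying, though I have not confirmed that this is the cause.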