TextLMDataBunch.from_csv very slow

data = TextLMDataBunch.from_csv(OUT_PATH, 'model/tokenized_single.csv', valid_pct=0.2)

I have a tokenized_single.csv file with 12 million entries :smile: and that is only 10% of the data I need to train on.

Creating this TextLMDataBunch is extremely slow.
Is there anything I can do to speed it up?


If your text is already tokenized, you should pass just a NumericalizeProcessor (by default, the pipeline runs a TokenizeProcessor followed by a NumericalizeProcessor), so the expensive tokenization step is skipped entirely.
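Here is a minimal sketch of what that could look like with the fastai v1 data block API. This is an illustration, not a tested recipe: it assumes the CSV's text column already holds space-separated tokens, `OUT_PATH` is the path variable from the question above, and `max_vocab=60000` (the fastai default) is just an example setting.

```python
from fastai.text import TextList, NumericalizeProcessor

# Pass only a NumericalizeProcessor so fastai skips tokenization
# and goes straight from tokens to integer ids.
data = (TextList.from_csv(OUT_PATH, 'model/tokenized_single.csv',
                          processor=NumericalizeProcessor(max_vocab=60000))
        .split_by_rand_pct(0.2)   # same split as valid_pct=0.2
        .label_for_lm()           # language-model labels
        .databunch())
```

With 12 million rows, the numericalization pass will still take a while, but it avoids re-running spaCy tokenization over text that is already tokenized.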