TextLMDataBunch.from_csv very slow

data = TextLMDataBunch.from_csv(OUT_PATH, 'model/tokenized_single.csv', valid_pct=0.2)

I have a tokenized_single.csv file with 12 million entries :smile: and that is only 10% of the data I need to train on.

Creating this TextLMDataBunch is extremely slow.
Is there anything I can do to speed it up?


If your text is already tokenized, you should pass just a NumericalizeProcessor (by default, the pipeline runs a TokenizeProcessor followed by a NumericalizeProcessor), so the expensive tokenization step is skipped entirely.
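Here is a minimal sketch of what that could look like with the fastai v1 data block API. This is an illustration, not a tested recipe: it assumes the CSV's text column already holds space-separated tokens, `OUT_PATH` is the path variable from the question above, and `max_vocab=60000` (the fastai default) is just an example setting.

```python
from fastai.text import TextList, NumericalizeProcessor

# Pass only a NumericalizeProcessor so fastai skips tokenization
# and goes straight from tokens to integer ids.
data = (TextList.from_csv(OUT_PATH, 'model/tokenized_single.csv',
                          processor=NumericalizeProcessor(max_vocab=60000))
        .split_by_rand_pct(0.2)   # same split as valid_pct=0.2
        .label_for_lm()           # language-model labels
        .databunch())
```

With 12 million rows, the numericalization pass will still take a while, but it avoids re-running spaCy tokenization over text that is already tokenized.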