"IndexError: list index out of range" when training a text classifier and dataset is >= 750,000 examples

When I run the code below on a sample subset of the data (500k examples), it works fine. But when I run it against the full dataset (train_df has 1,306,122 records), I get an IndexError: list index out of range exception right as fitting concludes, just before the validation set is processed.

data_clas = (TextList
             .from_df(train_df, path=path, cols=['question_text'], processor=txt_proc)
             .random_split_by_pct(valid_pct=.1)
             .label_from_df(cols=['target'])
             .add_test(TextList.from_df(test_df, path, cols=['question_text']))
             .databunch(bs=50)
            )

learn = text_classifier_learner(data_clas, drop_mult=0.5)
learn.load_encoder('lm-fine_tuned_enc')

learn.fit_one_cycle(1, 3e-2, moms=(0.8,0.7))

Here’s some of the stack trace …

I'm curious why this works on a subset of the training examples but blows up whenever I use more than roughly 750k examples.

I think there may be a bug with TextClasDataBunch.create()

When I run this code:

data_clas = (TextList
             .from_df(train_df, path=path, cols=['question_text'], processor=txt_proc)
             .random_split_by_pct()
             .label_from_df(cols=['target'])
             .add_test(TextList.from_df(test_df, path, cols=['question_text']))
             .databunch(bs=50)
            )
b = next(iter(data_clas.valid_dl))
b[0]

… I get the “IndexError”. But, I don’t get that error when trying to iterate over train_dl or test_dl.

If I take out the .add_test(...) line, then I’m able to iterate over both train_dl and valid_dl.
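In case it helps anyone reproduce this, here is a minimal sketch of the check described above: try to pull one batch from each loader and record which ones raise. The helper name probe_first_batch is hypothetical (it is not part of fastai), and it only assumes each loader is a non-empty iterable, so it says nothing about why valid_dl fails.

```python
def probe_first_batch(loaders):
    """Try to pull one batch from each loader.

    `loaders` maps a label (e.g. 'train') to any iterable of batches.
    Returns a dict mapping each label to None on success, or to the
    IndexError that was raised while fetching the first batch.
    """
    results = {}
    for name, dl in loaders.items():
        try:
            next(iter(dl))       # same call as in the post: next(iter(data_clas.valid_dl))
            results[name] = None
        except IndexError as exc:
            results[name] = exc  # keep the exception so the message can be inspected
    return results


# Usage with the databunch from the post (assumes data_clas is already built):
# probe_first_batch({'train': data_clas.train_dl,
#                    'valid': data_clas.valid_dl,
#                    'test': data_clas.test_dl})
```

Running it once with the .add_test(...) line and once without makes it easy to compare the two cases side by side.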

Am I doing something wrong with the data block API, or is there indeed something wrong with the TextClasDataBunch.create() and maybe also the TextClasDataBunch.load() methods?

Any luck with this? I tried to predict the test set with

pred, _, _ = learn.predict(test.question_text[i])

in a for loop, but didn't get great results.

pred,y,z = learn.predict(test.question_text[i]) works fine for me, although it's very slow. Here z holds the probabilities of the labels.
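For reference, a rough sketch of that loop as a workaround, collecting the third return value for every row. The helper name predict_probs is made up for illustration; the only thing assumed about learn is that learn.predict(text) returns three values with the probabilities last, as described above. It is still one call per row, so it will be just as slow.

```python
def predict_probs(learn, texts):
    """Call learn.predict on each text and collect the label probabilities.

    Assumes learn.predict(text) returns a 3-tuple whose last element
    holds the probabilities of the labels.
    """
    probs = []
    for text in texts:
        _, _, p = learn.predict(text)  # first two values are not needed here
        probs.append(p)
    return probs


# Usage (assumes learn and test are defined as in the thread):
# test_probs = predict_probs(learn, test.question_text)
```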

There is a PR open for this now from another contributor.

This is a particularly nasty bug. Hopefully the fix will be incorporated into a release soon.