TextList disable shuffle on test set

mnpinto · December 27, 2018, 8:23pm

I’m trying to make predictions on a test set for a text classification problem, but when I call x, y = next(iter(learn.data.test_dl)) I find that the data is shuffled (the sentences in the first batch don’t correspond to the sentences at the top of the csv file).

This is how I’m creating the learner:

test = TextList.from_csv(path, 'test.csv', cols='text')

data = (TextList.from_df(train, path, cols='text')
                .random_split_by_pct(0.2)
                .label_from_df(cols=2)
                .add_test(test)
                .databunch(path='.'))

learn = text_classifier_learner(data, drop_mult=0.5)

Looking at the source code, if I’m not missing anything, I see that the SortSampler is applied to the valid and test sets, sorting the data by length.

So my question is how to disable the sorting on the test set, or how to recover the sort indices so I can put the predictions back in the original order of the csv. I guess there should be an easy way of doing it that I’m missing.

Thanks in advance!

yang-zhang · February 18, 2019, 9:01pm

I think this should work: preds_test, _ = learn.get_preds(ds_type=DatasetType.Test, ordered=True). This is how the indices are recovered: https://github.com/fastai/fastai/blob/master/fastai/text/learner.py#L84-L85
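
For anyone wondering what recovering the indices amounts to, here is a minimal pure-Python sketch of the general idea, not the fastai code itself. It assumes the sampler visits rows longest-first and that predictions come back in that visiting order. The helper names (sort_indices_by_length, restore_original_order) and the toy texts are made up for illustration.

```python
def sort_indices_by_length(lengths):
    """Return dataset indices ordered longest-first (illustrative stand-in for a length-sorting sampler)."""
    return sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)


def restore_original_order(sorted_preds, sampler_indices):
    """Place predictions produced in sampler order back into dataset (csv) order."""
    restored = [None] * len(sorted_preds)
    for position, original_index in enumerate(sampler_indices):
        # The prediction at `position` belongs to row `original_index` of the dataset.
        restored[original_index] = sorted_preds[position]
    return restored


# Toy data standing in for the rows of test.csv.
texts = ["a b c", "a", "a b c d e", "a b"]
lengths = [len(t.split()) for t in texts]

# Order in which a length-sorting sampler would visit the rows.
sampler_indices = sort_indices_by_length(lengths)

# Fake "predictions", generated in sampler order like a real forward pass would be.
sorted_preds = [f"pred for: {texts[i]}" for i in sampler_indices]

# Undo the permutation so that restored[i] lines up with texts[i].
restored = restore_original_order(sorted_preds, sampler_indices)
for text, pred in zip(texts, restored):
    print(text, "->", pred)
```

If the predictions are a tensor or array rather than a list, indexing them with the argsort of the sampler indices gives the same result in one step.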