ULMFit: understanding learner.predict

I am using ULMFiT for a regression task. After fine-tuning and training the classifier, I want to make predictions on new data (which has been preprocessed the same way as the training data).

I am trying to figure out which function to use:
I see: learn.predict, learn.predict_array, learn.predict_dl, learn.predict_with_targs.

It seems like most of these work with the dataset used to initialize RNN_Learner().
However, the predicted array (from learn.predict) is larger than the dataset used above.

preds = learn.predict()
len(preds) # 1158004
len(unsup_labels) # 27007

Why is that the case?


First, you need to preprocess your test data the same way as the validation set (e.g. use SortSampler, not SortishSampler).
Second, to get predictions on the test dataset, pass is_test=True to learn.predict.


Thanks for the response @asotov!
I am still getting a number of predictions that differs from the number of items in the test dataloader.

Did I have to have the test data loader when I fit the classifier? Because I am just loading parts of the model later.

unsup_ds = TextDataset(unsup_clas, np.zeros(len(unsup_labels)))
unsup_samp = SortSampler(unsup_clas, key=lambda x: len(unsup_clas[x]))
unsup_dl = DataLoader(unsup_ds, bs, num_workers=1, pad_idx=1, sampler=unsup_samp)

len(unsup_clas) # 27007
len(unsup_dl.dataset.x) # 27007

md = ModelData(PATH, trn_dl, val_dl, test_dl=unsup_dl)
m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)

preds = learn.predict(is_test=True)
len(preds) # 816067

I read your code carefully, @leonyin, and I think you just forgot to include the transpose parameter when you created the DataLoader:

unsup_dl = DataLoader(unsup_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=unsup_samp)
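
A rough sketch of why a missing transpose could inflate the prediction count. It assumes, without having checked the library source here, that the model reads each batch as (seq_len, batch_size) and emits one prediction per item along the second dimension. The helpers pad_batch, fake_rnn_classifier and count_predictions are hypothetical stand-ins written in plain NumPy, not fastai functions.

```python
import numpy as np

def pad_batch(docs, pad_idx=1):
    """Pad a list of token-id lists into a (batch_size, max_len) array."""
    max_len = max(len(d) for d in docs)
    out = np.full((len(docs), max_len), pad_idx, dtype=np.int64)
    for i, d in enumerate(docs):
        out[i, :len(d)] = d
    return out

def fake_rnn_classifier(batch):
    """Hypothetical model stand-in: treats input as (seq_len, batch_size)
    and returns one prediction row per item along dimension 1."""
    seq_len, batch_size = batch.shape
    return np.zeros((batch_size, 1))

def count_predictions(docs, bs, transpose):
    """Count how many prediction rows come back over the whole dataset."""
    n = 0
    for start in range(0, len(docs), bs):
        batch = pad_batch(docs[start:start + bs])  # (batch_size, max_len)
        if transpose:
            batch = batch.T                        # (max_len, batch_size)
        n += len(fake_rnn_classifier(batch))
    return n

# Toy documents with 20 to 199 tokens each, sorted longest first
rng = np.random.default_rng(0)
docs = [list(rng.integers(2, 100, size=rng.integers(20, 200))) for _ in range(100)]
docs.sort(key=len, reverse=True)

n_with = count_predictions(docs, bs=16, transpose=True)      # one per document
n_without = count_predictions(docs, bs=16, transpose=False)  # one per padded token position
print(n_with, n_without)
```

Under that assumption, the untransposed loader yields one prediction per padded token position in each batch rather than one per document, which would be consistent with getting far more than 27007 predictions.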

Here is my code, which uses test_dl:

trn_ds = TextDataset(trn_clas, np.array(trn_labels).astype('int64').reshape(-1))
val_ds = TextDataset(val_clas, np.array(val_labels).astype('int64').reshape(-1))
y = np.zeros((len(test_clas)), dtype='int')
test_ds = TextDataset(test_clas, y)

trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
test_samp = SortSampler(test_clas, key=lambda x: len(test_clas[x]))

trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
test_dl = DataLoader(test_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=test_samp)

md = ModelData(PATH, trn_dl,val_dl, test_dl)


Thanks again @asotov, that got me the right number of predictions.
It looks like SortSampler re-orders the data. I am guessing I have to sort my labels too if I want to align my predictions with the labels (which are just IDs in my case).

Glad to hear it. @leonyin, I also have the same problem with the wrong order of predictions, but I didn't understand why it happens. So, following your guess, I think we need to try predicting without sorting. Thank you!

In my case I see an accuracy of 95.73 during training, but when I call the predict method and check the accuracy manually, I get only 63%! Now I understand that the reason is SortSampler :) I will try to deal with it and write up the result here later.

So, I just removed the sampler, and now everything works as expected: the rows keep their original order!

test_dl = DataLoader(test_ds, bs, transpose=True, num_workers=1, pad_idx=1)
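
If you would rather keep the length-sorted batching, one possible alternative is to undo the permutation after predicting. The sketch below uses plain NumPy and toy data. It assumes the sampler visits documents longest first and that the same index order can be recovered (for example by iterating the sampler, which is an assumption about its behavior). The names docs, ids and preds_sorted are made up for illustration.

```python
import numpy as np

# Toy tokenized documents and their IDs, in the original order
docs = [[5, 6, 7], [5], [5, 6, 7, 8, 9], [5, 6]]
ids = np.array([101, 102, 103, 104])

# Order in which a length-sorted sampler would visit the documents
# (longest first is assumed here)
order = np.array(sorted(range(len(docs)), key=lambda i: len(docs[i]), reverse=True))

# Stand-in predictions, returned in sampler order; each one is just the
# document length so the alignment is easy to check by eye
preds_sorted = np.array([float(len(docs[i])) for i in order])

# Scatter each prediction back to the position of the document it belongs to
preds = np.empty_like(preds_sorted)
preds[order] = preds_sorted

for doc_id, p in zip(ids, preds):
    print(doc_id, p)
```

The same result can be obtained with preds_sorted[np.argsort(order)]. Either way, the IDs and labels stay in their original order and only the predictions are moved.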

I did the same and everything looks good! This might be something we want to incorporate into the imdb notebook, cc'ing @jeremy. Thanks again for working through this with me, @asotov!!