ULMFiT: understanding learner.predict

I am using ULMFiT for a regression task. After fine-tuning and training the classifier, I want to make predictions on new data (which has been preprocessed the same way as the training data).

I am trying to figure out which function to use:
I see: learn.predict, learn.predict_array, learn.predict_dl, learn.predict_with_targs.

It seems like most of these work with the dataset used to init RNN_Learner().
However, the size of the predicted array (from learn.predict) is larger than the size of the dataset used above.

preds = learn.predict()
len(preds) # 1158004
len(unsup_labels) # 27007

Why is that the case?


Firstly, you need to preprocess your test data like the validation set (e.g. use SortSampler, not SortishSampler).
Secondly, to get predictions on the test dataset you need to pass is_test=True:

 learn.predict(is_test=True)

Thanks for the response @asotov!
I am still getting a number of predictions that differs from the size of the test dataloader.

Did I need to include the test data loader when I fit the classifier? I am just loading parts of the model later.

unsup_ds = TextDataset(unsup_clas, np.zeros(len(unsup_labels)))
unsup_samp = SortSampler(unsup_clas, key=lambda x: len(unsup_clas[x]))
unsup_dl = DataLoader(unsup_ds, bs, num_workers=1, pad_idx=1, sampler=unsup_samp)

len(unsup_clas) # 27007
len(unsup_dl.dataset.x) # 27007

md = ModelData(PATH, trn_dl, val_dl, test_dl=unsup_dl)
m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.load_encoder('lm1_enc')
learn.load('clas_2')

preds = learn.predict(is_test=True)
len(preds) # 816067

I read your code carefully, @leonyin, and I think you just forgot to include the transpose parameter when you create the DataLoader:

unsup_dl = DataLoader(unsup_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=unsup_samp)

Here is my code where I use test_dl:

trn_ds = TextDataset(trn_clas, np.array(trn_labels).astype('int64').reshape(-1))
val_ds = TextDataset(val_clas, np.array(val_labels).astype('int64').reshape(-1))
y = np.zeros((len(test_clas)), dtype='int')
test_ds = TextDataset(test_clas, y)

trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
test_samp = SortSampler(test_clas, key=lambda x: len(test_clas[x]))

trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
test_dl = DataLoader(test_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=test_samp)

md = ModelData(PATH, trn_dl, val_dl, test_dl)


Thanks again @asotov, that got me the right number of predictions.
It looks like the SortSampler re-orders the data. I am guessing I have to sort my labels too if I want to align my predictions with my labels (which are just IDs for me).
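If you want to keep the SortSampler, one way to line predictions back up with the original rows is to recompute the sorted index order and invert that permutation. Here is a minimal numpy sketch with toy data (the variable names and values are made up for illustration, not the thread's actual unsup_clas/unsup_labels):

```python
import numpy as np

# Toy stand-ins for the tokenized test sequences and their labels/IDs.
test_seqs = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10]]
labels = np.array([10, 11, 12, 13])

# SortSampler iterates indices sorted by sequence length (longest first),
# so predictions come back in this order, not the original one.
sort_idx = sorted(range(len(test_seqs)),
                  key=lambda i: len(test_seqs[i]), reverse=True)

# Pretend model output: preds_sorted[k] is the prediction for
# test_seqs[sort_idx[k]].
preds_sorted = np.array([130, 100, 120, 110])

# Invert the permutation so predictions align with labels again.
inv = np.argsort(sort_idx)
preds_original = preds_sorted[inv]
```

With this toy data, sort_idx is [3, 0, 2, 1] and preds_original comes out as [100, 110, 120, 130], i.e. back in row order.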

Glad to hear it, @leonyin. I also had the same problem with the wrong order of predictions, but I didn't understand why. So I think you guessed right that we need to try predicting without sorting. Thank you!

In my case I see 95.73% accuracy during training, but when I call the predict method and check accuracy manually, it gives only 63%! Now I understand the reason is the SortSampler. I will try to deal with it and write a :memo: here later.

So, I just removed the sampler and now everything works as expected: rows are ordered as-is!

test_dl = DataLoader(test_ds, bs, transpose=True, num_workers=1, pad_idx=1)
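For reference, "checking accuracy manually" as mentioned above can be done like this once the predictions are in the same order as the labels. This is a generic numpy sketch with made-up scores and labels, not output from the actual model:

```python
import numpy as np

# Hypothetical class scores, one row per test example, in the same
# order as the labels (no sampler reshuffling the rows).
preds = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.3, 0.7]])
true_labels = np.array([1, 0, 0])

# Take the argmax over classes, then compare against the labels.
pred_classes = np.argmax(preds, axis=1)
accuracy = (pred_classes == true_labels).mean()
```

If the rows were still in SortSampler order, this comparison would pair predictions with the wrong labels, which is exactly why the manually computed accuracy looked so much worse than the training metric.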

I did the same and everything looks good! This might be something we want to incorporate into the imdb notebook, cc @jeremy. Again, thanks for working this through with me @asotov!!
