get_preds returns a smaller-than-expected tensor

Hi there,

I am experiencing an issue with prediction in fastai 1.0.52.

I have trained a text.text_classifier_learner. When I call get_preds on the training data, I get a tensor that is smaller than expected (fewer rows than in the training dataset).

# from fastai import text
pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train)

In [17]: pred_tens[0].shape
Out[17]: torch.Size([1920, 29])

So the output has 1920 rows. However, the expected size is 1952, as can be seen from the data attached to my learner:

print(learn_cls.data)

TextClasDataBunch;

Train: LabelList (1952 items)
x: TextList
xxbos xxmaj hey there, some text
y: MultiCategoryList
CAT1,CAT2,...
Path: E:\myproject\data\full_samples\ulmfit;

Valid: LabelList (487 items)
x: TextList
xxbos hello . some text
y: MultiCategoryList
CAT1,CAT2,...
Path: E:\myproject\data\full_samples\ulmfit;

Test: None

Any idea regarding how I can troubleshoot this issue?
(I do not encounter this issue with the validation set. I also added a test set at some point and had no issue there either.)


(Appendix)
If it helps, here is more detail on how I constructed the classification databunch:

from fastai import text
#...
# At some point, I get a databunch data_lm, which contains the vocab used in the
# encoder I will transfer in my model
processor = [text.TokenizeProcessor(tokenizer=tokenizer), 
             text.NumericalizeProcessor()]
data_cls = text.TextList.from_df(df=df, cols="textcol", processor=processor,
    vocab=data_lm.vocab, path="E:/myproject/data/full_samples/ulmfit")
data_cls = data_cls.split_by_rand_pct(valid_pct=.2)
data_cls = data_cls.label_from_df(cols="labelcol")
data_cls = data_cls.databunch(bs=64, num_workers=0)

Also, something interesting happens when I ask for ordered predictions (ordered=True), i.e.

pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train, ordered=True)

# long error log, which ends with
~\AppData\Local\Continuum\anaconda3\envs\caap\lib\site-packages\fastai\text\learner.py in <listcomp>(.0)
     89             sampler = [i for i in self.dl(ds_type).sampler]
     90             reverse_sampler = np.argsort(sampler)
---> 91             preds = [p[reverse_sampler] for p in preds]
     92         return(preds)
     93

IndexError: index 1921 is out of bounds for dimension 0 with size 1920

Using Python’s debugger, I saw that reverse_sampler has the expected shape, (1952,).
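The size mismatch in the traceback can be reproduced with a small NumPy sketch (shapes taken from the error above; np.arange here is just a stand-in for the real reverse_sampler, which is some permutation of 0..1951):

```python
import numpy as np

preds = np.zeros((1920, 29))         # what get_preds actually returned
reverse_sampler = np.arange(1952)    # stand-in for a permutation over all 1952 items
try:
    preds[reverse_sampler]           # same fancy indexing as learner.py line 91
except IndexError as err:
    # any index >= 1920 is out of bounds for a tensor with 1920 rows
    print(err)
```

Any permutation of 1952 indices necessarily contains an index of at least 1920, so the indexing is guaranteed to fail whenever the predictions tensor is short.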

The training dataloader has drop_last=True, which means the last batch is discarded if it isn’t of size batch_size. Why? A partial batch leads to instability in training when you have batchnorm layers, and even to errors if you get a batch of size 1.
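That matches the numbers above: with bs=64 (from the databunch construction in the appendix) and 1952 training items, dropping the last partial batch leaves exactly 1920 rows. A quick check:

```python
n_train, bs = 1952, 64           # sizes reported by the DataBunch above
n_full_batches = n_train // bs   # 30 full batches
n_kept = n_full_batches * bs     # rows that survive drop_last=True
n_dropped = n_train - n_kept     # the partial last batch
print(n_kept, n_dropped)         # 1920 32
```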

That’s why, to get your predictions on the training set, you should use DatasetType.Fix: it’s the same dataset as Train, but without shuffling and without drop_last. Also, in text, you’ll need to add ordered=True to your call to get_preds, because the samples are sorted by length and you want your predictions in the order of the dataset.
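A minimal pure-NumPy sketch of what ordered=True does under the hood (the toy lengths here are hypothetical): the sampler visits samples sorted by length, and argsort of the sampler recovers dataset order:

```python
import numpy as np

# Hypothetical text lengths for a tiny 5-item dataset
lengths = np.array([3, 7, 1, 5, 2])
# Sampler visits items by decreasing length, like fastai's SortSampler
sampler = np.argsort(-lengths)            # [1, 3, 0, 4, 2]
preds_sorted = lengths[sampler]           # predictions come out in sampler order
# argsort of the sampler maps sorted positions back to dataset order
reverse_sampler = np.argsort(sampler)
preds_ordered = preds_sorted[reverse_sampler]
print(preds_ordered)                      # [3 7 1 5 2], i.e. the original order
```

This un-sorting only works if preds and the sampler have the same length, which is exactly what drop_last=True breaks on DatasetType.Train.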


Thank you very much for your thorough reply; these explanations are very helpful! 🙂