get_preds returns a smaller-than-expected tensor

Hi there,

I am experiencing an issue with prediction in fastai 1.0.52.

I have trained a text.text_classifier_learner. When I use get_preds on the training data, I get a tensor smaller than expected (fewer rows than in the training dataset).

# from fastai import text
pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train)

In [17]: pred_tens[0].shape
Out[17]: torch.Size([1920, 29])

So the output has 1920 rows. However, the expected size is 1952, as can be seen from the data attached to my learner:



Train: LabelList (1952 items)
x: TextList
xxbos xxmaj hey there, some text
y: MultiCategoryList
Path: E:\myproject\data\full_samples\ulmfit;

Valid: LabelList (487 items)
x: TextList
xxbos hello . some text
y: MultiCategoryList
Path: E:\myproject\data\full_samples\ulmfit;

Test: None

Any idea how I can troubleshoot this issue?
(I do not encounter this issue with the validation dataset. I also added a test dataset at some point and did not encounter the issue there either.)

If it helps, here is more detail on how I constructed the classification databunch:

from fastai import text
# At some point, I get a databunch data_lm, which contains the vocab used in the
# encoder I will transfer in my model
processor = [text.TokenizeProcessor(tokenizer=tokenizer),
             text.NumericalizeProcessor(vocab=data_lm.vocab)]  # second processor assumed: numericalize with the LM vocab
data_cls = text.TextList.from_df(df=df, cols="textcol", processor=processor,
    vocab=data_lm.vocab, path="E:/myproject/data/full_samples/ulmfit")
data_cls = data_cls.split_by_rand_pct(valid_pct=.2)
data_cls = data_cls.label_from_df(cols="labelcol")
data_cls = data_cls.databunch(bs=64, num_workers=0)

Also, something interesting happens when I ask for ordered predictions (ordered=True), i.e.

pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Train, ordered=True)

# long error log, which ends with
~\AppData\Local\Continuum\anaconda3\envs\caap\lib\site-packages\fastai\text\ in <listcomp>(.0)
     89             sampler = [i for i in self.dl(ds_type).sampler]
     90             reverse_sampler = np.argsort(sampler)
---> 91             preds = [p[reverse_sampler] for p in preds]
     92         return(preds)

IndexError: index 1921 is out of bounds for dimension 0 with size 1920

Using Python’s debugger, I saw that reverse_sampler is, as expected, of shape (1952,).
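For what it’s worth, the mismatch can be reproduced in isolation. This is a sketch with numpy, using only the shapes from the traceback (the real sampler is fastai’s sorted-by-length train sampler; a random permutation stands in for it here):

```python
import numpy as np

# The sampler covers all 1952 training items, but preds has only
# 1920 rows, so inverting the sampler indexes past the end.
sampler = np.random.permutation(1952)   # stand-in for the train sampler
reverse_sampler = np.argsort(sampler)   # shape (1952,), as seen in the debugger
preds = np.zeros((1920, 29))            # one row per returned prediction

try:
    preds[reverse_sampler]              # uses indices up to 1951 on size 1920
    out_of_bounds = False
except IndexError as e:
    out_of_bounds = True
    print(e)                            # ...out of bounds for axis 0 with size 1920
```

So the error is not in the argsort itself; preds is simply 32 rows short of the sampler.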

The training set has drop_last=True, which means we discard the last batch if it is not of size batch_size. Why? Because a partial batch leads to instability in training when you have batchnorm layers, and even errors if you get a batch of size 1.

That’s why, to get your predictions on the training set, you should use DatasetType.Fix: it is the same as Train but without shuffling and without drop_last. Also, in text, you’ll need to add ordered=True to your call to get_preds, because the samples are sorted by length and you want your predictions in the order of the dataset.
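Putting that advice together, the call from the question becomes (a sketch; learn_cls and the text import are as defined earlier in the thread, so this won’t run on its own):

```python
from fastai import text

# DatasetType.Fix iterates the training data without shuffling and without
# drop_last, so every row gets a prediction; ordered=True undoes the
# sort-by-length so predictions line up with the dataset order.
pred_tens = learn_cls.get_preds(ds_type=text.DatasetType.Fix, ordered=True)
# pred_tens[0].shape should now be torch.Size([1952, 29])
```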


Thank you very much for your thorough reply, these explanations are very helpful :slight_smile: