'get_preds' doesn't process incomplete last batch on training data

gene · January 18, 2019, 7:12pm

Hey Fasties,

Trying to figure out whether it’s a bug or something I missed. I’m solving a collab filtering problem using ‘CollabDataBunch’ and ‘collab_learner’.

When I run ‘get_preds’ on the validation dataset and check the predictions I get a number of predictions that is consistent with the data being fed in:
data.valid_ds.x.codes.T[0].size, data.valid_ds.x.codes.T[1].size, len(preds), len(targets)
>>> (191, 191, 191, 191)

However, when I run ‘get_preds’ on the training dataset, the last batch (which is smaller than my batch_size) in the input isn’t getting predictions. Here’s what I do:
preds, targets = learn.get_preds(ds_type=DatasetType.Train)
data.train_ds.x.codes.T[0].size, data.train_ds.x.codes.T[1].size, len(preds), len(targets)
>>> (1725, 1725, 1664, 1664)

What’s happening is that my ‘batch_size’ is 64, so the last batch has only 61 items (1725 - 1664), and no predictions are generated for it. I tested with ‘batch_size=32’ and saw the same pattern (29 predictions are missing):
(1725, 1725, 1696, 1696)
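
For anyone checking the arithmetic, here is a minimal sketch in plain Python (no fastai involved). It assumes the loader simply discards an incomplete final batch; the helper name `predictions_with_drop_last` is made up for illustration.

```python
def predictions_with_drop_last(n_items: int, batch_size: int) -> int:
    """Count the items left when an incomplete final batch is discarded."""
    full_batches = n_items // batch_size  # only complete batches are kept
    return full_batches * batch_size


for bs in (64, 32):
    kept = predictions_with_drop_last(1725, bs)
    # Print batch size, predictions produced, predictions missing
    print(bs, kept, 1725 - kept)
```

With 1725 rows this gives 1664 predictions for a batch size of 64 and 1696 for a batch size of 32, which matches the counts above.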

Did anyone else encounter this problem? Is there a solution that I’m missing, or is it likely a bug?

Kaspar · January 18, 2019, 8:33pm

If you constructed the learner with TextLMDataBunch as the dataset, then the default is to leave out the last part if it doesn’t match the batch size. I believe you can pass the argument drop_last=False to the TextLMDataBunch constructor.

gene · January 18, 2019, 11:01pm

@Kaspar Amazing! Thank you! Your answer led me to find the solution on the forums: `RNNLearner.get_preds(DatasetType.Train, ordered=True)` does not work for `TextClasDataBunch`

bfarzin · January 24, 2019, 5:34pm

I seem to be getting an even worse outcome. If I set ds_type to DatasetType.Train I miss the last batch as well, but I also get the data out of order. @gene do you get that problem also? Setting it to DatasetType.Fix resolves both problems, so I am using that now, but that is not intuitive in this context. (Like you, I got it from the forums.)
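
A rough sketch of why the two dataset types can behave differently, in plain Python rather than fastai. It assumes (my reading of fastai v1, not confirmed in this thread) that the training loader shuffles and drops the incomplete last batch, while DatasetType.Fix walks the same training data in order and keeps every batch. `iter_batches` and `flatten` are hypothetical helpers, not fastai functions.

```python
import random
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[int], batch_size: int, shuffle: bool,
                 drop_last: bool, seed: int = 0) -> Iterator[List[int]]:
    """Yield batches of row indices, mimicking a simple data loader."""
    order = list(items)
    if shuffle:
        random.Random(seed).shuffle(order)  # training-style random order
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            break  # the incomplete final batch is discarded
        yield batch


def flatten(batches: Iterator[List[int]]) -> List[int]:
    """Concatenate batches back into a single list of row indices."""
    return [row for batch in batches for row in batch]


rows = list(range(1725))
# Assumed training-loader behaviour: shuffled, last partial batch dropped
train_like = flatten(iter_batches(rows, 64, shuffle=True, drop_last=True))
# Assumed DatasetType.Fix behaviour: original order, nothing dropped
fix_like = flatten(iter_batches(rows, 64, shuffle=False, drop_last=False))

print(len(train_like), len(fix_like))
```

Under those assumptions the training-style pass returns fewer rows than the dataset holds and in a different order, while the fix-style pass returns every row in its original position, so predictions line up with the source data.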

gene · January 25, 2019, 5:12am

@bfarzin, I found a better solution (at least for my use case) where you have two options depending on your needs.

Option 1 is to pass the same dataset for training and test to collab_learner, e.g. learn = collab_learner(data, test=data, n_factors=50, pct_val=0.1), and then call preds, _ = learn.get_preds(ds_type=DatasetType.Test) after training. More details here.

Option 2 is to export the learner using learn.export() and load it with the training dataset as the test dataset: learn = load_learner(path, test=data). Then you can again use preds, _ = learn.get_preds(ds_type=DatasetType.Test). More details here.

Important: it won’t work with fastai versions below 1.0.40, so you might need to update. Hope it’s useful.