(Tyler Morgan) #1

I set up a tabular learner with 30k samples, 20k of which I set aside for validation, giving me 10k for training. When I fit the model and return the training predictions, I get a list with two tensors, both with 9984 elements, instead of the 10k I’d expect. When I run it for the validation set, I get 20k elements.

I’m using the following method calls to get the predictions for training and validation respectively:
learn.get_preds(ds_type=DatasetType.Train)
learn.get_preds(ds_type=DatasetType.Valid)

I’m not sure what’s going on. I’m assuming that the learner is dropping rows for some reason, maybe as part of the preprocessing? I’m hoping that someone can point me towards why this is happening.

The training dataloader drops the last batch if it doesn’t have bs (batch size) elements. That’s because small batches lead to instability in training, particularly with BatchNorm. With a batch of size 1 we would even get an error from PyTorch.

If you want all your training data, you can ask for DatasetType.Fix, which is the training set with shuffle=False and drop_last=False, ideal for evaluation.
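To see where 9984 comes from, here is a minimal sketch of the batching arithmetic in plain Python. It assumes a batch size of 64, which is not stated in the question; the function name samples_yielded is made up for illustration and is not part of fastai or PyTorch.

```python
def samples_yielded(n_samples: int, bs: int, drop_last: bool) -> int:
    """Count how many samples a batched loader yields in one full pass."""
    full_batches = n_samples // bs
    if drop_last:
        # The trailing partial batch (n_samples % bs items) is discarded.
        return full_batches * bs
    # Otherwise the partial batch is kept, so every sample is seen.
    return n_samples


# Assumption: bs=64 (the question does not say which batch size was used).
train_seen = samples_yielded(10_000, bs=64, drop_last=True)
fix_seen = samples_yielded(10_000, bs=64, drop_last=False)
print(train_seen, fix_seen)
```

Under that assumption, 10,000 samples make 156 full batches of 64 plus 16 leftover rows, and those 16 rows are the ones missing from the training predictions.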

(Tyler Morgan) #3

Thanks for the knowledge! I’ll give this a shot.

(Tyler Morgan) #4

Here’s some clarification for anyone else coming later: DatasetType.Fix is the training set, and it already has shuffle=False and drop_last=False. I misread the original comment as saying that I needed to use DatasetType.Fix and also set shuffle=False and drop_last=False somewhere. Just run get_preds(ds_type=DatasetType.Fix) on your learner and you’re good to go!

TIL trying to compare unshuffled actuals to shuffled predictions gives you some really terrible results :sweat_smile:
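For anyone curious how bad it gets, here is a small sketch of that pitfall in plain Python. The data is synthetic and the "model" is assumed to be perfect, so any drop in accuracy comes purely from the predictions being in a different order than the actuals.

```python
import random

# Synthetic targets: 1000 rows, 10 classes (made up for illustration).
actuals = [i % 10 for i in range(1000)]

# A hypothetical perfect model: predictions match the targets row for row.
preds_in_order = list(actuals)

# Simulate a shuffled dataloader: same predictions, different row order.
rng = random.Random(0)
order = list(range(len(actuals)))
rng.shuffle(order)
preds_shuffled = [preds_in_order[i] for i in order]


def accuracy(preds, targets):
    """Fraction of positions where prediction and target agree."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


aligned_acc = accuracy(preds_in_order, actuals)
misaligned_acc = accuracy(preds_shuffled, actuals)
print(aligned_acc, misaligned_acc)
```

The aligned comparison scores perfectly, while the shuffled one falls to roughly chance level, even though the predictions themselves are identical.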

(Avi A) #5

Thanks! This is exactly what I needed as well. Does this apply only to the data in the training set, or also to the data in the valid/test sets?


The validation and test sets aren’t shuffled.

(Avi A) #7

yes, that makes a lot of sense. Thanks Sylvain!