Get_preds returning fewer results than length of original dataset


(Tyler Morgan) #1

I set up a tabular learner with 30k samples, 20k of which I set aside for validation, leaving 10k for training. When I fit the model and get the training predictions, I get a list of two tensors, both with 9,984 elements instead of the 10k I’d expect. When I run it for the validation set I get all 20k elements.

I’m using the following method calls to get the predictions for training and validation respectively:
learn.get_preds(ds_type=DatasetType.Train)
learn.get_preds(ds_type=DatasetType.Valid)

I’m not sure what’s going on. I’m assuming the learner is dropping rows for some reason, maybe as part of the preprocessing? I’m hoping someone can point me towards why this is happening.

Any help would be appreciated.


#2

The training dataloader drops the last batch if it doesn’t have bs elements. That’s because small batches lead to instability in training, particularly with BatchNorm. With a batch of size 1 we would even get an error from PyTorch. That matches your 9,984: with a batch size of 64 (the default), 10,000 training rows give 156 full batches (9,984 rows) and the last 16 are dropped.
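
For anyone who wants to see this behaviour outside of fastai, here is a tiny plain-PyTorch sketch (the batch size of 64 is an assumption that matches the numbers above):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    ds = TensorDataset(torch.arange(10_000))
    dl = DataLoader(ds, batch_size=64, shuffle=True, drop_last=True)

    seen = sum(batch[0].shape[0] for batch in dl)
    print(len(dl), seen)  # 156 batches covering 9984 samples; the last 16 rows are dropped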

If you want all your training data, you can use DatasetType.Fix, which is the training set with shuffle=False and drop_last=False, ideal for evaluation.
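
In code, the two calls look like this (a minimal sketch for fastai v1, assuming learn is the fitted tabular Learner from above; DatasetType also comes in with the usual from fastai.tabular import *):

    from fastai.basic_data import DatasetType

    # Train loader: shuffled, drops the last partial batch -> fewer rows back
    train_preds, train_ys = learn.get_preds(ds_type=DatasetType.Train)

    # Fix loader: same training data, but shuffle=False and drop_last=False,
    # so every row comes back, in the original order
    fix_preds, fix_ys = learn.get_preds(ds_type=DatasetType.Fix)

    print(len(train_preds), len(fix_preds))  # e.g. 9984 vs 10000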


(Tyler Morgan) #3

Thanks for the knowledge! I’ll give this a shot.


(Tyler Morgan) #4

Here’s some clarification for anyone else coming along later: DatasetType.Fix is the training set and it already has shuffle=False and drop_last=False. I misread the original comment as meaning I needed to use DatasetType.Fix and then set shuffle=False and drop_last=False myself somewhere. Just run get_preds(ds_type=DatasetType.Fix) on your learner and you’re good to go!

TIL trying to compare unshuffled actuals to shuffled predictions gives you some really terrible results :sweat_smile:
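
For anyone curious, here is roughly what that mismatch looks like (a sketch for fastai v1, assuming a classification learner called learn; only the Fix ordering lines up row by row with the targets):

    shuffled_preds, _ = learn.get_preds(ds_type=DatasetType.Train)    # shuffled order, last batch dropped
    fixed_preds, fixed_ys = learn.get_preds(ds_type=DatasetType.Fix)  # original order, all rows

    n = len(shuffled_preds)
    # Misaligned: shuffled predictions vs. targets in original order -> looks terrible
    print((shuffled_preds.argmax(dim=1) == fixed_ys[:n]).float().mean())
    # Aligned: Fix predictions vs. the same targets -> the actual training accuracy
    print((fixed_preds.argmax(dim=1) == fixed_ys).float().mean())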


(Avi A) #5

Thanks! This is exactly what I needed as well. Does this work for predicting not only on the data in the training set but also on the data in the valid/test sets?


#6

The validation and test sets aren’t shuffled, so their predictions already come back in the original order.
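
So plain calls like these are enough (fastai v1 sketch; the Test call assumes a test set was attached when the DataBunch was built, and its returned targets are just placeholders since the test set is unlabelled):

    valid_preds, valid_ys = learn.get_preds(ds_type=DatasetType.Valid)  # all rows, original order
    test_preds, _ = learn.get_preds(ds_type=DatasetType.Test)           # ignore targets: test set is unlabelled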


(Avi A) #7

Yes, that makes a lot of sense. Thanks Sylvain!