@jeremy Continuing our conversation from https://github.com/fastai/fastai/pull/970
Summary:
For certain unlucky combinations of batch size (bs) and training set length, training fails with the exception shown in this notebook: https://github.com/benediktbrandt/course-v3/blob/master/nbs/dl1/lesson1-pets-Copy2.ipynb (the notebook reproduces the error reliably; just make sure you run it from start to finish after restarting the kernel). The cause of the error is a batch of size 1 being fed into the network, which does not work with the batchnorm layers.
Example:
The training dataset has 65 elements and bs is set to 64. In that case the first batch fed into the network has size 64 and the next (last) batch has size 1, since 65 % 64 == 1.
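For context, the failure mode can be reproduced in isolation: a BatchNorm layer in training mode needs more than one value per channel to compute batch statistics. Below is a minimal, self-contained sketch in plain PyTorch (illustrative only, not the actual fastai code path; the feature count of 512 is an assumption):

```python
import torch
import torch.nn as nn

# A BatchNorm layer in training mode cannot compute batch statistics
# from a single value per channel.
bn = nn.BatchNorm1d(512)  # e.g. a batchnorm layer in a model head
bn.train()

bn(torch.randn(64, 512))      # batch of 64: works fine
try:
    bn(torch.randn(1, 512))   # batch of 1: fails
except ValueError as e:
    print(e)  # "Expected more than 1 value per channel when training, ..."
```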
This error can appear (almost) randomly, since the dataset is randomly split into training and validation sets and the size of each set isn't necessarily the same between subsequent instantiations of ImageDataBunch. However, one is more likely to run into this error with small batch sizes, as the rough simulation below suggests.
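To make the "small batch sizes" point concrete, here is a back-of-the-envelope simulation. It assumes each item lands in the validation set independently with probability valid_pct (roughly how the random split behaves), so len(train) varies between runs and len(train) % bs is approximately uniform over 0..bs-1; the bad residue 1 then comes up about 1/bs of the time. All numbers (7390 items, valid_pct=0.2) are illustrative:

```python
import numpy as np

# Rough simulation: how often does a random train/valid split leave a
# final training batch of size 1?
def frac_bad_splits(n_items=7390, valid_pct=0.2, bs=8, trials=100_000, seed=0):
    rng = np.random.default_rng(seed)
    # Train set size varies binomially from run to run.
    n_train = n_items - rng.binomial(n_items, valid_pct, size=trials)
    return (n_train % bs == 1).mean()

print(frac_bad_splits(bs=8))   # ~0.125, about 1 run in 8
print(frac_bad_splits(bs=64))  # ~0.016, about 1 run in 64
```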
Steps towards a solution:
This pull request https://github.com/fastai/fastai/pull/970 fixes the issue by removing the last element from the training set if the combination of bs and the length of the training set would result in a batch of size 1 (a minimal sketch of that guard follows the list below). Of course, removing a training sample isn't necessarily the best solution; however, the other solutions I considered also had downsides. In particular:
- Passing bs into the `random_split` method and rerunning the random split until `len(is_train) % bs != 1`. This has the downside of being a brute-force method, and bs isn't even available when `random_split` is called (the default bs is set in `ImageDataBunch.create`). One could specify the default value for bs earlier (e.g. in the `from_lists` method), but other methods also call `ImageDataBunch.create`, and the default value of bs would no longer be defined in a central location. One could make bs a class variable, but the current library design seems to minimize the use of class variables wherever possible.
- One could be more strategic about doing the train/validation split, i.e. as a first step partition an array of `len(arrs[0])` such that the training set size satisfies certain criteria (at the very least a final batch of size > 1, but maybe even > some number N). But this again raises the need to change where the default value of bs is set (see above).
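For concreteness, here is a minimal sketch of the guard described above (the helper name is hypothetical, not the actual fastai internals):

```python
# Sketch of the chosen fix: if the final training batch would have
# size 1, drop one sample from the training set.
def drop_lonely_sample(train_items, bs):
    if len(train_items) % bs == 1:
        return train_items[:-1]
    return train_items

items = list(range(65))                    # 65 samples, bs = 64
print(len(drop_lonely_sample(items, 64)))  # 64 -> no trailing batch of size 1
```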
@jeremy I know you also requested some tests to show the issue. If you want I can write some, but that would require adding a new dataset to the test datasets, since the mnist_tiny dataset already comes with separate train and validation folders and so never exercises the random split. Thoughts?