In statefarm, why are training batches NOT created with shuffle=False?

I'm working through the statefarm notebooks and noticed that when creating batches, only the validation batches are created with shuffle=False:

batches = get_batches(path+'train', batch_size=batch_size)
val_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=False)

Why is that?

If we're going to save the actual image arrays for both the training and validation datasets, wouldn't we want BOTH to use shuffle=False so that the labels, classes, and filenames match up when used later?
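To make the concern concrete, here's a minimal sketch of where the mismatch bites, assuming the Keras 1-era API the course uses (onehot is the small helper from utils.py):

preds = model.predict_generator(val_batches, val_batches.nb_sample)
# val_batches.classes and val_batches.filenames are stored in directory order,
# so they only correspond row-for-row with preds when shuffle=False
val_labels = onehot(val_batches.classes)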


Yes you would! The notebooks aren't always exactly the order of operations you need to follow - sometimes I jump around between cells a bit, so you do need to think about what to run when.

Thanks Jeremy! It’s nice to get confirmation that my intuition about these things is getting better.

I fell into this trap as well.

Let me just state this explicitly in case someone is searching the forums: if you're working through the statefarm.ipynb notebook and you're getting accuracies that approximate chance when building a model that uses pre-trained VGG layers (up through the last Convolution2D layer) as inputs, this is very likely your problem.
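To spell out the failure mode, here's a rough sketch of the precompute-features pattern; conv_model here is a placeholder for the VGG model truncated at its last Convolution2D layer:

batches = get_batches(path+'train', batch_size=batch_size, shuffle=False)
conv_feat = conv_model.predict_generator(batches, batches.nb_sample)
trn_labels = onehot(batches.classes)
# had batches been created with shuffle=True (the default), each row of
# conv_feat would describe a different image than the matching row of
# trn_labels, and the dense model trained on top would hover around chance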

This issue is also discussed (and answered) here:

Very happy I found this thread. I won't make this mistake again when tying two models together with a DirectoryIterator.

What I tend to do now is process all the images once and persist them as arrays (see the get_data() function in Jeremy's utils.py).
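If you don't have utils.py handy, get_data() is (roughly, from memory) just a non-shuffled, label-free iterator concatenated into a single array:

import numpy as np
from utils import get_batches  # the same helper used in the notebooks

def get_data(path, target_size=(224, 224)):
    # shuffle=False and class_mode=None: images come back in directory
    # order, with no labels interleaved
    batches = get_batches(path, shuffle=False, batch_size=1,
                          class_mode=None, target_size=target_size)
    return np.concatenate([batches.next() for i in range(batches.nb_sample)])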

From there, you can create shuffled batches using gen.flow() as needed, or just use the arrays themselves for fitting/predictions with shuffle=True.
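A minimal sketch of that workflow, assuming trn_labels/val_labels were pulled from non-shuffled iterators as above and model is whatever you're training on top:

from keras.preprocessing import image

gen = image.ImageDataGenerator()
trn_data = get_data(path+'train')
val_data = get_data(path+'valid')

# safe to shuffle now: flow() shuffles the data and labels together
trn_flow = gen.flow(trn_data, trn_labels, batch_size=batch_size, shuffle=True)
model.fit_generator(trn_flow, samples_per_epoch=trn_data.shape[0], nb_epoch=3,
                    validation_data=(val_data, val_labels))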