I'm working through the statefarm notebooks and noticed that, when creating batches, only the validation batches are created with shuffle=False:
batches = get_batches(path+'train', batch_size=batch_size)
val_batches = get_batches(path+'valid', batch_size=batch_size*2, shuffle=False)
Why is that?
If we are going to be saving the actual image array for both training and validation datasets, wouldn’t we want BOTH to have shuffle=False so that the labels, classes, and filenames match up when used later?
Yes you would! The notebooks aren't always in exactly the order of operations you need to follow - sometimes I jump around between cells a bit, so you do need to think about what to run when.
Thanks Jeremy! It’s nice to get confirmation that my intuition about these things is getting better.
I fell into this trap as well.
Let me just state this explicitly in case someone is searching the forums: if you're working through the statefarm.ipynb notebook and you are getting accuracies that approximate chance when building a model that uses pre-trained VGG layers (up through the last Convolution2D layer) as inputs, this is very likely your problem.
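To see why the accuracy drops to chance, here is a minimal sketch (pure NumPy, with toy data standing in for conv features) of what happens when features come out of a shuffled iterator but the saved labels still follow directory order:

```python
import numpy as np

rng = np.random.RandomState(0)

# Pretend each "image" is a single number whose true label is 1 when >= 0.
images = rng.randn(1000)
labels = (images >= 0).astype(int)   # labels saved in directory order

# shuffle=True: the iterator yields images in a random order, but the
# labels we saved earlier still follow the original directory order.
perm = rng.permutation(len(images))
shuffled_images = images[perm]

# Even a perfect classifier on the shuffled features, scored against the
# unshuffled labels, does no better than chance (~50%).
preds = (shuffled_images >= 0).astype(int)
misaligned_acc = (preds == labels).mean()

# With shuffle=False the order is preserved and the same classifier
# scores perfectly.
aligned_acc = ((images >= 0).astype(int) == labels).mean()
print(misaligned_acc, aligned_acc)
```

The same logic applies to precomputed conv-layer outputs: the features are fine, but row i of the feature array no longer corresponds to row i of the label array.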
This issue is also discussed (and answered) here:
Very happy I found this thread. I won’t make this mistake with a DirectoryIterator when tying two models together again.
What I tend to do now is process all the images and persist them as arrays (see the get_data() method in Jeremy's utils.py). From there, you can create shuffled batches using gen.flow() as needed, or just use the arrays themselves for fitting/predictions and set shuffle=True.
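The key point is that once features and labels are persisted as arrays in matching order, you can shuffle safely, as long as both arrays get the same permutation. A minimal NumPy sketch (toy arrays standing in for get_data()-style output, in-memory rather than persisted to disk):

```python
import numpy as np

rng = np.random.RandomState(42)

# Stand-ins for arrays produced by an unshuffled iterator: features and
# labels are saved once, in matching (directory) order.
features = rng.randn(10, 4)   # e.g. precomputed conv features, one row per image
labels = np.arange(10)        # one label per row, same order

# Later, at training time: shuffle BOTH arrays with the SAME permutation,
# so each feature row still carries its own label.
perm = rng.permutation(len(features))
X_shuf, y_shuf = features[perm], labels[perm]

# Build minibatches from the shuffled arrays.
batch_size = 4
batches = [(X_shuf[i:i + batch_size], y_shuf[i:i + batch_size])
           for i in range(0, len(X_shuf), batch_size)]

# Check alignment: every feature row in every batch matches its label.
aligned = all((features[label] == xb[j]).all()
              for xb, yb in batches for j, label in enumerate(yb))
print(aligned, len(batches))
```

This is exactly what gen.flow() (or fit with shuffle=True) does for you internally: it permutes features and labels together, which is safe precisely because they were saved in the same order to begin with.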